ELIFE-ASU / PyInform

A Python Wrapper for the Inform Information Analysis Library
https://elife-asu.github.io/PyInform
MIT License

Concept of Time Series #40

Closed jungla88 closed 1 year ago

jungla88 commented 1 year ago

Hi,

It is not clear to me whether a method that requires a probability distribution automatically estimates the empirical distribution when it is given a sequence as input. For example, mutual information takes two np.array arguments, but I could not find where the empirical distribution is estimated. I also looked through the C backend, but again found nothing useful. Could you provide some information about this process? I am asking because I am getting InformError: an inform error occurred - "negative state in timeseries". I read a previous issue about this error, and the answer there was to use coalesce_series, but I am not sure whether it is correct to apply it to a continuous time series.

jakehanson commented 1 year ago

Hi Jungla,

Distributions are built from time series using the Dist class.

If you have a time series with continuous values, you will have to bin it since mutual info requires a discrete state space. You can bin the time series in several different ways: https://elife-asu.github.io/PyInform/utils.html?highlight=binning#module-pyinform.utils.binning.

My recommendation would be to bin the time series into a fixed number of states first, then use coalesce_series to get rid of negative values. You will also want to check how sensitive your final result is to the number of bins you choose.
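To make the bin-sensitivity check concrete, here is a minimal NumPy-only sketch. It does not call PyInform: `bin_equal_width` and `plugin_mutual_info` are hypothetical helpers written for illustration, and the synthetic data is an assumption. The sketch only shows the general idea of building an empirical joint distribution from two binned series and watching how the estimate moves as the number of bins changes.

```python
import numpy as np


def bin_equal_width(x, n_bins):
    """Map real values to integer states 0..n_bins-1 using equal-width bins."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # A constant series has a single state.
        return np.zeros(x.shape, dtype=int)
    states = np.floor((x - lo) / (hi - lo) * n_bins).astype(int)
    # The maximum value would land in bin n_bins, so fold it into the last bin.
    return np.minimum(states, n_bins - 1)


def plugin_mutual_info(xs, ys, base=2.0):
    """Plug-in mutual information estimated from empirical joint frequencies."""
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    # Count co-occurrences of each (x state, y state) pair.
    joint = np.zeros((xs.max() + 1, ys.max() + 1))
    np.add.at(joint, (xs, ys), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of xs, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal of ys, shape (1, ny)
    nz = joint > 0                         # skip empty cells to avoid log(0)
    ratio = joint[nz] / (px @ py)[nz]
    return float(np.sum(joint[nz] * np.log(ratio)) / np.log(base))


# Synthetic, correlated real-valued series (an assumption for the demo).
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x + rng.normal(size=5000)

# Repeat the estimate for several bin counts and compare.
mi_by_bins = {
    b: plugin_mutual_info(bin_equal_width(x, b), bin_equal_width(y, b))
    for b in (2, 4, 8, 16, 32)
}
for b, mi in mi_by_bins.items():
    print(f"{b:>2} bins: {mi:.3f} bits")
```

If the estimate keeps changing substantially as the bin count grows, the result probably reflects the discretization more than the data, which is the sensitivity worth checking before trusting a single number.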

jungla88 commented 1 year ago

Hi Jake,

Thank you so much for your reply. In fact, I had already binned the series exactly as you suggest. My only concern is coalesce_series: I am not totally sure, but I think binning already removes any negative values, since it maps continuous values onto a discrete state space of non-negative integers. Please correct me if I am wrong.

Furthermore, should I apply Dist to the output of the binning step before passing it to a method that computes an information-based metric, e.g. mutual_info? More precisely: given a raw time series ts of real values, which is the correct procedure for pyinform.mutualinfo.mutual_info?

1) binning (and/or coalesce_series) -> dist -> mutual_info
2) binning (and/or coalesce_series) -> mutual_info

jakehanson commented 1 year ago

Hi Jungla,

Method 2 is correct, as the distribution is built from the time series automatically.

Also, it looks like you are correct about not needing to run coalesce_series in addition to binning.

This means the correct process is just:

binning -> mutual_info
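
As a rough illustration of why binning alone is enough, the NumPy sketch below bins a series of mostly negative real values and checks that the resulting states are already non-negative integers. It does not use PyInform: `np.digitize` stands in for the library's binning, and `np.unique(..., return_inverse=True)` is only an analogue of what coalescing does (relabeling the observed states as 0..k-1). The data and the choice of 5 bins are assumptions.

```python
import numpy as np

# Real-valued series with mostly negative values (synthetic, for the demo).
rng = np.random.default_rng(1)
ts = rng.normal(loc=-3.0, scale=1.0, size=1000)

n_bins = 5
edges = np.linspace(ts.min(), ts.max(), n_bins + 1)
# Digitizing against the interior edges yields labels 0..n_bins-1,
# regardless of the sign of the raw values.
binned = np.digitize(ts, edges[1:-1])

# Analogue of coalescing: relabel the states that actually occur as 0..k-1.
states, coalesced = np.unique(binned, return_inverse=True)

print("raw minimum:      ", ts.min())
print("binned range:     ", binned.min(), "to", binned.max())
print("coalesced range:  ", coalesced.min(), "to", coalesced.max())
print("relabeling no-op: ", np.array_equal(binned, coalesced))
```

When every bin is occupied the relabeling changes nothing. When some bins are empty it only compacts the labels, which does not alter the empirical frequencies of the states that do occur.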