ELIFE-ASU / PyInform

A Python Wrapper for the Inform Information Analysis Library
https://elife-asu.github.io/PyInform
MIT License

Running with k > 2 raises "memory allocation failed" error #30

Open NealT87 opened 5 years ago

NealT87 commented 5 years ago

Any value of k above 2 passed to the transfer_entropy method raises the following error (for k = 1 or 2 it works):

```python
test = pyinform.transfer_entropy(x, y, k=3)
```

```
Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/JerusalemProject/JerusalemProject/ActionActorAnalysis.py", line 279, in <module>
    temp = pyinform.transfer_entropy(x,y,k=3)
  File "C:\Users\user\Anaconda2\envs\Python35\lib\site-packages\pyinform\transferentropy.py", line 179, in transfer_entropy
    error_guard(e)
  File "C:\Users\user\Anaconda2\envs\Python35\lib\site-packages\pyinform\error.py", line 57, in error_guard
    raise InformError(e,func)
pyinform.error.InformError: an inform error occurred - "memory allocation failed"
```

dglmoore commented 5 years ago

Hi @NealT87. Thanks for the new issue! This error, admittedly vague, usually means that the C library couldn't allocate enough memory. The amount of memory necessary depends on

  1. the base of the time series provided
  2. the history length k

Could you share the range of values in the x and y time series?
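For example, assuming x and y are the numpy arrays you passed to transfer_entropy, something like this would tell us what we need (the base is inferred from the data, roughly one more than the largest value in the series):

```python
import numpy as np

# the inferred base is (roughly) one more than the largest observed value
print(np.min(x), np.max(x))
print(np.min(y), np.max(y))
```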

NealT87 commented 5 years ago

Hi Douglas, sorry for only responding now; it has been a hectic week. Thank you for the response. The values of x were integers in range(0, 6) and the values of y were integers in range(0, 37). I used a base of 2.



silviaruiz44 commented 3 years ago

Were you able to solve this? I am running into the same problem.

dglmoore commented 3 years ago

@silviaruiz44 Thanks for reviving this issue. I suspect the problem is the range of values in your time series. If that's the case, then there are some workarounds.

If you wouldn't mind providing a sample of the source and target time series, that would be helpful for confirming the issue.

silviaruiz44 commented 3 years ago

Does the data have to be normalized or kept within a small range? Why is that?

I also have a question regarding the mutual information function. Does it depend on scaling? I calculated the mutual information of a time series against itself and got a value. When I divide the whole time series by a scalar and calculate the mutual information again, I get a different value. That is strange, because it is the same time series, just scaled. What is the interpretation or explanation of that?

Thanks in advance for your time.

dglmoore commented 3 years ago

@silviaruiz44 To the point of why the "memory allocation failed" error is happening: we use the data that you provide to construct histograms. Each bin of the histogram represents a different value that could possibly be observed in your data, and the histogram is stored in dense form.

Say we're dealing with transfer entropy from X to Y with a history length of k = 3, and that X and Y can take integer (more on that below) values between 0 and 99. Then we'd need arrays that can store 100 future states of Y, 100 past states of X, 100^4 joint past-and-future states of Y, and 100^5 states combining the past of X with the past and future of Y, for a grand total of about 1.01e10 integers counting the number of times each combination is actually observed. That requires something like 40GB of RAM, hence the allocation failure.

In principle, this information could be stored more efficiently using a sparse representation, i.e. only storing the combinations you actually observe. However, there are performance trade-offs and questions of statistical significance when you get into situations like the one above. Sometimes there are workarounds, so let me know if you are dead-set on applying these methods to data like this.
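The arithmetic is easy to reproduce. Here's a back-of-the-envelope sketch (not PyInform's API, just the bin counts described above, assuming a single past state of the source and 4-byte counters):

```python
def te_histogram_bytes(b, k, bytes_per_bin=4):
    """Rough memory estimate for the dense transfer entropy histograms."""
    bins = (b                # future states of Y
            + b              # past states of X
            + b ** (k + 1)   # joint past-and-future states of Y
            + b ** (k + 2))  # past of X with past and future of Y
    return bins * bytes_per_bin

print(te_histogram_bytes(100, 3) / 1e9)  # ~40.4 GB
```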

Now to a bigger issue: PyInform doesn't really support continuously-valued data. The data that you pass into the time series measures, e.g. transferentropy, has to be integer-valued. We're essentially estimating the probabilities of events from their frequencies in the time series, and that doesn't make much sense with continuously-valued data. There are methods for handling continuous data, but they aren't currently implemented in (Py)Inform. The documentation mentions this, but not emphatically enough (you're not the first person to run into this issue).

I'd wager that the reason the mutual information changes when you scale the values has to do with how C casts values. We use numpy internally to convert the data you provide into arrays with integer values, and numpy doesn't complain when you do something like numpy.asarray([3.0, 4.0, 5.0, 6.0], dtype=np.int32). It just happily passes the input along to C, which casts the values to integers, so you end up with [3, 4, 5, 6]. However, if you first divide the values by 2 before giving them to pyinform, the resulting array will be [1, 2, 2, 3]. You go from having 4 distinct values to only 3.
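You can see the effect directly (a minimal sketch; the values are the ones from the example above):

```python
import numpy as np

x = [3.0, 4.0, 5.0, 6.0]

# numpy silently truncates floats when asked for an integer dtype
print(np.asarray(x, dtype=np.int32))                # [3 4 5 6] -- 4 distinct values
print(np.asarray(np.array(x) / 2, dtype=np.int32))  # [1 2 2 3] -- 3 distinct values
```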

Ideally, the time series functions would raise an exception if you provide continuously-valued data; however, we haven't decided exactly how we want to handle that since it requires an additional pass over the data to check the types.

All of that said, you have a couple of options for dealing with continuous data.

Binning

PyInform provides some (primitive) methods for binning continuously-valued data. You can choose to bin using a fixed number of bins, a fixed bin size, or specified boundaries between bins. There are lots of different ways of choosing the width of the bins, e.g. the Freedman-Diaconis rule or Sturges' rule. If you are dealing with data that can be easily thought of as binary, e.g. a neuron is spiking or it isn't, then you can pick a threshold and call any value above it 1 and anything below it 0.
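For example (a minimal sketch; the bin count and threshold are arbitrary, and I'm assuming the utils.bin_series signature from the docs, which returns the binned series along with the bin width):

```python
import numpy as np
from pyinform import utils

xs = np.random.rand(100)  # continuously-valued data in [0, 1)

# fixed number of bins: map each value to an integer state in {0, 1, 2}
binned, width = utils.bin_series(xs, b=3)

# or threshold by hand: 1 if "spiking" (above 0.5), 0 otherwise
binary = (xs > 0.5).astype(np.int32)
```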

Most of the data that I deal with personally can be reasonably binned, but that's not always the case and doing so can introduce artifacts and bias. An alternative is to use the continuous data directly.

JIDT

A really good method for estimating mutual information (and transfer entropy, which is just a special case of conditional mutual information) is the Kraskov-Stögbauer-Grassberger (KSG) estimator. Unfortunately, (Py)Inform doesn't implement it at the moment because I just haven't had the time or energy to implement a KD-tree in C :smile:. If this is something that you desperately need, we can see about bumping this issue up the priority list.

In the meantime, I'd recommend considering JIDT if binning your data just won't work for what you want to do. It has just about all of the features of (Py)Inform and then some, including implementations of the KSG estimator (which JIDT calls Kraskov). It's written in Java, but it has tutorials on how to use it from Python.
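For reference, the basic pattern from JIDT's Python demos looks something like this (an untested sketch; the jar path is a placeholder, and you should double-check the class names against the JIDT documentation):

```python
import numpy as np
import jpype

# start the JVM with infodynamics.jar on the classpath (path is a placeholder)
jpype.startJVM(jpype.getDefaultJVMPath(),
               "-Djava.class.path=/path/to/infodynamics.jar")

source = np.random.rand(1000)
dest = np.random.rand(1000)

# KSG ("Kraskov") transfer entropy estimator for continuous data
TECalc = jpype.JPackage("infodynamics.measures.continuous.kraskov") \
              .TransferEntropyCalculatorKraskov
calc = TECalc()
calc.initialise(1)  # target history length k = 1
calc.setObservations(jpype.JArray(jpype.JDouble, 1)(source.tolist()),
                     jpype.JArray(jpype.JDouble, 1)(dest.tolist()))
print(calc.computeAverageLocalOfObservations())  # TE estimate in nats
```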

silviaruiz44 commented 3 years ago

Thank you so much for your answer! It helps a lot. I have one last question: how can we test the significance or accuracy of the mutual information estimates?