arthurpessa / ordpy

A Python package for data analysis with permutation entropy and ordinal network methods.

2D computation crashes with embedding 4 and more #3

Closed: yaseryacoob closed this issue 2 years ago

yaseryacoob commented 2 years ago

For images, the computation for any embedding above 3 gets killed due to memory consumption, even when the image is just 100x100 with embedding (4, 4) and lags 4 and 4. Any idea why?

arthurpessa commented 2 years ago

Which function are you specifically using? I've run the permutation_entropy() function with embedding parameters dx = dy = 4 and had no problems. The code I ran was:

ordpy.permutation_entropy(np.random.normal(size=(100,100)), dx=4, dy=4, taux=3)
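(This line assumes the usual imports, i.e. import numpy as np and import ordpy.)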

yaseryacoob commented 2 years ago

I am running complexity_entropy(np.random.normal(size=(100,100)), dx=4, dy=4, taux=3,tauy=4)

arthurpessa commented 2 years ago

I think I've found the problem. Just a minute.

yaseryacoob commented 2 years ago

Great. One more thing: all embeddings above 4 fail with the following (in this case 7):

hc_paintings = np.asarray([ permutation_entropy(gray_pollock, dx=7, dy=7, taux=7, tauy=7 )])

File "/vulcanscratch/yaser/SMART/ordpy/ordpy/ordpy.py", line 567, in permutation_entropy
    smax = np.log2(np.math.factorial(dx*dy))
TypeError: loop of ufunc does not support argument 0 of type int which has no callable log2 method

arthurpessa commented 2 years ago

What happens is that the number of accessible states (permutations) is combinatorially explosive, and this leads to a memory overload. For instance, dx = dy = 4 leads to 16! = 20,922,789,888,000 different possible permutations.

In order to calculate statistical complexity, we need to generate arrays which contain the probability of occurrence for all these different possible permutations (even when probabilities are zero) since this measure relies on the difference between the observed distribution and the uniform distribution.
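As a rough back-of-the-envelope sketch (assuming one 8-byte float per possible permutation), such an array would be far beyond any workstation's memory:

import math

n_states = math.factorial(4 * 4)        # 16! = 20,922,789,888,000 possible permutations
bytes_needed = 8 * n_states             # one float64 probability per permutation
print(f"{bytes_needed / 1e12:.0f} TB")  # about 167 TB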

We faced these same problems a while back. We advise you to be mindful of the combinatorially explosive number of accessible states when using the tools implemented in our package. How much this matters depends on the function you use.

If you find any other problem or have further questions, just write to us.

yaseryacoob commented 2 years ago

Thanks for the attention and the explanation. I was hoping it would be computationally feasible to support such a calculation: the permutations that actually occur are obviously very sparse within the set of all possible permutations, so essentially one would compute only the permutations that are valid with respect to the image. My frustration is that 4x4 is typically too small for the entropy of large images.

Anyway, I appreciate you sharing the code and interacting with me on the issues.

arthurpessa commented 2 years ago

You're welcome. If we find a solution to this we will implement it.

Best regards, Arthur Pessa.

hvribeiro commented 2 years ago

Dear Yaser,

There is a workaround for this issue. Please use the following replacements for the functions permutation_entropy and complexity_entropy:

import numpy as np
from ordpy import ordinal_distribution  # imports needed if these replacements are used outside ordpy.py

def permutation_entropy(data, dx=3, dy=1, taux=1, tauy=1, base=2, normalized=True, probs=False, tie_precision=None):
    if not probs:
        # only the observed patterns are kept (return_missing=False), so the array stays small
        _, probabilities = ordinal_distribution(data, dx, dy, taux, tauy, return_missing=False, tie_precision=tie_precision)
    else:
        probabilities = np.asarray(data)
        probabilities = probabilities[probabilities>0]

    n = float(np.math.factorial(dx*dy))

    if normalized==True and base in [2, '2']:        
        smax = np.log2(n)
        s    = -np.sum(probabilities*np.log2(probabilities))
        return s/smax

    elif normalized==True and base=='e':        
        smax = np.log(n)
        s    = -np.sum(probabilities*np.log(probabilities))
        return s/smax

    elif normalized==False and base in [2, '2']:
        return -np.sum(probabilities*np.log2(probabilities))

    else:
        return -np.sum(probabilities*np.log(probabilities))

def complexity_entropy(data, dx=3, dy=1, taux=1, tauy=1, probs=False, tie_precision=None):
    n = float(np.math.factorial(dx*dy))

    if probs==False:
        _, probabilities = ordinal_distribution(data, dx, dy, taux, tauy, return_missing=False, tie_precision=tie_precision)   
        h                = permutation_entropy(probabilities[probabilities>0], dx, dy, taux, tauy, probs=True, tie_precision=tie_precision)
    else:
        probabilities = np.asarray(data)
        h             = permutation_entropy(probabilities[probabilities>0], dx, dy, taux, tauy, probs=True, tie_precision=tie_precision)

    # states with zero probability are not stored, so account for them analytically
    n_states_not_occurring = n-len(probabilities)
    uniform_dist = 1/n

    p_plus_u_over_2      = (uniform_dist + probabilities)/2
    # each non-occurring state contributes -(u/2)*log(u/2) to the entropy of (P + U)/2
    s_of_p_plus_u_over_2 = -np.sum(p_plus_u_over_2*np.log(p_plus_u_over_2)) - (0.5*uniform_dist)*np.log(0.5*uniform_dist)*n_states_not_occurring

    probabilities = probabilities[probabilities!=0]
    s_of_p_over_2 = -np.sum(probabilities*np.log(probabilities))/2
    s_of_u_over_2 = np.log(n)/2.

    js_div_max = -0.5*(((n+1)/n)*np.log(n+1) + np.log(n) - 2*np.log(2*n))    
    js_div     = s_of_p_plus_u_over_2 - s_of_p_over_2 - s_of_u_over_2

    return h, h*js_div/js_div_max

I have tested it and it seems to work up to dx = dy = 13.

image = np.random.normal(size=(100,100))
complexity_entropy(image, dx=13, dy=13, taux=3, tauy=4)

Larger values yield an error when the result of np.math.factorial is converted to a float (OverflowError: int too large to convert to float).
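For a concrete check (a quick sketch of the float64 limit, which caps out near 1.8e308):

import math

float(math.factorial(13 * 13))  # about 4e304, still representable as a 64-bit float
float(math.factorial(14 * 14))  # raises OverflowError: int too large to convert to float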

We are about to update ordpy with new functions and we will implement this correction (or maybe a better one) in the near future.

Best, Haroldo.

yaseryacoob commented 2 years ago

Thanks Haroldo. It indeed works as you modified it, which is great. What is interesting is that the entropy goes down significantly as the embedding increases (maybe only for the Pollock image, so I will check beyond it). More oddly, the complexity becomes equal to the permutation entropy.

All interesting, as my intuition is low at this point.

Many thanks!

hvribeiro commented 2 years ago

Hi Yaser,

Indeed, $C \approx H$ as the number of possible states increases. This happens because the ordinal distribution becomes very far from the uniform distribution (only a few states have non-null probability and most are zero), and so the disequilibrium approaches its maximum value. You can see this in Eqs. (9) and (10) of ordpy's paper.
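Schematically, in the same notation as the return statement of the workaround above (a paraphrase, not the exact expressions from the paper):

$$C[P] = H[P]\,\frac{D(P,U)}{D_{\max}}, \qquad D(P,U) \to D_{\max} \implies C[P] \to H[P].$$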

Best wishes, Haroldo.

yaseryacoob commented 2 years ago

The division by $D_{\max}$ in Eq. (9) is what confuses me; it obviously explodes with the squared embedding dimension. In essence, the divergence is somewhat artificially driven to a low level here. Does it make sense to replace the "maximum possible value of $D(P,U)$" by the "total number of observed unique ordinal values"? I can see this becoming an issue in edge cases, but I am not sure what else could be used. Any ideas?

hvribeiro commented 2 years ago

I am not sure about your point, and I would be worried about what kind of information you get from the ordinal probability distribution when considering large embedding dimensions. The ordinal patterns are likely to be unique in this limit, so the ordinal probability distribution is basically counting the number of states you get (which in turn is the number of partitions over the image). Maybe we can continue this conversation by email (hvr@dfi.uem.br), where you can tell us more about what you are trying to do.
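For a sense of scale, here is a rough sketch using the 100x100 example above with dx = dy = 13, taux = 3, tauy = 4, assuming the number of sliding windows is (100 - 12*3)*(100 - 12*4) and that every window yields a distinct pattern:

import math

n_windows = (100 - (13 - 1) * 3) * (100 - (13 - 1) * 4)  # 64 * 52 = 3328 image partitions
h_if_all_unique = math.log(n_windows) / math.log(math.factorial(13 * 13))
print(n_windows, round(h_if_all_unique, 3))  # 3328 and about 0.012: the normalized entropy is tiny by construction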

arthurpessa commented 2 years ago

Dr. @yaseryacoob, we thank you for using our package and for helping to improve it. We just uploaded a new version of our package, version 1.0.7, in which we changed all three complexity functions to speed up calculations with large values of dx and/or dy. There will still be problems for too large values of dx*dy, but we have found this version works better (and faster) than the previous one.

Best Regards, Arthur Pessa.