`block_entropy()` does not match `scipy.stats.entropy()` when k = 1, as expected

ciaranmci commented 1 year ago

N-gram entropy should reduce to Shannon entropy for k = 1
"...take as an argument the base of the state and use that as the base of the logarithm...when k is 1"

Taking the example given in the documentation:

a = [0,0,1,1,1,1,0,0,0]
base_a = len(set(a))
pyinform.block_entropy(a, k = 1)
# 0.9910760598382222
scipy.stats.entropy(a, base = base_a ) 
# 2.0

What have I got wrong?

jakehanson commented 1 year ago

Scipy is interpreting a as a probability distribution whereas Pyinform is interpreting a as a time series. The probability distribution corresponding to a is [5/9, 4/9] so if you feed that into Scipy you will get the same entropy.

To convert a time series into a probability distribution using Pyinform you use the Dist class, which is a histogram of your time series observations. Then, you use the dump method to convert your histogram into a probability distribution.

from pyinform.dist import Dist

a = [0,0,1,1,1,1,0,0,0]  # time series 
base_a = len(set(a))

a_dist = dist(base_a) # initialize dist over support of size 2

## Count observations in each bin
for obs in a:
    a_dist.tick(obs)

## View probability distribution
a_dist.dump()
# [0.55556, 0.44444] 

pyinform.block_entropy(a, k = 1)
# 0.9910760598382222
scipy.stats.entropy(a_dist.dump(), base = base_a ) 
# 0.9910760598382222

Pyinform is intentionally built around time series observations because it is designed for ease of use with empirical data, but behind the scenes time series are almost always converted to distributions using the Dist class.

ciaranmci commented 1 year ago

Ah, got it. That is very clear and well explained. Thanks, @jakehanson

ciaranmci commented 1 year ago

Just a small note about a typo for anyone who references this, in the future: the dist(...) call should be Dist(...).

Specifically, this...

from pyinform.dist import Dist
# ...
a_dist = dist(base_a) # initialize dist over support of size 2
# ...

...should be...

from pyinform.dist import Dist
# ...
a_dist = Dist(base_a) # initialize Dist over support of size 2
# ...

ELIFE-ASU / PyInform

`block_entropy()` does not match `scipy.stats.entropy()` when k = 1, as expected #41