JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems
MIT License

GaoNaive and GaoNaiveCorrected estimators #232

Closed · kahaaga closed 1 year ago

kahaaga commented 1 year ago

This PR follows up on #230 and introduces two more Shannon differential entropy estimators from the CausalityTools dev branch: GaoNaive and GaoNaiveCorrected (Gao et al., 2015), both of which are based on estimators from Singh et al. (2003).
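For context, here is a minimal sketch of the idea behind the two estimators. This is illustrative, not the PR's implementation: the function name entropy_gao_naive is hypothetical, the kNN searches assume Neighborhood.jl / StateSpaceSets.jl as used elsewhere in this package, and the exact constants and conventions in the PR may differ.

```julia
using Neighborhood, StateSpaceSets, SpecialFunctions
using Distances: Euclidean

# Hypothetical sketch, not the package implementation: naive kNN plug-in
# estimator of Shannon differential entropy, with optional bias correction.
function entropy_gao_naive(x::AbstractStateSpaceSet; k = 1, corrected = false)
    N, D = length(x), dimension(x)
    tree = KDTree(x, Euclidean())
    # Distance from each point to its k-th nearest neighbor
    # (the point itself is excluded by the Theiler window w = 0).
    idxs, ds = bulksearch(tree, x, NeighborNumber(k), Theiler(0))
    ρ = [d[end] for d in ds]
    c = π^(D / 2) / gamma(D / 2 + 1) # volume of the unit ball in D dimensions
    # Naive plug-in estimate ĥ = -(1/N) Σᵢ log p̂(xᵢ),
    # with the kNN density estimate p̂(xᵢ) = k / (N * c * ρᵢ^D).
    h = -sum(log(k / (N * c * ρᵢ^D)) for ρᵢ in ρ) / N
    # Asymptotic bias correction (Singh et al., 2003):
    # replace log(k) with the digamma function ψ(k).
    corrected && (h += log(k) - digamma(k))
    return h # in nats
end
```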

References

Gao, S., Ver Steeg, G., & Galstyan, A. (2015, February). Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics (pp. 277-286). PMLR.

Singh, H., Misra, N., Hnizdo, V., Fedorowicz, A., & Demchuk, E. (2003). Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences, 23(3-4), 301-321.

codecov[bot] commented 1 year ago

Codecov Report

Merging #232 (17efed6) into main (458933b) will increase coverage by 0.24%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #232      +/-   ##
==========================================
+ Coverage   84.75%   85.00%   +0.24%     
==========================================
  Files          47       48       +1     
  Lines        1141     1160      +19     
==========================================
+ Hits          967      986      +19     
  Misses        174      174              
| Impacted Files | Coverage Δ |
| --- | --- |
| ..._estimators/nearest_neighbors/nearest_neighbors.jl | 100.00% <ø> (ø) |
| ...entropies_estimators/nearest_neighbors/GaoNaive.jl | 100.00% <100.00%> (ø) |


Datseris commented 1 year ago

isn't the corrected version just better? why do we need both?

kahaaga commented 1 year ago

> isn't the corrected version just better? why do we need both?

The corrected version is better. But I think for educational purposes it is nice to have both. There are also potential research questions that can be addressed, e.g. "does it really matter, for conditional independence testing, that my estimator isn't asymptotically unbiased, if I'm using some sort of null hypothesis test where the same biased estimator is applied everywhere, i.e. not only to my original data, but also to the surrogate ensemble?"
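To make that intuition concrete, here is a hypothetical sketch (the names surrogate_pvalue and mi_estimate are illustrative, not part of any package): a constant estimator bias shifts the observed statistic and the whole null distribution by the same amount, so the resulting rank, and hence the p-value, is unchanged.

```julia
using Random

# Hypothetical sketch: `mi_estimate` is a placeholder for any (possibly
# biased) dependence estimator, e.g. a kNN-based mutual information estimate.
function surrogate_pvalue(mi_estimate, x, y; nsurr = 200, rng = Random.default_rng())
    observed = mi_estimate(x, y)
    # Null ensemble: destroy the x-y association by shuffling y.
    nulls = [mi_estimate(x, shuffle(rng, y)) for _ in 1:nsurr]
    # A constant bias added to `observed` and to every element of `nulls`
    # cancels in this comparison, leaving the p-value unchanged.
    return count(>=(observed), nulls) / nsurr
end
```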

Perhaps the docstrings could be flipped: the main documentation goes with GaoNaiveCorrected, and the docstring for GaoNaive just states that it is the same estimator, but without the bias correction, and is included for educational purposes / completeness.

In principle, we don't need any estimator besides the one that performs best. But we're building a library, so I think we should include whatever published estimators exist, and then it is up to the user to determine what is useful for them.

kahaaga commented 1 year ago

I think an argument can be made that the difference between GaoNaive and GaoNaiveCorrected is analogous to the difference between, say, using ValueHistogram to compute entropy in a non-bias-corrected way (as we currently do) and applying a bias correction based on the binning (which we will probably offer at some point).
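For illustration, one well-known binning-based correction is the Miller-Madow correction, sketched here under assumptions (hypothetical function names, not the package API): it adds (m - 1)/(2N) to the plug-in estimate, where m is the number of occupied bins and N the sample size.

```julia
# Hypothetical illustration, not the ComplexityMeasures.jl API.
# Plug-in (ValueHistogram-style) entropy of a probability vector, in nats.
plugin_entropy(p) = -sum(pᵢ * log(pᵢ) for pᵢ in p if pᵢ > 0)

# Miller-Madow bias correction: add (m - 1) / (2N), where m is the number
# of occupied bins and N the total number of samples.
function miller_madow_entropy(counts::Vector{Int})
    N = sum(counts)
    m = count(>(0), counts)
    return plugin_entropy(counts ./ N) + (m - 1) / (2N)
end
```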

kahaaga commented 1 year ago

We could also just offer correct_bias as a field of GaoNaive.
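Something like this minimal sketch (field names hypothetical, not the final API):

```julia
# One struct with a flag instead of two nearly identical estimator types.
Base.@kwdef struct GaoNaive
    k::Int = 1                # number of nearest neighbors
    w::Int = 0                # Theiler window (assumed, mirroring other kNN estimators)
    correct_bias::Bool = true # apply the Singh et al. (2003) correction?
end
```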

Datseris commented 1 year ago

> We could also just offer correct_bias as a field of GaoNaive.

Yes please, that is the best option. We should also be careful for counting reasons: when we count estimators for the paper and say "we have 50 estimators", we don't want someone to object "well, but 10 of these are the same".

kahaaga commented 1 year ago

@Datseris I've merged GaoNaive and GaoNaiveCorrected into a single estimator: Gao. It has a corrected::Bool keyword, which indicates whether the bias correction should be applied. Tests and the documentation have been updated accordingly.
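For reference, usage would look something like this (a hypothetical sketch; the exact call signature may differ between package versions):

```julia
using ComplexityMeasures

x = randn(10_000)
# Assumed API: `entropy(est, x)` with the merged `Gao` estimator and its
# `corrected` keyword; the exact names in the released version may differ.
h_naive     = entropy(Gao(k = 3, corrected = false), x)
h_corrected = entropy(Gao(k = 3, corrected = true), x)
```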