elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0
781 stars 321 forks source link

Implemented Bivariate Monte Carlo Density Estimation using Mann-Whitney-P Test (MCDE MWP) #65

Closed alanmazankiewicz closed 4 years ago

alanmazankiewicz commented 4 years ago

Algorithm is based on this paper: https://dbis.ipd.kit.edu/download/MCDE19_FOUCHE_SSDBM19.pdf

A abstract class MCDEDependenceMeasure has been created as subclass of AbstractDependenceMeasure. MCDEDependenceMeasure implements the generic Monte Carlo runs in the dependence() method returning the overall result of the algorithm. Further, it implements the generic slicing process and declares some methods (computation of statistical test and index structure) to be implemented by subclasses. Generally the slicing and the final MC runs are independent of the statistical test to be used. Each subclass should therefore only implement the statistical test (returning the p-value not the test statistic) and an appropriate index structure to efficiently compute the given statistical test. As each statistical test may need specific addition values beyond an index value and an adjustment for ties using ranks() from AbstractDependenceMeasure may not be enough. MWP for instance needs an additional correction value. In order to hold the data of the index structure tuple-like RankStruct and MwpIndex have been created in utils/containers.

McdeMwpDependenceMeasure is a subclass of MCDEDependenceMeasure implementing MCDE using MWP as the test statistic.

We decided on this structure instead of passing external statistical test classes (GoodnessOfFitTest) like in HiCSDependenceMeasure because in this way it is possible to use tailored index structures to more efficiently compute the resulting statistical tests.

Appropriate Test Classes have been created.

codecov-io commented 4 years ago

Codecov Report

Merging #65 into master will increase coverage by 0.06%. The diff coverage is 79.36%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master      #65      +/-   ##
============================================
+ Coverage     49.41%   49.48%   +0.06%     
- Complexity    11437    11461      +24     
============================================
  Files          1650     1653       +3     
  Lines         81054    81180     +126     
  Branches      14519    14541      +22     
============================================
+ Hits          40056    40168     +112     
- Misses        38042    38056      +14     
  Partials       2956     2956
Impacted Files Coverage Δ Complexity Δ
.../math/statistics/dependence/DependenceMeasure.java 90% <ø> (ø) 4 <0> (ø) :arrow_down:
...atistics/dependence/AbstractDependenceMeasure.java 56.45% <ø> (ø) 17 <0> (ø) :arrow_down:
...java/elki/math/statistics/tests/mcde/MCDETest.java 100% <100%> (ø) 0 <0> (?)
...h/statistics/dependence/MCDEDependenceMeasure.java 67.94% <67.94%> (ø) 8 <8> (?)
.../java/elki/math/statistics/tests/mcde/MWPTest.java 97.77% <97.77%> (ø) 13 <13> (?)
...a/elki/index/idistance/InMemoryIDistanceIndex.java 88.43% <0%> (+1.36%) 20% <0%> (ø) :arrow_down:
...ties/optionhandling/parameters/ClassParameter.java 29.85% <0%> (+4.47%) 9% <0%> (ø) :arrow_down:
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update ef91a03...c17ac45. Read the comment docs.

alanmazankiewicz commented 4 years ago

Friendly reminder :)

kno10 commented 4 years ago

I will likely have to merge this as a single commit, because rebasing causes some conflict in almost every merge (likely because of the merges that were performed before). They all seem trivial, but will require some work. But I guess that is okay.

I have given you another round of detailed comments. Some of them I could trivially have changed when merging; like code style. The biggest thing to resolve is that generics warnings.

Would it make sense to change MCDETest.RankStruct[] into a MCDETest.RankStruct that stores an int[] instead (this would save a lot of memory, and be faster)? Similarly, the subtype MWPTest.MwpIndex could store a long[] correction and double[] adjusted. So instead of N objects (at 64+ bytes each), this would be one such object for each attribute, I guess. This should use some 16-24 x N bytes less memory, and will likely be faster to access, too.

kno10 commented 4 years ago

I have pushed an edited version to branch https://github.com/elki-project/elki/tree/feature/mcde This is not completely done yet, but maybe you can have a look at the edits I made.

alanmazankiewicz commented 4 years ago

Thank you so much for taking care of those changes. I looked at the code, tested it and merged it into my master as it seems to work. I just had to take care of a comment at one point. Let me know if you do any further changes. Please note the my two comments above.

kno10 commented 4 years ago

I was wondering if you want to also add the "higher order" version, maybe as public <A> double higherOrderDependence(NumberArrayAdapter<?, A> adapter, List<? extends A> data) for completeness. There wouldn't be alternatives for this use case right now (and its not in the DependencyMeasure interface), but someone searching for MCDE code and looking at ELKI would still find it. Thanks for the fix to the JavaDoc, I will add this. After committing this to master, I will delete this feature/mcde branch again. Afterwards I will also rename all the *DependencyMeasure classes to just Dependency, and in some cases I may also omit the Dependency part; just to make class names shorter everywhere. The Measure is a bit redundant, and unless there is a name collision (HiCS for example), I will now prefer shorter names. So DistanceCorrelationDependencyMeasure will become just DistanceCorrelation (its widely known as dCor or dCor², but that does not make a good class name). Maybe I will even use DCor. Because in the end "distance" does suit this very well, it uses one-dimensional differences.

alanmazankiewicz commented 4 years ago

What you mean by higher order version? As far as I can tell you have overridden a dependence method having the same signature public <A> double dependence(NumberArrayAdapter<?, A> adapter, List<? extends A> data).

kno10 commented 4 years ago

The method I added produces pairwise bivariate results. I think your method is intended to make higher-order correlations of multiple variates, right (not sure how they are defined though; whether it is a 1 with d-1 dimensions correlation as in HiCS?) Essentially the case where you do need the pow(alpha or beta, dim-1). Sorry, I have not yet read your paper, so I do not use the exact same terminology as you used.

alanmazankiewicz commented 4 years ago

https://github.com/alanmazankiewicz/elki/blob/c17ac459c81d9872892f3fb823350da91ce57e9e/elki-core-math/src/main/java/elki/math/statistics/tests/mcde/MWPTest.java#L58-L63

Btw. I have just seen that those lines of code are not necessary anymore, it is a relict from an older version.

alanmazankiewicz commented 4 years ago

The method I added produces pairwise bivariate results. I think your method is intended to make higher-order correlations of multiple variates, right (not sure how they are defined though; whether it is a 1 with d-1 dimensions correlation as in HiCS?) Essentially the case where you do need the pow(alpha or beta, dim-1). Sorry, I have not yet read your paper, so I do not use the exact same terminology as you used.

Yes! I was gonna ask you if you could provide us with an abstract method for computing multivariate dependences so that we can implement a multivariate version of MCDE. I will implement it using this signature public <A> double higherOrderDependence(NumberArrayAdapter<?, A> adapter, List<? extends A> data) simply in MCDEDependenceMeasure. However, it would probably be best if we first get this pull request through (as you are going to do some further changes to the classnames etc.) and then I will fetch elki again and take care of the new method?

alanmazankiewicz commented 4 years ago

One more thing I just found. The correction value of the MWPRanking has to be double[]. I checked it and in following lines the change has to be made:

https://github.com/elki-project/elki/blob/b183fa6028e429ec0eac65fb5ef6fe2994e70bb6/elki-core-math/src/main/java/elki/math/statistics/tests/mcde/MWPTest.java#L80

https://github.com/elki-project/elki/blob/b183fa6028e429ec0eac65fb5ef6fe2994e70bb6/elki-core-math/src/main/java/elki/math/statistics/tests/mcde/MWPTest.java#L94

https://github.com/elki-project/elki/blob/b183fa6028e429ec0eac65fb5ef6fe2994e70bb6/elki-core-math/src/main/java/elki/math/statistics/tests/mcde/MWPTest.java#L112

https://github.com/elki-project/elki/blob/b183fa6028e429ec0eac65fb5ef6fe2994e70bb6/elki-core-math/src/main/java/elki/math/statistics/tests/mcde/MWPTest.java#L173-L175 (in L175 the double cast is not required anymore)

kno10 commented 4 years ago

Why a double? The values accumulated in correction are all integers. Yes, if you have many of them they can overflow, but I'd rather catch this than use a double here. Because if this long overflows, the double will run out of precision much earlier, and accumulation will lose all small values and become unreliable, too.

alanmazankiewicz commented 4 years ago

Yes you are right, correction can stay a long. When I was investigating if std == 0 is possible (see responds above btw.) I just noticed that our scala implementation uses double. Regarding the overflow I trust you to handle this appropriately however you see fit.

kno10 commented 4 years ago

Closing, as the rebased version has been merged.