alchemistry / alchemlyb

the simple alchemistry library

https://alchemlyb.readthedocs.io

BSD 3-Clause "New" or "Revised" License

197 stars 51 forks source link

Sync series and dataframe in statistical_inefficieny() #199

Closed ptmerz closed 2 years ago

ptmerz commented 2 years ago

This adds a number of tests for slicing() and statistical_inefficieny():

Test that slicing() respects upper and lower time bounds (currently passes)
Test that statistical_inefficieny() respects upper and lower time bounds when it is used without series to subsample (currently passes)
Test that statistical_inefficieny() respects upper and lower time bounds when it is used with series to subsample (currently fails)
Test that first using slicing() on the data frame, then statistical_inefficieny() without time bounds yields the same results as a single call to statistical_inefficieny() with time bounds (currently fails)

It then makes sure that both the subsampling series and the dataframe are sliced when calling statistical_inefficieny(). This ensure that the series and the dataframe are in sync before the subsampled indices of the series are applied to the dataframe. This fixes the above tests.

Fixes #198

codecov[bot] commented 2 years ago

Codecov Report

Merging #199 (cc122a8) into master (72622c9) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #199   +/-   ##
=======================================
  Coverage   98.15%   98.15%           
=======================================
  Files          24       24           
  Lines        1352     1353    +1     
  Branches      295      295           
=======================================
+ Hits         1327     1328    +1     
  Misses          5        5           
  Partials       20       20

Impacted Files	Coverage Δ
src/alchemlyb/preprocessing/subsampling.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 72622c9...cc122a8. Read the comment docs.

ptmerz commented 2 years ago

I think the test could be simplified as bit using the @pytest.mark.parametrize

Agreed that this could unify the second and third added test (TestStatisticalInefficiency. test_lower_and_upper_bound_slicer and TestStatisticalInefficiency.test_lower_and_upper_bound_inefficiency). Unless we decide to rewrite TestStatisticalInefficiency. test_lower_and_upper_bound_slicer as an equivalence between statistical_inefficiency and slicing, of course.

So I'd suggest to wait until we have agreed on how to proceed with the remaining points raised, and I will implement this simplification if it's still viable!

ptmerz commented 2 years ago

@xiki-tempula @orbeckst

Thank you for your reviews. I did a first pass, which should address most of your comments. I'm copying the git log here as a summary:

Extract input sanity check in separate function
Extract test for unchanged data in separate test
Parametrize tests of statistical_inefficiency to merge cases which use series with the ones that don't
Parametrize all tests to use conservative True and False
Remove check for unchanged data that are now unnecessary (in the last equivalence test, since we now know that all operations don't change the input data)

This doesn't yet address the issue with greedy parametrization, I unfortunately did not have time to look into this yet.

ptmerz commented 2 years ago

@orbeckst , I uploaded new commits that should hopefully address your points. Note that currently, the data loading functions had to be duplicated, as they can't be normal functions and fixtures at the same time. If after resolving #206, you only use them as fixtures, you could remove this duplication.