alchemistry / alchemlyb

the simple alchemistry library
https://alchemlyb.readthedocs.io
BSD 3-Clause "New" or "Revised" License

statistical_inefficiency does not slice data frame #198

Closed ptmerz closed 2 years ago

ptmerz commented 2 years ago

Description

When using alchemlyb.preprocessing.statistical_inefficiency with a data frame, a series to calculate the statistical inefficiency on, and slicing parameters (lower, upper, step), the slicing is applied only to the series, not to the data frame. Since statistical_inefficiency then uses raw positional indices derived from the sliced series to subsample the unsliced data frame, the two are out of sync by the end of the function. This leads to results that are, at the very least, unintuitive.
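
For context, the mismatch boils down to mixing label-based slicing of the series with positional indexing of the data frame. Here is a minimal pandas-only sketch of that pattern on synthetic data (my reading of the indexing logic at play, not a call into alchemlyb internals):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a u_nk frame: 100 frames at times 0.0, 1.0, ..., 99.0.
df = pd.DataFrame({"u": np.random.rand(100)},
                  index=pd.Index(np.arange(100.0), name="time"))
series = df["u"]

lower, upper = 20, 50

# The series is sliced by time label ...
sliced_series = series.loc[lower:upper]      # times 20.0 .. 50.0

# ... but positional indices derived from it (0, 1, ..., 30) are then
# applied to the *unsliced* frame, so the result starts at time 0:
indices = np.arange(len(sliced_series))
subsampled = df.iloc[indices]
print(subsampled.index.min(), subsampled.index.max())  # 0.0 30.0, i.e. 0 and upper - lower
```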

Examples

I think that a few examples make this much easier to explain. Assume I have loaded some data, e.g.

from alchemlyb.parsing.gmx import extract_u_nk
data = extract_u_nk('lambda_0.xvg', T=298.15)

and want to use only part of this data, e.g.

lower = 1000
upper = 5000

then I would expect a few things:

  1. Slicing removes times before and after my set limits → this works ✅

    from alchemlyb.preprocessing import slicing
    sliced_data = slicing(data, lower=lower, upper=upper)
    assert all(sliced_data.reset_index()['time'] >= lower)
    assert all(sliced_data.reset_index()['time'] <= upper)
  2. Using the slicing arguments with statistical_inefficiency removes times before and after my set limits → this does not work ❌

    from alchemlyb.preprocessing import statistical_inefficiency
    subsampled_data = statistical_inefficiency(data, data[data.columns[0]], lower=lower, upper=upper)
    assert all(subsampled_data.reset_index()['time'] >= lower)
    assert all(subsampled_data.reset_index()['time'] <= upper)

    Specifically, I find that the times are shifted, i.e. that the lowest time of `subsampled_data` is 0, and the highest time is `upper - lower`.

  3. Using slicing on the data frame before running statistical_inefficiency and running statistical_inefficiency with slicing parameters lead to the same result → this does not work ❌

    
    sliced_data = slicing(data, lower=lower, upper=upper)
    subsampled_sliced_data = statistical_inefficiency(sliced_data, sliced_data[sliced_data.columns[0]])

    subsampled_data = statistical_inefficiency(data, data[data.columns[0]], lower=lower, upper=upper)

    assert (subsampled_data == subsampled_sliced_data).all(axis=None)

The problem here is, unsurprisingly, similar to the one above: `subsampled_data` starts at time 0, ends at time `upper - lower`, whereas `subsampled_sliced_data` starts at time `lower` and ends at time `upper`.

4. The issues demonstrated in cases 2 and 3 are especially problematic if we imagine a use case in which we would repeatedly sample from the same data, e.g.
```python
window_size = 1000
for window_idx in range(5):
    lower = window_idx * window_size
    upper = (window_idx + 1) * window_size
    subsampled_data = statistical_inefficiency(data, data[data.columns[0]], lower=lower, upper=upper)
    # ...
    # further analysis of subsampled_data for window here
    # ...
```

My expectation here would be to analyze a different window of the data at every iteration. The current implementation would, however, always subsample from the window 0 to 1000. (The subsampled data might differ between windows, since the statistical inefficiency is actually calculated on a different window of the series each time, but the resulting indices are applied to the same window of the data frame every time.)

5. These issues can lead to correlated samples when the step keyword is used. Imagine that we have data which is correlated such that the conservative subsampling algorithm picks every 10th frame. If we were to use statistical_inefficiency with step=10 (by chance, or because we have a feeling that we might have sampled too frequently), then statistical_inefficiency would return every frame of the first 10% of the data! So we would have lost a lot of information, and would continue working with heavily correlated data, as the sketch below illustrates.
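
To make case 5 concrete, here is a small self-contained sketch, under the assumption that the strided series looks essentially uncorrelated to the subsampler (so every one of its rows is kept); the data and numbers are made up for illustration:

```python
import numpy as np
import pandas as pd

# 1000 synthetic frames at times 0.0 .. 999.0; step=10 keeps 100 series rows.
n, step = 1000, 10
df = pd.DataFrame({"u": np.random.rand(n)},
                  index=pd.Index(np.arange(n, dtype=float), name="time"))
series = df["u"].iloc[::step]

# Assume the strided series is judged uncorrelated, so all 100 of its rows
# are kept.  Applied positionally to the unsliced frame, those indices
# (0 .. 99) select 100 *consecutive* frames:
indices = np.arange(len(series))
subsampled = df.iloc[indices]
print(subsampled.index.max())  # 99.0 -- the first 10% of the data, back to back
```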

Proposed solution

I will propose a change shortly which adds tests (in line with examples 1 to 3) and a fix for this behavior. I think the most stable fix is to call slicing on both the series and the data frame. It might be possible to calculate an offset index plus a stride to avoid slicing the data frame, but that seems like pointless overengineering, since a slicing function is readily available that guarantees the series and the data frame are always sliced in the same way.
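
For concreteness, a minimal sketch of the shape of that fix (the function name and elided body are placeholders, not the actual patch):

```python
from alchemlyb.preprocessing import slicing

def statistical_inefficiency_fixed(df, series, lower=None, upper=None, step=None):
    # Sketch only: slice the frame and the series with the *same* parameters,
    # so positional indices computed from the series line up with the rows
    # of the frame.
    df = slicing(df, lower=lower, upper=upper, step=step)
    series = slicing(series, lower=lower, upper=upper, step=step)
    # ... unchanged from here: compute the statistical inefficiency of
    # `series`, pick uncorrelated positional indices, return df.iloc[indices]
    return df
```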

PS: Maybe I've misunderstood everything and this is intended behavior. In this case, please close the PR, but I would then suggest adding some documentation (ideally also a note when calling the function) warning about this behavior 🙂

xiki-tempula commented 2 years ago

@ptmerz Many thanks for reporting this and coming up with the PR. I will have a look at it.