alchemistry / alchemlyb

the simple alchemistry library
https://alchemlyb.readthedocs.io
BSD 3-Clause "New" or "Revised" License
199 stars 51 forks source link

Pin pandas>=2.2 #413

Open xiki-tempula opened 1 month ago

xiki-tempula commented 1 month ago

407 won't work with pandas<2.2

Gives

Cell In[3], line 1
----> 1 wf.run(estimators='MBAR')

File ~/miniconda3/envs/md-flux-workflows/lib/python3.10/site-packages/alchemlyb/workflows/abfe.py:275, in ABFE.run(self, skiptime, uncorr, threshold, estimators, overlap, breakdown, forwrev, *args, **kwargs)
    273     self.preprocess(skiptime=skiptime, uncorr=uncorr, threshold=threshold)
    274 if estimators is not None:
--> 275     self.estimate(estimators)
    276     self.generate_result()
    278 if overlap is not None and use_FEP:

File ~/miniconda3/envs/md-flux-workflows/lib/python3.10/site-packages/alchemlyb/workflows/abfe.py:447, in ABFE.estimate(self, estimators, **kwargs)
    442     logger.info("Run MBAR estimator.")
    443     warnings.warn(
    444         "From 2.2.0, n_bootstraps=50 will be the default for estimating MBAR error.",
    445         DeprecationWarning,
    446     )
--> 447     self.estimator[estimator] = MBAR(**kwargs).fit(u_nk)
    448 elif estimator == "BAR":
    449     logger.info("Run BAR estimator.")

File ~/miniconda3/envs/md-flux-workflows/lib/python3.10/site-packages/alchemlyb/estimators/mbar_.py:129, in MBAR.fit(self, u_nk)
    126 u_nk = u_nk.sort_index(level=u_nk.index.names[1:])
    128 groups = u_nk.groupby(level=u_nk.index.names[1:])
--> 129 N_k = [
    130     (
    131         len(groups.get_group(i if isinstance(i, tuple) else (i,)))
    132         if i in groups.groups
    133         else 0
    134     )
    135     for i in u_nk.columns
    136 ]
    137 self._states_ = u_nk.columns.values.tolist()
    139 if isinstance(self.initial_f_k, str) and self.initial_f_k == "BAR":

File ~/miniconda3/envs/md-flux-workflows/lib/python3.10/site-packages/alchemlyb/estimators/mbar_.py:131, in <listcomp>(.0)
    126 u_nk = u_nk.sort_index(level=u_nk.index.names[1:])
    128 groups = u_nk.groupby(level=u_nk.index.names[1:])
    129 N_k = [
    130     (
--> 131         len(groups.get_group(i if isinstance(i, tuple) else (i,)))
    132         if i in groups.groups
    133         else 0
    134     )
    135     for i in u_nk.columns
    136 ]
    137 self._states_ = u_nk.columns.values.tolist()
    139 if isinstance(self.initial_f_k, str) and self.initial_f_k == "BAR":

File ~/miniconda3/envs/md-flux-workflows/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1063, in BaseGroupBy.get_group(self, name, obj)
   1061 inds = self._get_index(name)
   1062 if not len(inds):
-> 1063     raise KeyError(name)
   1065 if obj is None:
   1066     indexer = inds if self.axis == 0 else (slice(None), inds)

KeyError: (0.0,)

For workflow.

xiki-tempula commented 1 month ago

@jaclark5 Do you think it is possible to make the current code work with both pandas 2.2 and 2.1? The quite annoy thing is that stable version of rdkit doesn't always support the latest version of pandas. You shall be able to reproduce the problem with test_fep_estimators.py::TestMBAR::test_mbar

jaclark5 commented 1 month ago

This error is likely because the columns for single lambda simulations are not tuples when calling groupby in the preceding line. Before my PR, your code was consistent in not having a tuple at all, but then there was a future warning that the input for get_groups needed to be a tuple, so I fixed that. Now that the input is a tuple like pandas wanted, it seems to be complaining that the tuple isn't recognized by the output of groupby it must be in the preceding version of panadas that if a tuple was of length one it would pull the value... that's quite frustrating.

It seems to me that the solution would be to have all lambda state column headers be tuples all the time. That would take a bit of reformatting.

xiki-tempula commented 1 month ago

@jaclark5 I do agree that ave all lambda state column headers be tuples all the time would be good but that might break a lot of people's implementation as in most cases lambda would be a single float.

jaclark5 commented 1 month ago

@xiki-tempula @orbeckst I looked at the changes to pandas.core.groupby.BaseGroupBy.get_group() and the difference between pandas v2.2 and <v2.2 is the new addition of the following code:

        keys = self.keys
        level = self.level
        # mypy doesn't recognize level/keys as being sized when passed to len
        if (is_list_like(level) and len(level) == 1) or (  # type: ignore[arg-type]
            is_list_like(keys) and len(keys) == 1  # type: ignore[arg-type]
        ):
            # GH#25971
            if isinstance(name, tuple) and len(name) == 1:
                # Allow users to pass tuples of length 1 to silence warning
                name = name[0]
            elif not isinstance(name, tuple):
                warnings.warn(
                    "When grouping with a length-1 list-like, "
                    "you will need to pass a length-1 tuple to get_group in a future "
                    "version of pandas. Pass `(name,)` instead of `name` to silence "
                    "this warning.",
                    FutureWarning,
                    stacklevel=find_stack_level(),
                )

This warning is quite annoying since as soon as you given it the tuple it requests, it undoes that... it seems like heeding this warning was premature as pandas has not yet transitioned to the internal handling of groups that they envision and are seemingly laying the ground work for. It seems my PR should be removed to allow alchemlyb to function with the greatest breadth of pandas versions and it can be reintroduced or adjusted in the future when groupby is more compatible to the requested behavior of get_group in pandas >= v2.2

I no longer think adjusting alchemlyb to use tuples in lambda state columns will resolve the issue, since the output of u_nk.index.names[1:] for a single lambda state will still be integer values in groupby and not tuples like the get_group method for pandas >= v2.2 wants.