choderalab / pymbar

Python implementation of the multistate Bennett acceptance ratio (MBAR)
http://pymbar.readthedocs.io
MIT License
240 stars 93 forks source link

robust method won't work with verbose=True #530

Open xiki-tempula opened 6 months ago

xiki-tempula commented 6 months ago
from alchemtest.gmx import load_ABFE
from alchemlyb.parsing.gmx import extract_u_nk
import pandas as pd
import pymbar

df = pd.concat([extract_u_nk(data, 310) for data in load_ABFE()['data']['complex']])
u_nk = df.sort_index(level=df.index.names[1:])
groups = u_nk.groupby(level=u_nk.index.names[1:])
N_k = [
    (len(groups.get_group(i)) if i in groups.groups else 0)
    for i in u_nk.columns
 ]
mbar = pymbar.MBAR(
    u_nk.T,
    N_k,
    verbose=False, 
    solver_protocol="robust")

works but

mbar = pymbar.MBAR(
    u_nk.T,
    N_k,
    verbose=True, 
    solver_protocol="robust")

Gives

INFO:pymbar.mbar:K (total states) = 30, total samples = 30030
---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
Cell In[48], line 1
----> 1 mbar = pymbar.MBAR(
      2             u_nk.T,
      3             N_k,verbose=True, solver_protocol="robust")

File ~/miniconda3/envs/ubar/lib/python3.10/site-packages/pymbar/mbar.py:296, in MBAR.__init__(self, u_kn, N_k, maximum_iterations, relative_tolerance, verbose, initial_f_k, solver_protocol, initialize, x_kindices, n_bootstraps, bootstrap_solver_protocol, rseed)
    294 for l in range(k):
    295     diffsum = 0
--> 296     uzero = u_kn[k, indices] - u_kn[l, indices]
    297     diffsum += np.dot(uzero, uzero)
    298     if diffsum < relative_tolerance:

File ~/miniconda3/envs/ubar/lib/python3.10/site-packages/pandas/core/frame.py:3892, in DataFrame.__getitem__(self, key)
   3890 if is_single_key:
   3891     if self.columns.nlevels > 1:
-> 3892         return self._getitem_multilevel(key)
   3893     indexer = self.columns.get_loc(key)
   3894     if is_integer(indexer):

File ~/miniconda3/envs/ubar/lib/python3.10/site-packages/pandas/core/frame.py:3950, in DataFrame._getitem_multilevel(self, key)
   3948 def _getitem_multilevel(self, key):
   3949     # self.columns is a MultiIndex
-> 3950     loc = self.columns.get_loc(key)
   3951     if isinstance(loc, (slice, np.ndarray)):
   3952         new_columns = self.columns[loc]

File ~/miniconda3/envs/ubar/lib/python3.10/site-packages/pandas/core/indexes/multi.py:2908, in MultiIndex.get_loc(self, key)
   2867 def get_loc(self, key):
   2868     """
   2869     Get location for a label or a tuple of labels.
   2870 
   (...)
   2906     1
   2907     """
-> 2908     self._check_indexing_error(key)
   2910     def _maybe_to_slice(loc):
   2911         """convert integer indexer to boolean mask or slice if possible"""

File ~/miniconda3/envs/ubar/lib/python3.10/site-packages/pandas/core/indexes/multi.py:2628, in MultiIndex._check_indexing_error(self, key)
   2623 def _check_indexing_error(self, key) -> None:
   2624     if not is_hashable(key) or is_iterator(key):
   2625         # We allow tuples if they are hashable, whereas other Index
   2626         #  subclasses require scalar.
   2627         # We have to explicitly exclude generators, as these are hashable.
-> 2628         raise InvalidIndexError(key)

InvalidIndexError: (1, array([13625,  5804, 29848,  2864,  3890,  9757,  3711, 21736, 29484,
       19307, 22064,  4321, 20775, 14660, 19313, 19594, 10850,  6805,
        5670, 12244, 26369,  2992, 24401,  5940, 21046, 24756, 22339,
       27396, 14598, 29139, 28672,  1400, 27032, 18373, 19362, 10255,
        1001,  3949, 24404, 16882, 13914, 13274,  1800, 26756, 25353,
       14357,  7717, 26710, 26259, 29397]))
mikemhenry commented 6 months ago

That is really interesting, can you share the data that reproduces this? Does it happen consistently?

xiki-tempula commented 6 months ago

The data comes from the package alchemtest, you can just pip install it.

conda create -n test -c conda-forge alchemlyb
conda activate test
pip install alchemtest
mikemhenry commented 6 months ago

Thanks! (I wasn't playing enough attention and just assumed the pandas data frame came from disk or something)

frannerin commented 3 months ago

Just happened to me and digging a bit it's fixed changing the following line in mbar.py:

uzero = u_kn[k, indices] - u_kn[l, indices]

to

uzero = u_kn.iloc[k, indices] - u_kn.iloc[l, indices]

because k, l and indices are 0-based indices coming from the execution of the builtin range() and np.arange() and not values from the index or columns of u_kn.

mrshirts commented 3 months ago

I'd be a little careful about changing lines in pymbar without checking effects on everything else. More likely to be something off with the way that alchemlyb is passing data into pymbar?

mrshirts commented 3 months ago

Let me be more precise. pymbar doesn't used pandas, so if it's a pandas error, it has to do with what is being passed into pymbar, i.e. something weird with the way the pandas dataframe is being interpreted as a numpy array as pymbar, which is not really a pymbar error, since it's expecting a numpy array - it's an issue with the way pandas are being cast as numpy arrays.

xiki-tempula commented 3 months ago

I see. That's easy to solve then I shall just convert this to a numpy array.

xiki-tempula commented 3 months ago

But it is still weird why would this only happen when verbose=True

frannerin commented 3 months ago

That whole block is inside an if verbose=True clause

xiki-tempula commented 3 months ago

I see. Then casting the input to numpy array should be good.