alchemistry / alchemlyb

the simple alchemistry library
https://alchemlyb.readthedocs.io
BSD 3-Clause "New" or "Revised" License
197 stars 51 forks source link

alchemlyb tutorial in google colab #196

Closed andresilvapimentel closed 2 years ago

andresilvapimentel commented 2 years ago

It would be nice if the developers write an alchemlyb tutorial in google colab using a simple ligand/protein complex example.

andresilvapimentel commented 2 years ago

I was trying to build a notebook using my multiple xvg files, however I did not get to convert the multiple xvg files (from each lambda) to a dataset similar to benzene_load How can I perform this task?

andresilvapimentel commented 2 years ago

In my notebook, It works fine if I use the benzene_load dataset, but I did not get to convert my personal multiple xvg files (from each lambda) into a dataset to work with my notebook.

xiki-tempula commented 2 years ago

@andresilvapimentel Hi, thank you for your interest in alchemlyb we have an experimental workflow that should do the calculation automatically and it is in the branch of https://github.com/xiki-tempula/alchemlyb/tree/workf With an API like

    >>> import os
    >>> from alchemtest.gmx import load_ABFE
    >>> from alchemlyb.workflows import ABFE
    >>> # Obtain the path of the data
    >>> dir = os.path.dirname(load_ABFE()['data']['complex'][0])
    >>> print(dir)
    'alchemtest/gmx/ABFE/complex'
    >>> workflow = ABFE(units='kcal/mol', software='Gromacs', dir=dir,
    >>>                 prefix='dhdl', suffix='xvg', T=298, out='./')
    >>> workflow.run(skiptime=10, uncorr='dhdl', threshold=50,
    >>>              methods=('mbar', 'bar', 'ti'), overlap='O_MBAR.pdf',
    >>>              breakdown=True, forwrev=10)

Do you mind have a try and give me some feedback?

andresilvapimentel commented 2 years ago

I got this error message: ModuleNotFoundError Traceback (most recent call last) in () 1 import os 2 from alchemtest.gmx import load_ABFE ----> 3 from alchemlyb.workflows import ABFE 4 # Obtain the path of the data 5 dir = os.path.dirname(load_ABFE()['data']['complex'][0])

ModuleNotFoundError: No module named 'alchemlyb.workflows'

andresilvapimentel commented 2 years ago

I am not sure if you understood what I would like to do it. I just want to convert my personal multiple xvg files (from each lambda) into a dataset (like load_benzene) to work with my notebook. How can I perform this task?

dotsdl commented 2 years ago

Hi @andresilvapimentel, can you share the block of code you are running to load your XVG files, along with the exception/error you get? It's impossible for us to tell what the issue may be without this information.

andresilvapimentel commented 2 years ago

HI @dotsdl , You asked exactly what I am asking for. I do not know how to load my xvg files. I the alchemlyb tutorial, the developers used:

from alchemtest.gmx import load_benzene from alchemlyb.parsing.gmx import extract_dHdl dataset = load_benzene() dataset

My question is how to change the line: dataset = load_benzene()

to upload my personal xvg files. So, my question is: How to convert my personal xvg files to a dataset?

dotsdl commented 2 years ago

Ah, now I see what you mean. I'll be the first to admit that we could use some clear examples in the documentation, but I believe you are referring to the example given in this section of the docs:

from alchemtest.gmx import load_benzene
from alchemlyb.parsing.gmx import extract_dHdl

dataset = load_benzene()

dhdl = extract_dHdl(dataset['data']['Coulomb'][0], 310)

dhdl.attrs['temperature']
dhdl.attrs['energy_unit']

If you run the dataset = load_benzene() line and have a look at the dataset object, you'll see that it's basically a dictionary, and dataset['data'] is a dictionary with two keys, 'Coulomb', and 'VDW'. The values for those keys are a list of paths to the example XVG files for benzene for each lambda window.

In the extract_dHdl line we parse the first such XVG file by passing its path in. So, to use your own XVG file, you need only pass your XVG file path to extract_dHdl, along with the temperature at which your simulation was sampled.

Does this help?

andresilvapimentel commented 2 years ago

Yes. It helped. Thanks.

I upload the xvg files with: from google.colab import files uploaded = files.upload()

Then, I alocated uploaded as dataset:

dataset = {'data':{'Coulomb':[],'VDW':[]}} for (k,v) in uploaded.items(): dataset['data']['Coulomb'].append(k) dataset['data']['VDW'].append(k)

Now, the issue is in the df line:

data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']] df = forward_backward_convergence(data_list, 'mbar') ax = plot_convergence(df) ax.figure.savefig('dF_t.pdf')

The error message is: ParameterError Traceback (most recent call last) in () 5 #bz = dataset['data'] 6 data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']] ----> 7 df = forward_backward_convergence(data_list, 'mbar') 8 ax = plot_convergence(df) 9 ax.figure.savefig('dF_t.pdf')

358 raise ParameterError( 359 'Warning: Should have \sum_n W_nk = 1. Actual column sum for state %d was %f. %d other columns have similar problems' % --> 360 (firstbad, column_sums[firstbad], np.sum(badcolumns))) 361 362 row_sums = np.sum(W * N_k, axis=1)

ParameterError: Warning: Should have \sum_n W_nk = 1. Actual column sum for state 0 was 18.391322. 41 other columns have similar problems

This error also shows up in the following lines:

mbar_coul.fit(u_nk_coul)

andresilvapimentel commented 2 years ago

The other error message shows up when I run the command block:

dHdl_coul = alchemlyb.concat([extract_dHdl(xvg, T=310) for xvg in dataset['Coulomb']]) ti_coul = TI().fit(dHdl_coul) dHdl_vdw = alchemlyb.concat([extract_dHdl(xvg, T=310) for xvg in dataset['VDW']]) ti_vdw = TI().fit(dHdl_vdw)

from alchemlyb.visualisation import plot_ti_dhdl ax = plot_ti_dhdl([ti_coul, ti_vdw], labels=['Coul', 'VDW'], colors=['r', 'g']) ax.figure.savefig('dhdl_TI.pdf')

The error message is:

ValueError Traceback (most recent call last) in () 10 11 from alchemlyb.visualisation import plot_ti_dhdl ---> 12 ax = plot_ti_dhdl([ti_coul, ti_vdw], labels=['Coul', 'VDW'], colors=['r', 'g']) 13 ax.figure.savefig('dhdl_TI.pdf')

/usr/local/lib/python3.7/dist-packages/alchemlyb/visualisation/ti_dhdl.py in plot_ti_dhdl(dhdl_data, labels, colors, units, ax) 95 raise ValueError( 96 'Length of labels ({}) should be the same as the number of data ({})'.format( ---> 97 len(labels), len(dhdl_list))) 98 99 if colors is None:

ValueError: Length of labels (2) should be the same as the number of data (4)

Do you know what is going on?

dotsdl commented 2 years ago

From what you shared, it looks like:

dataset = {'data':{'Coulomb':[],'VDW':[]}}
for (k,v) in uploaded.items():
dataset['data']['Coulomb'].append(k)
dataset['data']['VDW'].append(k)

data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']]

should have instead for the last line:

data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['data']['Coulomb']]

Does this fix your issue?

andresilvapimentel commented 2 years ago

Hi @dotsdl Thank you for your suggestion.

No. I did not fix it. It is even worse because it gives the error message in the previous line:

KeyError Traceback (most recent call last) in () 3 from alchemlyb.convergence import forward_backward_convergence 4 ----> 5 data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['data']['Coulomb']] 6 df = forward_backward_convergence(data_list, 'mbar') 7 ax = plot_convergence(df)

KeyError: 'data'

Yesterday, I found a couple additional issues with this uploading process. I will try a couple more things before sharing with you.

xiki-tempula commented 2 years ago

Yes. It helped. Thanks.

I upload the xvg files with: from google.colab import files uploaded = files.upload()

Then, I alocated uploaded as dataset:

dataset = {'data':{'Coulomb':[],'VDW':[]}} for (k,v) in uploaded.items(): dataset['data']['Coulomb'].append(k) dataset['data']['VDW'].append(k)

Now, the issue is in the df line:

data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']] df = forward_backward_convergence(data_list, 'mbar') ax = plot_convergence(df) ax.figure.savefig('dF_t.pdf')

The error message is: ParameterError Traceback (most recent call last) in () 5 #bz = dataset['data'] 6 data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']] ----> 7 df = forward_backward_convergence(data_list, 'mbar') 8 ax = plot_convergence(df) 9 ax.figure.savefig('dF_t.pdf')

358 raise ParameterError( 359 'Warning: Should have \sum_n W_nk = 1. Actual column sum for state %d was %f. %d other columns have similar problems' % --> 360 (firstbad, column_sums[firstbad], np.sum(badcolumns))) 361 362 row_sums = np.sum(W * N_k, axis=1)

ParameterError: Warning: Should have \sum_n W_nk = 1. Actual column sum for state 0 was 18.391322. 41 other columns have similar problems

This error also shows up in the following lines:

mbar_coul.fit(u_nk_coul)

I think this could be solved by using autoMBAR instead of MBAR from alchemlyb.estimators import AutoMBAR as MBAR

xiki-tempula commented 2 years ago

The other error message shows up when I run the command block:

dHdl_coul = alchemlyb.concat([extract_dHdl(xvg, T=310) for xvg in dataset['Coulomb']]) ti_coul = TI().fit(dHdl_coul) dHdl_vdw = alchemlyb.concat([extract_dHdl(xvg, T=310) for xvg in dataset['VDW']]) ti_vdw = TI().fit(dHdl_vdw)

from alchemlyb.visualisation import plot_ti_dhdl ax = plot_ti_dhdl([ti_coul, ti_vdw], labels=['Coul', 'VDW'], colors=['r', 'g']) ax.figure.savefig('dhdl_TI.pdf')

The error message is:

ValueError Traceback (most recent call last) in () 10 11 from alchemlyb.visualisation import plot_ti_dhdl ---> 12 ax = plot_ti_dhdl([ti_coul, ti_vdw], labels=['Coul', 'VDW'], colors=['r', 'g']) 13 ax.figure.savefig('dhdl_TI.pdf')

/usr/local/lib/python3.7/dist-packages/alchemlyb/visualisation/ti_dhdl.py in plot_ti_dhdl(dhdl_data, labels, colors, units, ax) 95 raise ValueError( 96 'Length of labels ({}) should be the same as the number of data ({})'.format( ---> 97 len(labels), len(dhdl_list))) 98 99 if colors is None:

ValueError: Length of labels (2) should be the same as the number of data (4)

Do you know what is going on?

I think this is a tricky issue, the plot_ti_dhdl assumes a certain form of input data, I might need your file to see what is the best way forward.

andresilvapimentel commented 2 years ago

Hi @xiki-tempula Thank you for your comment. I also think it is a trick issue. It seems it is related to the indexes of my dataset (and also the shapes of dhdl, dhdl_coul, dhdl_vdw, u_nk_coul and u_nk_vdw).

I also found that the data_list generated in my notebook is in a format different for what is required for the code works well. The command line: df = forward_backward_convergence(data_list, 'mbar') also gives the same convergence error message. So, the use of from alchemlyb.estimators import AutoMBAR as MBAR is not going to help.

I would like to suggest something. My recomendation is to work in a module to read the xvg files and generate a dataset with the right for of input data to avoid this issue.

I will work a little bit more. If I do not solve the issue, I will send my notebook and dataset to get further help. Which file do you need? The xvg files are big.

xiki-tempula commented 2 years ago

@andresilvapimentel So the alchemlyb is intended to function as a library instead of offering an end-to-end solution, so technically, all the components to do a convergence analysis are available. We can obviously giving advice on how to use them.

andresilvapimentel commented 2 years ago

I tried to use from alchemlyb.estimators import AutoMBAR as MBAR but It did not work as well giving the same error message about convergence.

xiki-tempula commented 2 years ago

@andresilvapimentel If the AutoMBAR cannot provide a solution, I'm afraid that it is out of the scoop of this repo. Our AutoMBAR is just a wrapper which tries different MBAR solvers to get it to work. In most cases, the AutoMBAR can find the right solver to solve the MBAR but ultimately, it is the pymbar that is doing the solving. Maybe @dotsdl could consult with the pymbar people to see what is the best way forward.

andresilvapimentel commented 2 years ago

@xiki-tempula How can I send the xvg files in order to help me with solution? What is the best way? Maybe, I am generating the xvg files different than you. Do you aggree with me?

xiki-tempula commented 2 years ago

@andresilvapimentel It is proprietary? If it is not, you could just put it in the comment section. Like this untitled folder.zip

andresilvapimentel commented 2 years ago

@xiki-tempula It is my own xvg files: dhdl.zip

I would be very thankful if you can solve the issue.

xiki-tempula commented 2 years ago

@andresilvapimentel Ok, I guess you need the devel version, follow the guide in https://alchemlyb.readthedocs.io/en/latest/install.html#installing-from-source

I could get the result with the devel version by

import glob
from alchemlyb.parsing.gmx import extract_u_nk
from alchemlyb.convergence import forward_backward_convergence
u_nk_list = [extract_u_nk(file, 300) for file in glob.glob('*.xvg')]
df = forward_backward_convergence(u_nk_list, 'autombar')

Note that your case seems to be very difficult to solve, so the calculation will take a long time.

andresilvapimentel commented 2 years ago

The u_nk_list used by you has the same form of the data_list = [extract_u_nk(xvg, T=310) for xvg in dataset['Coulomb']] that I used, so it gave the same error message using the mbar.

I did not use the autombar because I need to install it. I did not try yet.

andresilvapimentel commented 2 years ago

It did not converge after 1 h when the devel version was intalled.

WARNING: Did not converge to within specified tolerance. max_delta = 6.668503e-09, tol = 1.000000e-12, maximum_iterations = 10000, iterations completed = 9999

andresilvapimentel commented 2 years ago

As I did not solve the issue, I run the tutorial of the mdpow program using the benzene example to get the FEP data (from the xvg files). Then, I uploaded these xvg files into the alchemlyp notebook to try running this benzene example there. I got the following errors: 1) with the command df = forward_backward_convergence(data_list, 'mbar')

LinAlgError Traceback (most recent call last) in () ----> 1 df = forward_backward_convergence(data_list, 'mbar')

<__array_function__ internals> in eigh(*args, **kwargs) [/usr/local/lib/python3.7/dist-packages/numpy/linalg/linalg.py](https://localhost:8080/#) in _raise_linalgerror_eigenvalues_nonconvergence(err, flag) 92 93 def _raise_linalgerror_eigenvalues_nonconvergence(err, flag): ---> 94 raise LinAlgError("Eigenvalues did not converge") 95 96 def _raise_linalgerror_svd_nonconvergence(err, flag): LinAlgError: Eigenvalues did not converge 2) with the command mbar_coul.fit(u_nk_coul) LinAlgError Traceback (most recent call last) [](https://localhost:8080/#) in () 1 mbar_coul = MBAR() ----> 2 mbar_coul.fit(u_nk_coul) [/usr/local/lib/python3.7/dist-packages/numpy/linalg/linalg.py](https://localhost:8080/#) in _raise_linalgerror_eigenvalues_nonconvergence(err, flag) 92 93 def _raise_linalgerror_eigenvalues_nonconvergence(err, flag): ---> 94 raise LinAlgError("Eigenvalues did not converge") 95 96 def _raise_linalgerror_svd_nonconvergence(err, flag): LinAlgError: Eigenvalues did not converge 3) with the command mbar_coul = MBAR().fit(u_nk_coul) Warning: BAR is likely to be inaccurate because of poor overlap. Improve the sampling, or decrease the spacing betweeen states. For now, guessing that the free energy difference is 0 with no uncertainty. LinAlgError Traceback (most recent call last) [](https://localhost:8080/#) in () 3 bar_coul = BAR().fit(u_nk_coul) 4 bar_vdw = BAR().fit(u_nk_vdw) ----> 5 mbar_coul = MBAR().fit(u_nk_coul) <__array_function__ internals> in eigh(*args, **kwargs) [/usr/local/lib/python3.7/dist-packages/numpy/linalg/linalg.py](https://localhost:8080/#) in _raise_linalgerror_eigenvalues_nonconvergence(err, flag) 92 93 def _raise_linalgerror_eigenvalues_nonconvergence(err, flag): ---> 94 raise LinAlgError("Eigenvalues did not converge") 95 96 def _raise_linalgerror_svd_nonconvergence(err, flag): LinAlgError: Eigenvalues did not converge It is weird this warning message (Warning: BAR is likely to be inaccurate because of poor overlap. Improve the sampling, or decrease the spacing betweeen states. For now, guessing that the free energy difference is 0 with no uncertainty.) because the tutorial should not have this issue. Can you help me, please? I would appreciate if you can help me because I do not have any idea how to solve this issue. Thanks.
orbeckst commented 2 years ago

The benzene example for MDPOW is almost certainly not converged, the simulations are too short.

I would suggest you start out with a simple system where you know that you can get converged data. This can be benzene but you should run it for long enough (probably a few ns per window — but ultimately you should do some convergence analysis). Or use the GROMACS FEP tutorial. In any case, you should have a test case that you trust.

Manually calculating the free energy difference over the whole data set as you did in (2) and (3) is a good start. Try different estimators. Adjust convergence parameters. Then do the same manually including only half the data. See if anything breaks in between.

Ultimately, alchemlyb is not a plug-and-play solution. It provides building blocks and you need to develop an understanding of what the building blocks can do. I suggest reading recent papers on FEP calculations (Justin Lemkuhl's GROMACS tutorial (1) and the best practices paper (2).

  1. J. Lemkul. From proteins to perturbed Hamiltonians: A suite of tutorials for the Gromacs-2018 molecular simulation package [article v1.0]. Living Journal of Computational Molecular Science, 1(1):5068–, 10 2018. DOI: 10.33011/livecoms.1.1.5068
  2. Antonia S J S Mey, Bryce K Allen, Hannah E Bruce Macdonald, John D Chodera, David F Hahn, Maximilian Kuhn, Julien Michel, David L Mobley, Levi N Naden, Samarjeet Prasad, Andrea Rizzi, Jenke Scheen, Michael R Shirts, Gary Tresadern, Huafeng Xu. Best Practices for Alchemical Free Energy Calculations [Article v1.0].. Living J Comput Mol Sci 2 (1) 2020, . [PubMed:34458687] [DOI]
mrshirts commented 2 years ago

@orbeckst @xiki-tempula , I've started working on moving testing different solutions within MBAR at https://github.com/choderalab/pymbar/pull/442. Would be good to get any comments on what you would suggest.

orbeckst commented 2 years ago

As a note: Google Colab is Python version 3.7.13 (default, Apr 24 2022, 01:04:09) (as of this writing). alchemlyb will likely drop Python 3.7 support soon (in line with NEP29, see e.g. PR #214 ) to keep up with dependencies such as Pandas (supports 3.8 – 3.10 in their latest release).

Therefore, unless Google updates Colab, it seems pointless to create tutorials in Colab if we cannot use the latest version of alchemlyb.

orbeckst commented 2 years ago

Minimum requirement for alchemlyb is 3.8 so it won't run in Colab until Google changes the version of Python.

Please re-open once Colab has been modernized.