The problem with pymbar analysis.

9527567 commented 3 months ago

Perhaps I should reopen an issue and describe it here.

egallicc commented 3 months ago

I have never used pymbar. However, here are some comments/suggestions:

MBAR requires the bias energy to be expressed in kT units. For example:

def bias_fcn(epert, beta, lam1, lam2, alpha, u0, w0):
# ALPHA, U0, W0COEFF
# Initialize ebias1 with zeros
ebias1 = np.zeros_like(epert)

# Check if alpha is greater than 0
if alpha > 0:
    ee = 1 + np.exp(-alpha*(epert - u0))
    ebias1 = (lam2 - lam1) * np.log(ee) / alpha

return beta * ( ebias1 + lam2 * epert + w0 )

where beta = 1/(kB T). AToM's .out files express energies in kcal/mol units, so for T=300 K, beta = 1./(1.986e-3*300.0) for all samples and states.

Multi-state free energy inference is based on calculating the bias energy of all samples at all states. I think your code does it only for the state in which the sample was collected. If the perturbation energy samples are collected in usamples, something like this would calculate the bias matrix u_kn required by pymbar.MBAR():
```
nstates = len(lambda1)
u_kn = np.zeros((nstates, len(usamples)))
for k in range(nstates):
  u_kn[k] = bias_fcn(usamples, bet[k], lambda1[k], lambda2[k], alpha[k], u0[k], w0[k])
```
The second argument of pymbar.MBAR(), N_k is the number of samples collected at each state k. See below.
The free energy analysis is performed for the two ATM legs separately and the binding free energy is the difference of the corresponding free energies. In your case, leg1's states correspond to DIRECTION > 0 (the first 11 states) and leg2's to DIRECTION < 0 (the second 11 states). So, for example, this will give you the lambda1's values for leg1:
```
lambda1 = np.array(config['LAMBDA1'], dtype=np.float64)[0:11]
```
And something like this would collect only leg1's samples from the .out files:
```
def read_u_energy(name: str, n_replicas: int = 22):
usamples = []
n_k = [ 0 for i in range(    int(n_replicas/2)  ) ]
for k in range(n_replicas):
    lines = Path(f'r{k}/{name}.out').read_text().split('\n')
    for line in lines:
        values = line.split()
        if len(values) > 0:
            stateid = int( values[0] )
            direction = float( values[2] )
            if direction > 0:
                usamples.append( float( values[9] ) )
                n_k[ stateid ] += 1
return ( np.array(usamples,dtype=np.float64), np.array(n_k,np.int32) )
```
Call it as:
```
(usamples, N_k) = read_u_energy(args.name, 22)
```
The DIRECTION value is stored in the third column. Only samples with DIRECTION > 0 are loaded. The function returns the perturbation energy samples and the number of samples collected at each state. Note that all replicas' .out files are scanned because leg1's samples can be found in any of the 22 replicas, even those started in leg2's states. Also, for production, consider discarding the first portion of the data for equilibration. For example, do for line in lines[20:] above to discard the first 20 samples of each replica.

With the above, I think this would calculate the free energy difference for leg1:

mbar = pymbar.MBAR(u_kn = u_kn, N_k=N_k)
print( mbar.compute_free_energy_differences() )

9527567 commented 2 months ago

Thank you, I have modified the analysis code.

import numpy as np
import argparse
import sys
from configobj import ConfigObj
from pathlib import Path
import pymbar
import pandas as pd
parser = argparse.ArgumentParser()
parser.add_argument('-n', '--name', required=True, help='name')
parser.add_argument('-c', '--config', required=True, help='config file')
parser.add_argument('-s', '--start', required=False,
                    help='start sample', default=0)
parser.add_argument('-e', '--end', required=False,
                    help='end sample', default=sys.maxsize)

args = parser.parse_args()

def bias_fcn(epert, beta, lam1, lam2, alpha, u0, w0):
    # ALPHA, U0, W0COEFF
    # Initialize ebias1 with zeros
    ebias1 = np.zeros_like(epert)

    # Check if alpha is greater than 0
    if alpha > 0:
        ee = 1 + np.exp(-alpha*(epert - u0))
        ebias1 = (lam2 - lam1) * np.log(ee) / alpha

    return beta * (ebias1 + lam2 * epert + w0)

def read_u_energy(name: str, n_replicas: int = 22, start: int = 0, end: int = sys.maxsize):
    lig1_usamples = []
    lig2_usamples = []
    n_k1 = [0 for i in range(int(n_replicas/2))]
    n_k2 = [0 for i in range(int(n_replicas/2))]

    for k in range(n_replicas):
        lines = Path(f'r{k}/{name}.out').read_text().split('\n')
        for line in lines[start: end]:
            values = line.split()
            if len(values) > 0:
                stateid = int(values[0])
                direction = float(values[2])
                if direction > 0:
                    lig1_usamples.append(float(values[9]))
                    n_k1[stateid] += 1
                else:
                    lig2_usamples.append(float(values[9]))
                    n_k2[stateid-11] += 1
    return (np.array(lig1_usamples, dtype=np.float64), np.array(lig2_usamples, dtype=np.float64), np.array(n_k1, np.int32), np.array(n_k2, np.int32))

def read_config(config_file: str):
    config = ConfigObj(config_file)
    return config

if __name__ == "__main__":
    config = read_config(args.config)

    lambda1 = np.array(config['LAMBDA1'], dtype=np.float64)
    lambda2 = np.array(config['LAMBDA2'], dtype=np.float64)
    lambdas = np.array(config['LAMBDAS'], dtype=np.float64)

    alpha = np.array(config['ALPHA'], dtype=np.float64)
    u0 = np.array(config['U0'], dtype=np.float64)
    w0 = np.array(config['W0COEFF'], dtype=np.float64)
    (lig1_usamples, lig2_usamples, N_k1, N_k2) = read_u_energy(
        args.name, 22, int(args.start), int(args.end))
    nstates = 11
    u_kn1 = np.zeros((nstates, len(lig1_usamples)))
    u_kn2 = np.zeros((nstates, len(lig2_usamples)))

    for k in range(nstates):
        u_kn1[k] = bias_fcn(lig1_usamples, 1./(1.986e-3*300.0),
                            lambda1[k], lambda2[k], alpha[k], u0[k], w0[k])
    for k in range(11, 22):
        u_kn2[k-11] = bias_fcn(lig2_usamples, 1./(1.986e-3*300.0),
                               lambda1[k], lambda2[k], alpha[k], u0[k], w0[k])
    u_nk1 = pd.DataFrame(u_kn1)
    u_nk2 = pd.DataFrame(u_kn2)

    mbar1 = pymbar.MBAR(u_kn=u_kn1, N_k=N_k1, verbose=True,
                        solver_protocol='robust')
    mbar2 = pymbar.MBAR(u_kn=u_kn2, N_k=N_k2, verbose=True,
                        solver_protocol='robust')

    a = mbar1.compute_free_energy_differences()
    b = mbar2.compute_free_energy_differences()
    print(a['Delta_f'])
    print(b['Delta_f'])

Furthermore, can we use alchemlyb for analysis? But I don't know how to create its input as follows.

egallicc commented 2 months ago

Thank you.

Sorry, I am not familiar with alchemlyb's input. Presumably, it would be related to the output of bias_fcn() as for MBAR.

Gallicchio-Lab / AToM-OpenMM

The problem with pymbar analysis. #87