ci-lab-cz / easydock

BSD 3-Clause "New" or "Revised" License
38 stars 13 forks source link

Sampling of ring conformations before docking #33

Open DrrDom opened 6 months ago

DrrDom commented 6 months ago

For molecules with saturated ring systems (e.g. sugars) it can be useful to implement sampling of initial ring conformations, dock them separately and choose best output. This looks critically important for such molecules.

Solution 1. This can be implemented in a general way.

  1. Every new state of a molecule should be stored as a separate record. Currently it may be tautomeric state, protonation state and conformation (further states can be added).
  2. These states are enumerated before docking will start during the DB preparation step.
  3. A column state_id should be added to DB to designate different states.
  4. A script which retrieve best results from DB should be adapted to return results for only best scored output state.

Drawbacks:

Alternative solution 1. Every type of a possible state can be designated to its own column in DB, e.g. tautomer_id, protonatrion_id, conf_id. This gives more control over results, but the same drawback will persist.

Solution 2. Starting conformers can be enumerated by ligand_preparation function which will return a list of PBDQT blocks (instead of a single one like now) but only for molecules which require this (comprising saturated rings and maybe something else). After docking the best output is selected among several ones (in mol_dock function) and it is returned and stored in DB. Thus for a user of mol_dock function nothing will change and DB structure can be kept as is.

The number and diversity of generated conformers should be controlled to keep only distinct ones. This may require some experiments to derive empirical thresholds. Issues may also occur with compounds having several saturated rings.

Currently solution 2 looks preferable as it will require less changes.

Feriolet commented 6 months ago

If I interpret this right, will this mean that we are doing a 'multiple conformer docking' for the saturated ring compounds?

Also, would the second solution increase the overall docking time? Some molecules would have over tens of conformers generated by rdkit, so is it possible to shortlist them based over some metrics (maybe like RMSD)?

Having multiple conformers as a separate record is nice to compare them side by side for the analysis, but then it would mean we have to make another id column for it and changing the get_sdf_from_dock_db script similar to the protonation feature last time. I agreed that solution 2 will be more straightforward to solve this problem.

DrrDom commented 6 months ago

I have more thoughts about this issue and the way how to implement this. I like solution 2 as a simpler one. I think that analysis of docking outputs of different starting conformations is a special case and it can be achieved by submitting pre-generated starting conformers as individual molecules (without any changes in the code). The only requirement is to give them different names. Therefore I would prefer to implement the second approach trying to avoid modification of the database structure.

The most tricky thing will be the selection of conformers. It is easy to generate multiple conformers instead of a single one in mol_embedding_3d for molecules with saturated rings, but we are interested in conformers only with diverse ring conformations. The general idea is to sample conformers and select from them only those which have different conformations of saturated ring systems. I have a function (see below) which performs clustering based on pairwise RMSD values between conformers. It should be modified to take into account only ring atoms.

from sklearn.cluster import AgglomerativeClustering
import numpy as np
from rdkit.Chem.AllChem import GetConformerRMSMatrix
from rdkit import Chem

def remove_confs_rms(mol, rms=0.25, keep_nconf=None):
    """
    The function uses AgglomerativeClustering to select conformers.

    :param mol: input molecule with multiple conformers
    :param rms: discard conformers which are closer than given value to a kept conformer
    :param keep_nconf: keep at most the given number of conformers. This parameter has precedence over rms
    :return:
    """

    def gen_ids(ids):
        for i in range(1, len(ids)):
            for j in range(0, i):
                yield j, i

    if keep_nconf and mol.GetNumConformers() <= keep_nconf:
        return mol

    if mol.GetNumConformers() <= 1:
        return mol

    mol_tmp = Chem.RemoveHs(mol)   # calc rms for heavy atoms only
    rms_ = GetConformerRMSMatrix(mol_tmp, prealigned=False)

    cids = [c.GetId() for c in mol_tmp.GetConformers()]
    arr = np.zeros((len(cids), len(cids)))
    for (i, j), v in zip(gen_ids(cids), rms_):
        arr[i, j] = v
        arr[j, i] = v
    if keep_nconf:
        cl = AgglomerativeClustering(n_clusters=keep_nconf, linkage='complete', metric='precomputed').fit(arr)
    else:
        cl = AgglomerativeClustering(n_clusters=None, linkage='complete', metric='precomputed', distance_threshold=rms).fit(arr)

    keep_ids = []
    for i in set(cl.labels_):
        ids = np.where(cl.labels_ == i)[0]
        j = arr[np.ix_(ids, ids)].mean(axis=0).argmin()
        keep_ids.append(cids[j])
    remove_ids = set(cids) - set(keep_ids)

    for cid in sorted(remove_ids, reverse=True):
        mol_tmp.RemoveConformer(cid)

    return mol_tmp

The function GetConformerRMSMatrix has an argument atomIds where we may pass ids of rings but they will be used only for alignment but not for calculation of RMSD. Therefore, we may copy-paste this function and GetConformerRMS function which is used inside and modify them to use atomsIds for RMSD calculation as well. This is very easy. In the latter function the cycle below should be made over atomIds not range of all atoms (the number of atoms should be modified accordingly).

  ssr = 0
  for i in range(mol.GetNumAtoms()):
    d = conf1.GetAtomPosition(i).Distance(conf2.GetAtomPosition(i))
    ssr += d * d
  ssr /= mol.GetNumAtoms()
  return numpy.sqrt(ssr)

Afterwards, we have to test which values of arguments rms or keep_nconf should be used by default and set these values. I would like to keep the interface as simple as possible and an inexperienced user nevertheless will not be able to choose proper values. We may provide in the command line an argument sample_rings asTrue/False and everything else will use the default parameters or run ring sampling always (this will be much simpler for implementation).

The more tricky thing is to implement treatment of compounds with multiple rings. We cannot directly compute RMSD between corresponding atoms of conformers in all rings at the same time, because relative position of rings will depend on a conformation of a linker, but we are interested only in differences in conformations of rings alone. So, we would need to modify remove_confs_rms.

  1. We can get the set of rings by Chem.GetSymmSSSR(mol), which returns the nested list of atom ids of smallest rings.
  2. Keep only rings which have sp3-hybrydized atoms.
  3. For every remaining ring we compute RMSD after alignment by this ring.
  4. Afterwards we "average" RMSD across all rings in a molecule. This will not be a mathematical average, because rings may have different number of atoms, but it is easy to compute by backtracking the RMSD formula.

There may be a need to make some experiments to determine the number of conformers which is reasonable to keep for molecules with different ring sizes and number of rings. For example, instead of a hard coded value of keep_nconf we may calculate it on the fly depending on a molecule. Additionally rms parameter should be considered to avoid selection of too similar conformers. Using of rms as a single threshold may be less useful, because does not provide clear understanding on the number of conformers and may result in too many starting conformers and unpredictable run time.

To understand which parameters may be reasonable, we may play with mono-, di- and trisaccharides, spirosystems, triterpenes and condensed saturated and aromatic ring systems. But first we need to implement this protocol.

Some additional notes. It would be desirable to make implementation as abstract and modular as possible to be able to re-use in future by simple import of corresponding functions. The current major entry point is ligand_preparation, but individual new functions should be self-consistent. If in future we will implement other docking programs with a different ligand preparation step, we should easily re-use these functions to implement the same functionality. Unfortunately all these changes makes incorporation of new docking programs a little bit more difficult, but I have to create a description of requirements of introduction of new docking programs to make it clear (in some future).

@Feriolet, if you would be able to help with that, I'll be very happy. This will help us in our current project and will be useful to other users. If you will do this, I would like to ask you to disclose your identity to make a corresponding acknowledgment in the paper which I hope we will submit soon. You may send me email to pavel_polishchuk@ukr.net.

Feriolet commented 6 months ago

I would be more than happy to help for this feature. I have sent you an email for it.

It seems that the parameter tuning for rms and keep_nconf will be quite tricky depending on the ring input. I think it would be ideal if we can predict the keep_nconf value depending on the input to put into the AgglomerativeClustering class (if it is a 6-membered ring, we can maybe expect at most around 4 conformations: chair, boat, half-boat, twisted-boat). I don't think it is good if a 30-conformer molecule can be represented by 3 conformer, but we hard-coded keep_nconf to 10, for example.

The function GetConformerRMSMatrix has an argument atomIds where we may pass ids of rings but they will be used only for alignment but not for calculation of RMSD. Therefore, we may copy-paste this function and GetConformerRMS function which is used inside and modify them to use atomsIds for RMSD calculation as well. This is very easy. In the latter function the cycle below should be made over atomIds not range of all atoms (the number of atoms should be modified accordingly).

If I understand correctly, we would modify the GetConformerRMSMatrix function to align only the saturated ring and then calculate the RMSD of the ring only, right?

DrrDom commented 6 months ago

Thank you for your help and contribution!

If I understand correctly, we would modify the GetConformerRMSMatrix function to align only the saturated ring and then calculate the RMSD of the ring only, right?

Yes. We have to adjust GetConformerRMSMatrix and GetConformerRMS to calculate RMSD based on atomIds.

Regarding parameters to choose, I agree that it would be better to estimate them for each molecule individually. I think that we can somewhat modify remove_confs_rms function. We may first apply rms criterion to select diverse conformers based on RMSD and after that apply 'keep_nconf' criterion to select a reduced subset of diverse conformers. However, we will probably have to limit the maximum number of conformers to some reasonable value. It may become clear after experiments.

Regarding some reasonable 'keep_nconf' number. Even for 6-membered rings it may be enough to have just 2 conformers: chair and twisted, which I met most frequently in X-ray. I expect RDKit incorporates this empirical knowledge and generates conformers more aligned to X-ray structures.

Besides of conformations of a ring itself, it is desirable that these conformations will discriminate orientation of substituents (e.g. axial or equatorial for 6-membered rings). So it may be reasonable to add to the list of atoms for every cycle the atoms which are attached to this ring to better sample conformations. We also have macrocycles, which is tricky to sample. However, I propose to limit the current implementation on 7-membered rings maximum and apply the same criteria as for 7-membered ring systems to larger cycles.

DrrDom commented 6 months ago

I found an error in a function remove_confs_rms. I updated the function. The error was in keep_ids.append(ids[j]), should be keep_ids.append(cids[j])

Feriolet commented 6 months ago

I see. Yes, I did not consider the axial-equitorial differences for each of the conformation itself, which is quite important too.

I think I got the general grasp of what we should do for the feature. How can I start experimenting the feature? Should I make a PR for a functional implementation first before optimising the rms and keep_nconf?

DrrDom commented 6 months ago

Yes, you can make a PR with an implemented pipeline and necessary changes. Afterwards we will implement conformer selection and will adjust parameters. The latter should not change the pipeline, it may change only interface of functions by adding some new arguments.

Feriolet commented 6 months ago

I am a bit stuck on how do I convert mol with multiple conformers to molblock. If I understand it correctly, the mol object is converted to pdbqt and then to molblock object. When converting to pdbqt, I assume the conformers information is not retained.

So, do I loop over the conformer id in the mk_prepare_ligand(mol) function? I also assume that this will also change how we should treat the function that receives the pdbqt_string variable like the pdbqt2molblock function

def mk_prepare_ligand(mol, verbose=False):
    pdbqt_string_list = []
    preparator = MoleculePreparation(hydrate=False, flexible_amides=False, rigid_macrocycles=True, min_ring_size=7, max_ring_size=33)
    try:
        cids = [x.GetId() for x in mol.GetConformers()]
        for cid in cids:
            mol_setups = preparator.prepare(mol, conformer_id=cid)
            for setup in mol_setups:
                pdbqt_string, is_ok, error_msg = PDBQTWriterLegacy.write_string(setup)
                pdbqt_string_list.append(pdbqt_string)
                if verbose:
                    print(f"{setup}")
    except Exception:
        sys.stderr.write(
            "Warning. Incorrect mol object to convert to pdbqt. Continue. \n"
        )
        traceback.print_exc()
        pdbqt_string = None

    return pdbqt_string_list
DrrDom commented 6 months ago

The code is correct. There will not be many changes in other functions. The function ligand_preparation will return a list of pdbqt blocks. In the function mol_dock you may create a for loop and iterate over each pdbqt block, run docking sequentially and collect output from each run. The final step will be selection of best output and return them along with docking time, which will now be computed as a total time over all runs.

# part of mol_dock function
start_time = timeit.default_timer()
for ligand_pdbqt in pdbqt_list:
    output_fd, output_fname = tempfile.mkstemp(suffix='_output.json', text=True)
    ligand_fd, ligand_fname = tempfile.mkstemp(suffix='_ligand.pdbqt', text=True)

    try:
        with open(ligand_fname, 'wt') as f:
            f.write(ligand_pdbqt)

        p = os.path.realpath(__file__)
        python_exec = sys.executable
        cmd = f'{python_exec} {os.path.dirname(p)}/vina_dock_cli.py -l {ligand_fname} -p {config["protein"]} ' \
              f'-o {output_fname} --center {" ".join(map(str, config["center"]))} ' \
              f'--box_size {" ".join(map(str, config["box_size"]))} ' \
              f'-e {config["exhaustiveness"]} --seed {config["seed"]} --nposes {config["n_poses"]} -c {config["ncpu"]}'
        subprocess.run(cmd, shell=True, check=True, capture_output=True, text=True)  # this will trigger CalledProcessError and skip next lines)
        # this block should be revised, we can read all outputs to a list of dicts and after the for loop ends select the best result and add dock_time
        with open(output_fname) as f:
            res = f.read()
            if res:
                res = json.loads(res)
                mol_block = pdbqt2molblock(res['poses'].split('MODEL')[1], mol, mol_id)
                output = {'docking_score': res['docking_score'],
                          'pdb_block': res['poses'],
                          'mol_block': mol_block,
                          'dock_time': dock_time}   # this will not be there, it will be added at the end of the function

dock_time = round(timeit.default_timer() - start_time, 1)

# select best output and add dock_time