ci-lab-cz / easydock

BSD 3-Clause "New" or "Revised" License
35 stars 12 forks source link

Different get_isomers() result when using multiprocessing #25

Closed Feriolet closed 3 months ago

Feriolet commented 4 months ago

I am trying to use multiprocessing to increase the computational speed of running the init_db() function, but I found that there is a mismatch of the result when I use multiprocessing.

Using multiprocessing seems to produce less isomers overall compared when not using it. Below is the function that I use for multiprocessing. While using subprocess increase the time significantly (30 CPUs: 90s for 250k molecules compared to 5h without using it), I was wondering if there is any explanation for why it produces less isomer. I also noticed that some isomers produced without multiprocessing tends to have the same vina score.

How to reproduce(2 'problematic' SMILES):

c1cn(nn1)[C@H]2C[NH+](C[C@H]2NC(=O)C3CC4(C3)CCC4)CCCO
C[NH+]1CCCC[C@H]1C(=O)NCc2cn[nH]c2[C@@H]3CC[NH+](C3)CCOC

run_dock -i smile/train_smiles_final_updated.smi -o docked/multi_train_smiles_final_updated.db --program vina --config config.yml -c 30 --sdf -s 5 --protonation dimorphite

def small_init_db(lists, max_stereoisomers, prefix):
    mol, mol_name = lists
    if prefix:
        mol_name = f'{prefix}-{mol_name}'
    if mol_is_3d(mol):
        smi = Chem.MolToSmiles(mol, isomericSmiles=True)
        return [['mol', (mol_name, 0, smi, Chem.MolToMolBlock(mol))]]

    else:
        isomer_list = []
        isomers = get_isomers(mol, max_stereoisomers)
        for stereo_id, stereo_mol in enumerate(isomers):
            smi = Chem.MolToSmiles(stereo_mol, isomericSmiles=True)
            isomer_list.append(['smi', (mol_name, stereo_id, smi)])
        return isomer_list

def init_db(db_fname, input_fname, ncpu, max_stereoisomers=1, prefix=None):
    start = time.time()

    pool = Pool(processes=ncpu)
    conn = sqlite3.connect(db_fname)
    cur = conn.cursor()
    data_smi = []  # non 3D structures
    data_mol = []  # 3D structures
    prod_x = partial(small_init_db, max_stereoisomers=max_stereoisomers, prefix=prefix)

    result_list = pool.map(prod_x, read_input.read_input(input_fname), chunksize=1)

    for result in result_list:
        try:
            for subresult in result:
                if subresult[0] == 'smi':
                    data_smi.append(subresult[1])
                else:
                    data_mol.append(subresult[1])
        except TypeError:
            print(result)

    cur.executemany(f'INSERT INTO mols (id, stereo_id, smi) VALUES(?, ?, ?)', data_smi)
    cur.executemany(f'INSERT INTO mols (id, stereo_id, smi, source_mol_block) VALUES(?, ?, ?, ?)', data_mol)
    conn.commit()

    end = time.time()
    print(end - start)
DrrDom commented 4 months ago

And what was the output in those cases? Did you use protonated smiles as input?

I noticed that both these structures have positively charged nitrogens which formally becomes chiral after protonation.

Just a wild guess. In multiprocessing some internal properties are discarded and this leads to different sets of isomers. This may be fixed by adding to the main call the following line Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.AllProps). I did it but in particular functions, not on the highest level of the program. So, currently this call is not visible to all parts of the program.

Actually it could be a good idea to dock both isomers based on such nitrogens. However, it may require to change the order of protonation and stereosiomer generation or make an additional step to continue enumeration of stereoisomers after protonation for such cases. I would rate this as a minor issue, but may worth to solve it.

Feriolet commented 4 months ago

Yes, the input of the SMILES was protonated. The intended output should have two stereoisomer for the first SMILES and three for the second SMILES.

Anyway, you are right. Calling Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.AllProps) to the init_db() function does solve the problem. The result now matches with the current version of easydock.

It can be a good idea to generate stereoisomer after protonation, but would that mean that we may have to add more rows to the existing sql table? Just a thought.

DrrDom commented 4 months ago

It can be a good idea to generate stereoisomer after protonation, but would that mean that we may have to add more rows to the existing sql table? Just a thought.

Yes, it seems this is the easiest way to solve. After protonation (if it was called), check protonated smiles in DB whether they contain undefined chiral centers, generate additional stereoisomers for these compounds and store them to DB as additional rows. However, this should be made optional to not break compatibility with our other tools. This could be a separate function which is called after add_protonation. In this way it will be maximally explicit and flexible.

Feriolet commented 4 months ago

Is there a quick way to identify compounds with undefined chiral center? I tried to find existing function that serve this purpose and they seem to search the atom chirality one by one. I have not tested the function, but it would be nice if it does not run too long for large compounds.

I assume that the pipeline would be:

  1. Use SQL to SELECT protonated_smi from table
  2. SMILES to mol
  3. filter SMILES with Chem.ChiralType.CHI_UNSPECIFIED
  4. rerun it with get_isomer(mol)
  5. Use another function to continue stereo_id (if last stereo_id=5 at the first enumeration, the additional row for second enumeration should have stereo_id = 6,7,...)
  6. Add another row at the very bottom

I initially worry that the order of the smiles would be affected, but I guess it would not matter much since the order was already there in the existing .db file, and using MIN(docking_score) in save_sdf() would pretty much make the order the same.

https://docs.openforcefield.org/projects/toolkit/en/stable/_modules/openff/toolkit/utils/rdkit_wrapper.html

    def _find_undefined_stereo_atoms(rdmol, assign_stereo=False):
        """Find the chiral atoms with undefined stereochemsitry in the RDMol.

        Parameters
        ----------
        rdmol : rdkit.RDMol
            The RDKit molecule.
        assign_stereo : bool, optional, default=False
            As a side effect, this function calls ``Chem.AssignStereochemistry()``
            so by default we work on a molecule copy. Set this to ``True`` to avoid
            making a copy and assigning the stereochemistry to the Mol object.

        Returns
        -------
        undefined_atom_indices : list[int]
            A list of atom indices that are chiral centers with undefined
            stereochemistry.

        See Also
        --------
        rdkit.Chem.FindMolChiralCenters

        """
        from rdkit import Chem

        if not assign_stereo:
            # Avoid modifying the original molecule.
            rdmol = copy.deepcopy(rdmol)

        # Flag possible chiral centers with the "_ChiralityPossible".
        Chem.AssignStereochemistry(rdmol, force=True, flagPossibleStereoCenters=True)

        # Find all atoms with undefined stereo.
        undefined_atom_indices = []
        for atom_idx, atom in enumerate(rdmol.GetAtoms()):
            if atom.GetChiralTag() == Chem.ChiralType.CHI_UNSPECIFIED and atom.HasProp(
                "_ChiralityPossible"
            ):
                undefined_atom_indices.append(atom_idx)
        return undefined_atom_indices
DrrDom commented 4 months ago

A little bit more details

  1. Use SQL to SELECT id, smi, protonated_smi FROM mols WHERE source_mol_block IS NULL (if source_mol_block is not null, than this is a 3D molecule and all stereoconfigurations are defined)
  2. protonated SMILES to mol
  3. Use https://www.rdkit.org/docs/source/rdkit.Chem.html#rdkit.Chem.FindMolChiralCenters and check whether at least one ? is in the output
    Chem.FindMolChiralCenters(Chem.MolFromSmiles('C1C[C@H](C)C(C)[C@H](C)C1'),includeUnassigned=True)
    [(2, 'S'), (4, '?'), (6, 'R')]
  4. if True rerun mol with get_isomer(mol) - with a large number of max_isomers to enumerate all isomers (I do not expect that there will be many such centers in a molecule, so it will be safe)
  5. Implement all these in a separate function to continue stereo_id (if last stereo_id=5 at the first enumeration, the additional row for second enumeration should have stereo_id = 6,7,...)
  6. Add another row at the very bottom - this function should also add non-protonated smiles as well as protonated one. Non-protonated smiles can be taken from the previously added entry.