ci-lab-cz / easydock

BSD 3-Clause "New" or "Revised" License
35 stars 13 forks source link

Salts in molecules cause problems #30

Closed LAndersen1 closed 5 months ago

LAndersen1 commented 5 months ago

While running easydock, I figured that if there are salts inside a ligand that is being docked (the salt part is marked in the smiles via a .), it causes the following error:

Traceback (most recent call last): File ".../lib/python3.11/site-packages/easydock/preparation_for_docking.py", line 158, in pdbqt2molblock rdkitmol_list = RDKitMolCreate.from_pdbqt_mol(pdbqt_mol) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File ".../lib/python3.11/site-packages/meeko/rdkit_mol_create.py", line 224, in from_pdbqt_mol mol = cls.add_pose_to_mol(mol, coordinates, index_map) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File ".../lib/python3.11/site-packages/meeko/rdkit_mol_create.py", line 322, in add_pose_to_mol raise RuntimeError("Only H allowed to be in SMILES but not in PDBQT") RuntimeError: Only H allowed to be in SMILES but not in PDBQT Parsing PDB was failed: 9134_0

This error is generated through molecules like this

CC1=C(N2C=CC=NC2=N1)C3=CSC(=N3)NC4=CC=C(C=C4)O.Br

CC1=CC=C(C=C1)C2=[N+](SC(=N2)N3CCOCC3)C4=CC=CC=C4.[O-]Cl(=O)(=O)=O

and happens for nearly all (of my tested) molecules that contain a salt group (indicated via the .Br at the end of the smiles string)

LAndersen1 commented 5 months ago

I think it makes sense to solve the problem directly in easydock, because easydock takes care of (nearly) all preprocessing steps for docking (like 3D-coordinate generation, protonation, computation of partial charges, etc...)

The following code would solve the problem for most of the cases:

import rdkit.Chem.SaltRemover

remover = rdkit.Chem.SaltRemover.SaltRemover()
mol = remover.StripMol(mol,dontRemoveEverything=True)
DrrDom commented 5 months ago

Thank you for reporting the issue and the suggested solution.

I see that the issue may be broader. The list of salts is very short in rdkit. So, it is still possible that an input multi-component molecule will remain multi-component after SaltRemover. Therefore, we may think about a little bit more general solution which will be also explicit for a user.

During the preparation step we can remove salts and check whether an output molecule is still multi-component or not. If yes, we will omit it completely. In both cases, salt removal and molecule omitting, we will report this to STDERR. The downside of this solution is that we will modify input SMILES keeping molecule id intact and a user, if forgot to look at STDERR, may be surprised that input SMILES in the database differ from initial ones.

I would like to make this step even more explicit but I do not know how to do that.

DrrDom commented 5 months ago

I added this feature to master. Now, by default all input molecules will be checked for the number of components and SaltRemover will be applied if there is more than one component. If this will result in a multi-component molecule again, the whole molecule will be skipped (even not written to DB).

Feriolet commented 5 months ago

I guess one suggestion to make it more explicit may be to add another column in the .db file with a column name like 1) is_smi_edited between smi and smi_protonated column and filled it with True and False or 2) smi_input column before the smi column.

DrrDom commented 5 months ago

smi_input may be a good idea. However, in this case for complementarity we have to add source_mol_block_input to handle SDF inputs.

I guess that we may edit generate_init_data. However, this will add two more output fields smi_input and source_mol_block_input. If salt stripping fails then values of returned smi and source_mol_block fields should be None.

DrrDom commented 5 months ago

@Feriolet, may I ask you to try to implement this?

I hope this will not complicate the code too much, because what I'm currently thinking, this will create another bifurcation point in return statement of generate_init_data, as return None should be replaced with two returns for smi and source_mol_block input.

Feriolet commented 5 months ago

Sure I can try to add it in the generate_init_data and init_db. One question though, does sdf input go to the source mol block? In my experience, the sdf input goes to the smi column, so I am actually unsure what input goes to the source mol

DrrDom commented 5 months ago

SDF input goes to source_mol_block if this is a 3D molecule, otherwise to smi. There is a function mol_is_3d which can be used to check each molecule.

Feriolet commented 5 months ago

I have tried to change both init function. Please see the image if I have interpret it correctly before I make the PR.

Screenshot 2024-04-18 at 6 21 11 PM Screenshot 2024-04-18 at 6 42 38 PM

test_inputfield.zip

Smiles used:

c1cn(nn1)[C@H]2C[NH+](C[C@H]2NC(=O)C3CC4(C3)CCC4)CCCO
CC1=C(N2C=CC=NC2=N1)C3=CSC(=N3)NC4=CC=C(C=C4)O.Br
[NH-]C(=O)c1ccc[n+]([C@@H]2O[C@H](CO[P@](=O)([O-])O[P@](=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1  NAD 0
CC1=CC=C(C=C1)C2=[N+](SC(=N2)N3CCOCC3)C4=CC=CC=C4.[O-]Cl(=O)(=O)=O
CCCC[C@@H](CN[C@@H](CCCC)C(=O)[N-][C@@H](CCC([NH-])=O)C(=O)[N-][C@@H](CCCNC(N)=[NH2+])C([NH-])=O)[N-]C(=O)[C@@H]([N-]C(=O)[C@@H]([N-]C(C)=O)[C@@H](C)O)[C@@H](C)CC  2NC 0
DrrDom commented 5 months ago

Look good! The only minor issue is missing id for a skipped molecule. I expected that we would keep it, as it is an integral part of input data.

Feriolet commented 5 months ago

Ok, I have made a PR to add the id field as well.

DrrDom commented 5 months ago

Excellent! Thank you