Closed LAndersen1 closed 5 months ago
I think it makes sense to solve the problem directly in easydock, because easydock takes care of (nearly) all preprocessing steps for docking (like 3D-coordinate generation, protonation, computation of partial charges, etc...)
The following code would solve the problem for most of the cases:
import rdkit.Chem.SaltRemover
remover = rdkit.Chem.SaltRemover.SaltRemover()
mol = remover.StripMol(mol,dontRemoveEverything=True)
Thank you for reporting the issue and the suggested solution.
I see that the issue may be broader. The list of salts is very short in rdkit. So, it is still possible that an input multi-component molecule will remain multi-component after SaltRemover
. Therefore, we may think about a little bit more general solution which will be also explicit for a user.
During the preparation step we can remove salts and check whether an output molecule is still multi-component or not. If yes, we will omit it completely. In both cases, salt removal and molecule omitting, we will report this to STDERR. The downside of this solution is that we will modify input SMILES keeping molecule id intact and a user, if forgot to look at STDERR, may be surprised that input SMILES in the database differ from initial ones.
I would like to make this step even more explicit but I do not know how to do that.
I added this feature to master
. Now, by default all input molecules will be checked for the number of components and SaltRemover
will be applied if there is more than one component. If this will result in a multi-component molecule again, the whole molecule will be skipped (even not written to DB).
I guess one suggestion to make it more explicit may be to add another column in the .db
file with a column name like 1) is_smi_edited
between smi
and smi_protonated
column and filled it with True
and False
or 2) smi_input
column before the smi
column.
smi_input
may be a good idea. However, in this case for complementarity we have to add source_mol_block_input
to handle SDF inputs.
init_db
.I guess that we may edit generate_init_data
. However, this will add two more output fields smi_input
and source_mol_block_input
. If salt stripping fails then values of returned smi
and source_mol_block
fields should be None
.
@Feriolet, may I ask you to try to implement this?
I hope this will not complicate the code too much, because what I'm currently thinking, this will create another bifurcation point in return statement of generate_init_data
, as return None
should be replaced with two return
s for smi
and source_mol_block
input.
Sure I can try to add it in the generate_init_data
and init_db
. One question though, does sdf input go to the source mol block
? In my experience, the sdf input goes to the smi
column, so I am actually unsure what input goes to the source mol
SDF input goes to source_mol_block
if this is a 3D molecule, otherwise to smi
. There is a function mol_is_3d
which can be used to check each molecule.
I have tried to change both init
function. Please see the image if I have interpret it correctly before I make the PR.
Smiles used:
c1cn(nn1)[C@H]2C[NH+](C[C@H]2NC(=O)C3CC4(C3)CCC4)CCCO
CC1=C(N2C=CC=NC2=N1)C3=CSC(=N3)NC4=CC=C(C=C4)O.Br
[NH-]C(=O)c1ccc[n+]([C@@H]2O[C@H](CO[P@](=O)([O-])O[P@](=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1 NAD 0
CC1=CC=C(C=C1)C2=[N+](SC(=N2)N3CCOCC3)C4=CC=CC=C4.[O-]Cl(=O)(=O)=O
CCCC[C@@H](CN[C@@H](CCCC)C(=O)[N-][C@@H](CCC([NH-])=O)C(=O)[N-][C@@H](CCCNC(N)=[NH2+])C([NH-])=O)[N-]C(=O)[C@@H]([N-]C(=O)[C@@H]([N-]C(C)=O)[C@@H](C)O)[C@@H](C)CC 2NC 0
Look good! The only minor issue is missing id
for a skipped molecule. I expected that we would keep it, as it is an integral part of input data.
Ok, I have made a PR to add the id
field as well.
Excellent! Thank you
While running easydock, I figured that if there are salts inside a ligand that is being docked (the salt part is marked in the smiles via a .), it causes the following error:
This error is generated through molecules like this
CC1=C(N2C=CC=NC2=N1)C3=CSC(=N3)NC4=CC=C(C=C4)O.Br
CC1=CC=C(C=C1)C2=[N+](SC(=N2)N3CCOCC3)C4=CC=CC=C4.[O-]Cl(=O)(=O)=O
and happens for nearly all (of my tested) molecules that contain a salt group (indicated via the .Br at the end of the smiles string)