Closed Feriolet closed 4 months ago
The required implementation is a little bit more complex.
stereo_id
): 1, 2, etcThe former will change the original names that can be considered as unexpected behavior. The latter will require an additional restriction in database table, that the combination of id
and stereo_id
should be unique. The latter choice is preferable from my point of view.
A prototype of the function which will enumerate isomers is below (if the issue with STEREOANY was fixed, than the function can be greatly simplified)
def get_isomers(mol):
opts = StereoEnumerationOptions(tryEmbedding=True, maxIsomers=32)
# this is a workaround for rdkit issue - if a double bond has STEREOANY it will cause errors at
# stereoisomer enumeration, we replace STEREOANY with STEREONONE in these cases
try:
isomers = tuple(EnumerateStereoisomers(mol, options=opts))
except RuntimeError:
for bond in mol[1].GetBonds():
if bond.GetStereo() == Chem.BondStereo.STEREOANY:
bond.SetStereo(Chem.rdchem.BondStereo.STEREONONE)
isomers = tuple(EnumerateStereoisomers(mol, options=opts))
return isomers
I would also assign a fixed seed to enumerate the same isomers each time.
The argument - maximum number of stereoisomers should be inserted in the list of input arguments of the run_dock
script. The default value is 1, so at least one isomer will be generated, if needed.
As a consequence of these changes, the script get_sdf_from_db
should also be changed in a way that retrieved molecule names may optionally include the value from stereo_id
field as a suffix. So, mol names can be generated on the fly by concatenating id
and stereo_id
.
This implementation should be tested on both SMILES, 2D and 3D SDF files as input.
These seems all requirements to incorporate this feature. If you will implement this, please test on several examples.
Alright, so what I get from your comment is that:
1) I should both generate the isomers using the get_isomer(mol)
function and include the stereo id
column inside the database.init_db
function
2) Assign a seed (e.g., seed=43) in the get_isomer(mol)
for consistent result
3) add the stereo_id
restriction for other script to differentiate stereoisomers and produce unique filename with _1,2,3 suffix (e.g., the get_sdf_from_db
, ligand_preparation
, etc)
4) add args.max_stereo in the run_dock
script
Yes, you are right. The only remark is on point 3 - these should be unique molecule names, not filenames.
I also missed the main change - all functions which retrieve mol ids from the database or store/update molecules by their ids should be modified. Here the question is how to do that in the best way and should we keep compatibility with the previous version. i would prefer to keep compatibility, if possible, but this may be difficult to achieve. I see one way to keep compatibility is to add an argument complex_id
to all related functions, that will change the way how these functions treat ids. If complex_id
is True
then the suffix in mol id should be recognized as stereo_id
. In this way we may try to keep compatibility at the level of scripts and API (we have other tools which import easydock as a module and use its functions directly). If you have other suggestions I would like to discuss them.
Ah right. I realized that the script produce 1 filename, so my bad. Can you elaborate on the compatibility issue (I am not familiar with what compatibility I should keep and what does "compatibility with previous version" mean)? I have modified the script to accommodate the stereoisomers by adding the argument stereo_id
into each function that retrieve mol ids from the database and also return stereo_id
in the end should it is received as an input to another function.
My current script:
### Initializing the variable
mol.SetProp('_Name',mol_id)
mol.SetProp('_StereoName',stereo_id)
### then return mol_id, stereo_id in every function affected instead of just mol_id
Reading back your reply again, should I have the function return a different variable with if-else condition?
def mol_dock(mol,config):
mol_id = mol.GetProp('_Name')
stereo_index = mol.GetProp('_StereoName')
# rest of the function
if complex_id:
return mol_id, stereo_index, output
else:
return mol_id, output
But, it will give a lot of if-else statement for the function affected by it
Btw here is the result for my temporary modified script, docked with a random protein and ligand
Saving stereo_id
in a mol property is smart, but this output (mol_id, stereo_index, output
) will create some difficulties for future extensions. Function mol_dock
was implemented as the main interface function which connects EasyDock and a particular docking program. It takes a Mol object and config and returns molecule id and output data as a dictionary. Therefore, I would prefer to keep this interface as simple as possible to easier incorporate other docking programs.
The main changes, I expect, should be made in database.py
where stereo_id should be treated more or less automatically. There is a number of functions which may be affected in some way. The main idea is:
mol_id
from the database as a single value by concatenation of id
and stereo_id
fields on-the-fly, e.g. CHEMBL235_1mol_id
as a concatenated value in input arguments (to properly assign new data in the database).We can implement this feature with stereoisomers in two iterations. On the first one, we implemented stereo_id as a strictly necessary field. On the second one, I'll look whether it is possible and reasonable to make this requirement optional or not.
Okay. So to make it less complicated, we will have a mol_id
to contain both id
and stereo_id
. I have re-edit the script to initialize the variable in the select_mols_to_dock
function with the following:
mol.SetProp('_Name', mol_id +'_'+stereo_index)
Then, is that it? I found that I do not need to edit much more functions other than the database-related functions (i.e., init_db
to assign stereo_id
to the database and update_db
where I split the mol_id
to the respective ids:
def update_db(db_conn, mol_id, data, table_name='mols', commit=True,complex_id=True):
"""
:param db_conn:
:param mol_id: is of a molecule to update values
:param data: dict of column names and values to update
:param table_name:
:return:
"""
if data:
if complex_id:
stereo_index = mol_id.split('_')[1]
mol_id = mol_id.split('_')[0]
else:
stereo_index = ''
cols, values = zip(*data.items())
db_conn.execute(f"""UPDATE {table_name}
SET {', '.join(['%s = ?'] * len(cols))},
time = CURRENT_TIMESTAMP
WHERE
id = ? AND stereo_id = ?
""" % cols, list(values) + [mol_id,stereo_index])
if commit:
db_conn.commit()
I have also do not know what does the complex_id
will do for the easydock (maybe it will be convenient for your other program that use functions related to this). I have also found that EnumerateStereoisomer(mol)
does not produce exceptions (I use the try-except to generate the isomer and differentiate complex_id) if it does not found any stereoisomer (e.g., NAD), which I am not sure if that would cause any other problem outside of my use case.
The SDF output did give the concatenated mol_id
if that is okay for you.
Yes, this implementation is exactly what I thought about.
I would suggest to make the code compact and more errorprone (in your version if the main mol_id contains an underscore, a wrong stereo_id will be generated, e.g. CHEMBL234_5_2, where stereo_id
is 2):
mol_id, stereo_id = mol_id.rsplit('_', 1)
Regarding the error in enumeration of stereoisomers. The error was occurred if a bond at a chiral center is labeled as any (wavy bond in visual editors). Previously it caused a Runtime error
. I did not test latest versions of RDKit.
Oh, I never knew such function exists. I initially plan to use split and join to improve the code, but this is a better solution.
Anyway, I think that is pretty much done. Thanks for helping me adding the stereoisomer feature. I got the desired result for my use case. Let me know if you need me to send you the code (I have also modified the add_protonation function since I do not have chemaxon license, but it is a different mess on itself).
Btw, my use case is a single local machine with SMILES/sdf input with vina program. I have not tried other cases such as cluster machines and other docking programs, so i am not sure if you want the code from me haha.
Again, thank you so much for the help!
Please make a pull request to the main branch. Then I'll be able to review the code and we may finalize all these changes and test them.
Alright, I have created a pull request :D
Solved #21
Hi! I have read the easydock paper and i am interested if it is possible to add the stereoisomer enumeration in the source code. The paper mentioned that it is recommended to either use the RDKit function (I assume it is the rdkit.Chem.EnumerateStereoisomers) or the gen_stereo_rdkit.py script. I am new to RDKit as a whole so I am unsure of what to do. As I tried to understand the script, I don't really know exactly which file I should modify to add this. I thought of modifying these following scripts (PSEUDOCODE), but I am afraid it will interfere with the total number of files going through each node.
The output_fname can be modified to outputfname + '' + index
Again, I don't fully understand the whole script, and would appreciate it if there is a rough outline (or specific script/function) that can be modified to include the stereo isomer.