ci-lab-cz / easydock

BSD 3-Clause "New" or "Revised" License
35 stars 12 forks source link

adding stereoisomer enumeration #20

Closed Feriolet closed 4 months ago

Feriolet commented 4 months ago

Hi! I have read the easydock paper and i am interested if it is possible to add the stereoisomer enumeration in the source code. The paper mentioned that it is recommended to either use the RDKit function (I assume it is the rdkit.Chem.EnumerateStereoisomers) or the gen_stereo_rdkit.py script. I am new to RDKit as a whole so I am unsure of what to do. As I tried to understand the script, I don't really know exactly which file I should modify to add this. I thought of modifying these following scripts (PSEUDOCODE), but I am afraid it will interfere with the total number of files going through each node.

vina_dock.mol_dock(mol,...):
    stereoisomer_tuple = EnumerateStereoisomer(mol)
    for index, mol in enumerate(stereoisomer_tuple):
        ligand_pdbqt = ligand_preparation(mol)
        ..... (the rest of the code)

The output_fname can be modified to outputfname + '' + index

Again, I don't fully understand the whole script, and would appreciate it if there is a rough outline (or specific script/function) that can be modified to include the stereo isomer.

DrrDom commented 4 months ago

The required implementation is a little bit more complex.

  1. Stereoisomers should be explicitly enumerated and stored into the database and individual entries. This means that this should be done during the preliminary step of database population with molecules for docking, not during the preparation molecules for docking..
  2. All stereoisomers should have distinct name. This can be solved in two ways:
    • modification of original mol names with a suffix _1, _2, etc
    • introduction of an additional column in the database table with a number of stereoisomer (stereo_id): 1, 2, etc

The former will change the original names that can be considered as unexpected behavior. The latter will require an additional restriction in database table, that the combination of id and stereo_id should be unique. The latter choice is preferable from my point of view.

  1. A prototype of the function which will enumerate isomers is below (if the issue with STEREOANY was fixed, than the function can be greatly simplified)

    def get_isomers(mol):
    opts = StereoEnumerationOptions(tryEmbedding=True, maxIsomers=32)
    # this is a workaround for rdkit issue - if a double bond has STEREOANY it will cause errors at
    # stereoisomer enumeration, we replace STEREOANY with STEREONONE in these cases
    try:
        isomers = tuple(EnumerateStereoisomers(mol, options=opts))
    except RuntimeError:
        for bond in mol[1].GetBonds():
            if bond.GetStereo() == Chem.BondStereo.STEREOANY:
                bond.SetStereo(Chem.rdchem.BondStereo.STEREONONE)
        isomers = tuple(EnumerateStereoisomers(mol, options=opts))
    return isomers

    I would also assign a fixed seed to enumerate the same isomers each time.

  2. The argument - maximum number of stereoisomers should be inserted in the list of input arguments of the run_dockscript. The default value is 1, so at least one isomer will be generated, if needed.

As a consequence of these changes, the script get_sdf_from_db should also be changed in a way that retrieved molecule names may optionally include the value from stereo_id field as a suffix. So, mol names can be generated on the fly by concatenating id and stereo_id.

This implementation should be tested on both SMILES, 2D and 3D SDF files as input.

These seems all requirements to incorporate this feature. If you will implement this, please test on several examples.

Feriolet commented 4 months ago

Alright, so what I get from your comment is that:

1) I should both generate the isomers using the get_isomer(mol) function and include the stereo id column inside the database.init_db function 2) Assign a seed (e.g., seed=43) in the get_isomer(mol) for consistent result 3) add the stereo_id restriction for other script to differentiate stereoisomers and produce unique filename with _1,2,3 suffix (e.g., the get_sdf_from_db, ligand_preparation, etc) 4) add args.max_stereo in the run_dock script

DrrDom commented 4 months ago

Yes, you are right. The only remark is on point 3 - these should be unique molecule names, not filenames.

I also missed the main change - all functions which retrieve mol ids from the database or store/update molecules by their ids should be modified. Here the question is how to do that in the best way and should we keep compatibility with the previous version. i would prefer to keep compatibility, if possible, but this may be difficult to achieve. I see one way to keep compatibility is to add an argument complex_id to all related functions, that will change the way how these functions treat ids. If complex_id is True then the suffix in mol id should be recognized as stereo_id. In this way we may try to keep compatibility at the level of scripts and API (we have other tools which import easydock as a module and use its functions directly). If you have other suggestions I would like to discuss them.

Feriolet commented 4 months ago

Ah right. I realized that the script produce 1 filename, so my bad. Can you elaborate on the compatibility issue (I am not familiar with what compatibility I should keep and what does "compatibility with previous version" mean)? I have modified the script to accommodate the stereoisomers by adding the argument stereo_id into each function that retrieve mol ids from the database and also return stereo_id in the end should it is received as an input to another function.

My current script:

### Initializing the variable
mol.SetProp('_Name',mol_id)
mol.SetProp('_StereoName',stereo_id)

### then return mol_id, stereo_id in every function affected instead of just mol_id

Reading back your reply again, should I have the function return a different variable with if-else condition?

def mol_dock(mol,config):
    mol_id = mol.GetProp('_Name')
    stereo_index = mol.GetProp('_StereoName')
    # rest of the function

    if complex_id:
       return mol_id, stereo_index, output
    else:
       return mol_id, output

But, it will give a lot of if-else statement for the function affected by it

Feriolet commented 4 months ago

Btw here is the result for my temporary modified script, docked with a random protein and ligand

Screenshot 2024-02-25 at 5 31 28 PM
DrrDom commented 4 months ago

Saving stereo_id in a mol property is smart, but this output (mol_id, stereo_index, output) will create some difficulties for future extensions. Function mol_dock was implemented as the main interface function which connects EasyDock and a particular docking program. It takes a Mol object and config and returns molecule id and output data as a dictionary. Therefore, I would prefer to keep this interface as simple as possible to easier incorporate other docking programs.

The main changes, I expect, should be made in database.py where stereo_id should be treated more or less automatically. There is a number of functions which may be affected in some way. The main idea is:

We can implement this feature with stereoisomers in two iterations. On the first one, we implemented stereo_id as a strictly necessary field. On the second one, I'll look whether it is possible and reasonable to make this requirement optional or not.

Feriolet commented 4 months ago

Okay. So to make it less complicated, we will have a mol_id to contain both id and stereo_id. I have re-edit the script to initialize the variable in the select_mols_to_dock function with the following:

mol.SetProp('_Name', mol_id +'_'+stereo_index)

Then, is that it? I found that I do not need to edit much more functions other than the database-related functions (i.e., init_db to assign stereo_id to the database and update_db where I split the mol_id to the respective ids:

def update_db(db_conn, mol_id, data, table_name='mols', commit=True,complex_id=True):
    """

    :param db_conn:
    :param mol_id: is of a molecule to update values
    :param data: dict of column names and values to update
    :param table_name:
    :return:
    """
    if data:
        if complex_id:
            stereo_index = mol_id.split('_')[1]
            mol_id = mol_id.split('_')[0]
        else:
            stereo_index = ''
        cols, values = zip(*data.items())
        db_conn.execute(f"""UPDATE {table_name}
                        SET {', '.join(['%s = ?'] * len(cols))},
                            time = CURRENT_TIMESTAMP
                        WHERE
                            id = ? AND stereo_id = ?
                        """ % cols, list(values) + [mol_id,stereo_index])
        if commit:
            db_conn.commit()

I have also do not know what does the complex_id will do for the easydock (maybe it will be convenient for your other program that use functions related to this). I have also found that EnumerateStereoisomer(mol) does not produce exceptions (I use the try-except to generate the isomer and differentiate complex_id) if it does not found any stereoisomer (e.g., NAD), which I am not sure if that would cause any other problem outside of my use case.

The SDF output did give the concatenated mol_id if that is okay for you.

DrrDom commented 4 months ago

Yes, this implementation is exactly what I thought about.

I would suggest to make the code compact and more errorprone (in your version if the main mol_id contains an underscore, a wrong stereo_id will be generated, e.g. CHEMBL234_5_2, where stereo_id is 2):

mol_id, stereo_id = mol_id.rsplit('_', 1)

Regarding the error in enumeration of stereoisomers. The error was occurred if a bond at a chiral center is labeled as any (wavy bond in visual editors). Previously it caused a Runtime error. I did not test latest versions of RDKit.

Feriolet commented 4 months ago

Oh, I never knew such function exists. I initially plan to use split and join to improve the code, but this is a better solution.

Anyway, I think that is pretty much done. Thanks for helping me adding the stereoisomer feature. I got the desired result for my use case. Let me know if you need me to send you the code (I have also modified the add_protonation function since I do not have chemaxon license, but it is a different mess on itself).

Btw, my use case is a single local machine with SMILES/sdf input with vina program. I have not tried other cases such as cluster machines and other docking programs, so i am not sure if you want the code from me haha.

Again, thank you so much for the help!

DrrDom commented 4 months ago

Please make a pull request to the main branch. Then I'll be able to review the code and we may finalize all these changes and test them.

Feriolet commented 4 months ago

Alright, I have created a pull request :D

DrrDom commented 4 months ago

Solved #21