ersilia-os / vsflow-ligand-based-3d-screening

Ligand-based virtual screening based on vsflow
GNU General Public License v3.0
0 stars 0 forks source link

Some query molecules fail #1

Closed GemmaTuron closed 8 hours ago

GemmaTuron commented 2 weeks ago

Hi @miquelduranfrigola

I've been playing around with the package. I am unable to understand why some query molecules fail to work, the error is in this piece of code from VSFlow

# group query molecules if they are continuous and have the same canonical SMILES
if args.input:
    if args.input.endswith(".sdf") or args.input.endswith(".sdf.gz"):
        confs = [(query[i]["mol"], i, query[i]["pattern"]) for i in query if query[i]["mol"].GetConformer().Is3D()]
        if confs:
            gr_confs = [list(group) for k, group in groupby(confs, lambda x: x[2])]
            # combine conformers if they belong to the same molecule, consider them as one query with different
            # conformers
            for gr in gr_confs:
                if len(gr) > 1:
                    for i in range(1, len(gr)):
                        query[gr[0][1]]["mol"].AddConformer(query[gr[i][1]]["mol"].GetConformer(), assignId=True)
                        query.pop(i)

and it pops up:

Traceback (most recent call last):
  File "/home/gturon/miniconda3/envs/vsflow/bin/vsflow", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/gturon/github/ersilia-os/VSFlow/vsflow", line 6, in <module>
    run.main()
  File "/home/gturon/github/ersilia-os/VSFlow/vslib/run.py", line 1569, in main
    args.func(args)
  File "/home/gturon/github/ersilia-os/VSFlow/vslib/run.py", line 901, in shape
    query.pop(i)

Aside from the example in the folder, you can try with these molecules: COc1ccc(cc1OC1CCCC1)[C@@H]1CNC(=O)C1,smi1 -- Does not work NCC@Hc1ccc(Cl)cc1,smi2 -- Does work

Use the smiles-to-3d to convert them to .sdf files and input those directly in the pipeline

GemmaTuron commented 2 weeks ago

I have identified the issue and it seems a bug on their end. It tries to put conformers together in the above loop, but, for example (and bear with me) inside the loop

                    for i in range(1, len(gr)):
                        query[gr[0][1]]["mol"].AddConformer(query[gr[i][1]]["mol"].GetConformer(), assignId=True)
                        query.pop(i)

the query is:

{0: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a25540>, 'pattern': 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1'}, 1: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a25460>, 'pattern': 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'}, 2: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a255b0>, 'pattern': 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'}, 3: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a25700>, 'pattern': 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'}, 4: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a25770>, 'pattern': 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1'}, 5: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a253f0>, 'pattern': 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'}}

so that the query[i] (always starting at 1, not 0) is the element of the dictionary 1 in the first iteration {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a25460>, 'pattern': 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'}

The gr is a list of the available conformers: [(<rdkit.Chem.rdchem.Mol object at 0x7e34c2a25460>, 1, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x7e34c2a255b0>, 2, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x7e34c2a25700>, 3, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1')]

If we look at what gr[0] is, that selects the first available conformer: (<rdkit.Chem.rdchem.Mol object at 0x7e34c2a25460>, 1, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'),

Therefore, the query[gr[0][1]] is the first conformer, position 1 == > 1 which in turn corresponds to the element in the query dictionary with key 1 query[gr[0][1]]: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a25460>, 'pattern': 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'}

it means that we are adding conformers to the conformer 1, not the 0. But if we proceed with the loop, we are popping out query[1], so in the next iteration, the element query[gr[0][1]], which corresponds to 1, will not exist and the pipeline will crash This is the query after the pop in the first iteration of the loop: {0: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a25540>, 'pattern': 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1'}, 2: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a255b0>, 'pattern': 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'}, 3: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a25700>, 'pattern': 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'}, 4: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a25770>, 'pattern': 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1'}, 5: {'mol': <rdkit.Chem.rdchem.Mol object at 0x7e34c2a253f0>, 'pattern': 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'}}

In summary, it has to do with how this conformer lists are created:

            confs = [(query[i]["mol"], i, query[i]["pattern"]) for i in query if query[i]["mol"].GetConformer().Is3D()]
            if confs:
                gr_confs = [list(group) for k, group in groupby(confs, lambda x: x[2])]

if they start at any number that is not 0, it will find the above problem

GemmaTuron commented 2 weeks ago

This is how the gr_confs are being created: [[(<rdkit.Chem.rdchem.Mol object at 0x729cc2039540>, 0, 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1')], [(<rdkit.Chem.rdchem.Mol object at 0x729cc2039460>, 1, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x729cc20395b0>, 2, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x729cc2039700>, 3, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1')], [(<rdkit.Chem.rdchem.Mol object at 0x729cc2039770>, 4, 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1')], [(<rdkit.Chem.rdchem.Mol object at 0x729cc20393f0>, 5, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1')]]

for gr in gr_confs:
    print(gr)

[(<rdkit.Chem.rdchem.Mol object at 0x729cc2039540>, 0, 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1')],

[(<rdkit.Chem.rdchem.Mol object at 0x729cc2039460>, 1, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x729cc20395b0>, 2, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x729cc2039700>, 3, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1')],

[(<rdkit.Chem.rdchem.Mol object at 0x729cc2039770>, 4, 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1')],

[(<rdkit.Chem.rdchem.Mol object at 0x729cc20393f0>, 5, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1')]`

The issue is obviiusly with the one that has len 3

Not clear if there is a solution for that error, seems too embbeded in their code

GemmaTuron commented 2 weeks ago

more findings on this. The issue is how is the code grouping the same molecules - I am tempted to comment out this piece of code so that all conformers are screened regardless of whether they represent the same molecule or not. But just for the sake of the argument in this thread:

confs = [(<rdkit.Chem.rdchem.Mol object at 0x75604f3ed0e0>, 0, 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x75604f3ed000>, 1, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x75604f3ed150>, 2, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x75604f3ed2a0>, 3, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x75604f3ed310>, 4, 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x75604f3ecf90>, 5, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1')]

so, len(confs) = 6

If we go onto the next step: gr_confs = [list(group) for k, group in groupby(confs, lambda x: x[2])]

This will group the tuples by the third element (the chemical structure string, x[2]) and create a list of these groups.
I see two SMILES: COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1 and COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1 but when I run the code I get 4 groups, not two (len(gr_confs)=4) [[(<rdkit.Chem.rdchem.Mol object at 0x75604f3ed0e0>, 0, 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1')], [(<rdkit.Chem.rdchem.Mol object at 0x75604f3ed000>, 1, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x75604f3ed150>, 2, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x75604f3ed2a0>, 3, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1')], [(<rdkit.Chem.rdchem.Mol object at 0x75604f3ed310>, 4, 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1')], [(<rdkit.Chem.rdchem.Mol object at 0x75604f3ecf90>, 5, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1')]]

According to chatGPT: The groupby function in Python's itertools module only groups adjacent elements. If the elements are not already sorted by the key you are grouping by, groupby will treat each discontinuous segment as a separate group.

Implementing the following, indeed we get two groups

confs_sorted = sorted(confs, key=lambda x: x[2])
gr_confs = [list(group) for k, group in groupby(confs_sorted, lambda x: x[2])]

2 [[(<rdkit.Chem.rdchem.Mol object at 0x79c0c85d93f0>, 1, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x79c0c85d9540>, 2, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x79c0c85d9690>, 3, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x79c0c85d9380>, 5, 'COc1ccc([C@@H]2CNC(=O)C2)cc1OC1CCCC1')], [(<rdkit.Chem.rdchem.Mol object at 0x79c0c85d94d0>, 0, 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1'), (<rdkit.Chem.rdchem.Mol object at 0x79c0c85d9700>, 4, 'COc1ccc([C@H]2CNC(=O)C2)cc1OC1CCCC1')]]

but we are still stuck in the next step when processing the groups.

GemmaTuron commented 2 weeks ago

Finally, even when I input a molecule that works, for some reason the query molecule has 2 valid conformers but onluy one of them is written into the final output. It has to do with this piece of code:

for j in query:
        counter = 0
        with open(f"{out_file}_{j + 1}.sdf", "w") as out:
            for i in results:
                if results[i]["q_num"] == j:
                    write_output.write_sdf_conformer(results[i]["mol"], results[i]["props"], results[i]["top_conf"],
                                                     out)
                    counter += 1
        if counter:
            with open(f"{out_file}_{j + 1}_query.sdf", "w") as out_q:
                for confid in range(query[j]["confs"].GetNumConformers()):
                    write_output.write_sdf_conformer(query[j]["confs"], {"Smiles": query[j]["pattern"]}, confid, out_q)
        else:
            os.remove(f"{out_file}_{j + 1}.sdf")
            print(f"No results found for query molecule {j}")

In particular the statement if results[i]["q_num"] == j: but I do not understand what this q_num stands for and why it is used to decide what gets saved

GemmaTuron commented 2 weeks ago

In any case, in the publication they state that:

The intended use case of the shape screening mode is to screen a database of compounds with multiple conformers (prepared e.g. using the preparedb functionality of VSFlow) and to use a query ligand in a single, bioactive conformation, e.g. from the pdb database.

So I suggest inputing a query molecule with only one conformer in the .sdf file. I've automatically implemented the selection of the best conformer based on docking if there is more than one conformer available in this pipeline This should suffice to resolve the problems?

GemmaTuron commented 2 weeks ago

@miquelduranfrigola

Since fixing this would take a substantial amount of time, do we agree that we will only use VSFlow with a query input that has one single conformer? If so I'll close this issue

miquelduranfrigola commented 2 weeks ago

Let's do this, yes. Thanks for a very detailed explanation