gcorso / DiffDock

Implementation of DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking
https://arxiv.org/abs/2210.01776
MIT License

Errors encountered when evaluating results #88

Open AbhilashMathews opened 1 year ago

AbhilashMathews commented 1 year ago

Running inference.py appears to work as expected on the provided example, i.e.

python -m inference --protein_ligand_csv data/protein_ligand_example_csv.csv --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise

but when I run evaluate_files.py on this sample output, errors arise when reading the molecules and locating the directories for the complexes (all of which are in data/PDBBind_processed after being downloaded from Zenodo and unzipped). Do you know why these errors arise on these seemingly standard inputs, and how to fix them? An excerpt from the error output is shown below:

(diffdock) [abhi@gpu-1-dy-g4ad4xlarge-1 DiffDock]$ python evaluate_files.py --results_path results/user_predictions_small --file_to_exclude rank1.sdf --num_predictions 40
Reading paths and names.
  0%|                                                      | 0/363 [00:00<?, ?it/s]Can't kekulize mol.  Unkekulized atoms: 7 8 9 10 11
RDKit was unable to read the molecule.
Using the .sdf file failed. We found a .mol2 file instead and are trying to use that.
Did not find a directory for  6qqw . We are skipping that complex
Did not find a directory for  6d08 . We are skipping that complex
Did not find a directory for  6jap . We are skipping that complex
Did not find a directory for  6np2 . We are skipping that complex
Did not find a directory for  6uvp . We are skipping that complex
Did not find a directory for  6oxq . We are skipping that complex
Did not find a directory for  6jsn . We are skipping that complex
Did not find a directory for  6hzb . We are skipping that complex
Can't kekulize mol.  Unkekulized atoms: 7 8 9 10 11
RDKit was unable to read the molecule.
Using the .sdf file failed. We found a .mol2 file instead and are trying to use that.
Did not find a directory for  6qrc . We are skipping that complex
Did not find a directory for  6oio . We are skipping that complex
Did not find a directory for  6jag . We are skipping that complex
Can't kekulize mol.  Unkekulized atoms: 0 1 2 3 4 5 14 15 16
RDKit was unable to read the molecule.
Using the .sdf file failed. We found a .mol2 file instead and are trying to use that.
Did not find a directory for  6moa . We are skipping that complex
Did not find a directory for  6hld . We are skipping that complex
Did not find a directory for  6i9a . We are skipping that complex
Did not find a directory for  6e4c . We are skipping that complex
Did not find a directory for  6g24 . We are skipping that complex
Did not find a directory for  6jb4 . We are skipping that complex
Did not find a directory for  6s55 . We are skipping that complex
  5%|██▏                                         | 18/363 [00:00<00:01, 175.38it/s]Did not find a directory for  6seo . We are skipping that complex
Can't kekulize mol.  Unkekulized atoms: 12 13 14 15 16 17 18 20 21
RDKit was unable to read the molecule.
Using the .sdf file failed. We found a .mol2 file instead and are trying to use that.
Did not find a directory for  6dyz . We are skipping that complex
Did not find a directory for  5zk5 . We are skipping that complex
Did not find a directory for  6jid . We are skipping that complex
Did not find a directory for  5ze6 . We are skipping that complex
...

This may be related to an earlier warning that appeared while generating the language model embeddings:

(diffdock) [abhi@gpu-1-dy-g4ad4xlarge-7 diffdock]$ python datasets/pdbbind_lm_embedding_preparation.py
  0%|                                        | 10/19120 [00:00<22:45, 14.00it/s]encountered unknown AA:  PTR  in the complex  3kxz . Replacing it with a dash - .
  0%|                                        | 12/19120 [00:00<22:11, 14.35it/s]encountered unknown AA:  TPO  in the complex  1re8 . Replacing it with a dash - 
...

gaylong9 commented 1 year ago

I have encountered the same problem. Have you found a solution to this issue yet?

AbhilashMathews commented 1 year ago

Not yet. I have not explored solutions for this issue any further.

xuzhang5788 commented 1 year ago

I also have the same errors. Hopefully, this issue can be solved soon.

JuLieAlgebra commented 1 year ago

Same issue

Xu-kexin commented 1 year ago

Maybe check the input directory: each folder should be named with an ID like '6q36', rather than a number or something else.

HannesStark commented 1 year ago

This sounds to me like there were issues when running inference, so the results were never placed in the "--out_dir results/user_predictions_small" directory. They are then missing from the list when listdir lists that directory, and that message is printed.

Would you mind checking whether the issue occurred during inference and the results were never placed in "results/user_predictions_small"? If that is the case, it would be useful to see the inference error that causes the complex to be skipped.
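A minimal sketch of that check, assuming (as described above) that the evaluation looks for one sub-directory per complex name in the results directory. The function name `missing_complexes` and the `expected_names` list are hypothetical and not part of DiffDock; the demo uses a temporary directory in place of results/user_predictions_small.

```python
import os
import tempfile


def missing_complexes(results_path, expected_names):
    """Return the expected complex names that have no sub-directory in results_path."""
    present = {
        entry
        for entry in os.listdir(results_path)
        if os.path.isdir(os.path.join(results_path, entry))
    }
    return sorted(set(expected_names) - present)


# Demo on a throwaway directory (hypothetical layout, not real inference output).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("6qqw", "0", "1"):
        os.mkdir(os.path.join(tmp, name))
    demo_missing = missing_complexes(tmp, ["6qqw", "6d08", "6jap"])
    print("Missing complexes:", demo_missing)
```

If every expected name comes back as missing, the problem is more likely the folder naming or the output location than individual failed complexes.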

maartensandbox commented 9 months ago

If I look in my results dir, I see that the inferred results are written to numbered folders (there are directories called 0, 1, ...). The evaluate_files.py script, however, assumes that these folders have been named using a different scheme. See also issue https://github.com/gcorso/DiffDock/issues/125
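A rough way to see which naming scheme a results directory uses is to classify its folder names. This is only a sketch: the helper names `classify_folder` and `summarize` are hypothetical, and the pattern for PDB-style IDs (one digit followed by three alphanumeric characters, as in 6qqw) is an assumption about what the evaluation expects, not something taken from the DiffDock code.

```python
import re
from collections import Counter

# Assumed shape of a PDB-style ID such as "6qqw": a digit plus three alphanumerics.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def classify_folder(name):
    """Label a folder name as a numeric index, a PDB-style ID, or something else."""
    if name.isdigit():
        return "index"  # e.g. "0", "1", ... as seen in the results dir
    if PDB_ID.fullmatch(name):
        return "pdb_id"
    return "other"


def summarize(names):
    """Count how many folder names fall into each category."""
    return dict(Counter(classify_folder(n) for n in names))


# Hypothetical folder listing; in practice pass os.listdir(results_path).
print(summarize(["0", "1", "2", "6qqw"]))
```

A listing dominated by "index" entries would match the behaviour described here, where the evaluation then reports that it did not find a directory for each complex.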