gcorso / DiffDock

Implementation of DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking

https://arxiv.org/abs/2210.01776

MIT License

error with pdb file predicted by AlphaFold2 #65

Closed FeiLiuEM closed 1 year ago

FeiLiuEM commented 1 year ago

I have a problem with a PDB file predicted by AlphaFold2.

I am using a structure predicted by AlphaFold2. I tried several ways of preparing it, such as pymol.cmd (h_add, fix_chemistry, etc.).

However, when I run the command

python -m inference --protein_ligand_csv data/protein_ligand_example_csv.csv --out_dir results/user_predictions_small --inference_steps 12 --samples_per_complex 12 --batch_size 4 --no_final_step_noise

it always reports this error:

LM embeddings for complex data/ALPHAFOLD2_V4_MODIFIED/107.pdb____COc(cc1)ccc1C#N did not have the right length for the protein. Skipping.

I don't know how to solve this. I could only trace the error to datasets/pdbbind.py, but the code is too complex for me. Could someone tell me how to solve the problem? @duerrsimon @gcorso @HannesStark @bjing2016

FeiLiuEM commented 1 year ago

In our research, this problem occurs for many proteins. I've been stuck here for a week. I need help ╥﹏╥

HannesStark commented 1 year ago

This error occurs if the number of amino acids in the input file is not the same as the number of amino acids in the ESM embeddings. One possible cause is that you constructed the ESM embeddings not from the AF2 structure file but from the original sequence that you used as input to AlphaFold, and that sequence has a different number of amino acids from the generated AF2 structure file. Make sure that you are using datasets/pdbbind_lm_embedding_preparation.py to create your embeddings.

If this is not what you did and not the issue, then I would recommend looking into why this function https://github.com/gcorso/DiffDock/blob/fff8f0b5eb98a49980553096fdd283c27f8cf022/datasets/process_mols.py#L152 has different output lengths from the generated embeddings.
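As a first-pass check of the length mismatch described above, the number of residues per chain in the structure file can be compared with the number of per-residue embeddings. The following is a minimal sketch, not DiffDock's own code: the function names count_residues_per_chain and find_length_mismatches are hypothetical, the PDB text is assumed to follow the standard fixed-column layout, and the embedding lengths are assumed to be supplied by the reader as a plain dictionary keyed by chain ID. DiffDock's own parser may filter residues further, so a match here does not guarantee a match in the pipeline.

```python
from collections import OrderedDict


def count_residues_per_chain(pdb_text):
    """Count distinct residues per chain using only ATOM records.

    Assumes standard PDB fixed columns: chain ID in column 22,
    residue number in columns 23-26, insertion code in column 27.
    HETATM records (waters, ligands) are ignored.
    """
    seen = OrderedDict()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21:22]
        # Residue identity = residue number plus insertion code.
        residue_key = (line[22:26].strip(), line[26:27])
        seen.setdefault(chain_id, set()).add(residue_key)
    return {chain: len(residues) for chain, residues in seen.items()}


def find_length_mismatches(residue_counts, embedding_lengths):
    """Return {chain: (n_residues, n_embeddings)} for chains that disagree.

    embedding_lengths is a hypothetical dict {chain_id: length} that the
    caller fills in from their own embedding files; a chain with no entry
    is reported with None as its embedding length.
    """
    mismatches = {}
    for chain, n_residues in residue_counts.items():
        n_embeddings = embedding_lengths.get(chain)
        if n_embeddings != n_residues:
            mismatches[chain] = (n_residues, n_embeddings)
    return mismatches
```

To use it, read the PDB file into a string, pass it to count_residues_per_chain, and compare the result with the lengths of the embeddings you generated. Any chain listed by find_length_mismatches is a candidate for the "did not have the right length" error.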

FeiLiuEM commented 1 year ago

Thanks a million!

I still have a small question.

In the DiffDock readme.md, python datasets/esm_embedding_preparation.py is used to create the embeddings, while datasets/pdbbind_lm_embedding_preparation.py is used for retraining DiffDock.

Does this mean that DiffDock needs to be retrained for every predicted protein that is not included in data/PDBBind_processed?

FeiLiuEM commented 1 year ago

We are sorry, but it still fails for the same reason.

I was wondering if you could help me. @HannesStark

The protein was downloaded from here, and the ligand is COc(cc1)ccc1C#N.

It still failed after we used datasets/pdbbind_lm_embedding_preparation.py.

Thank you very much indeed!

HannesStark commented 1 year ago

The only difference between datasets/esm_embedding_preparation.py and datasets/pdbbind_lm_embedding_preparation.py is that the former creates embeddings from CSV files for inference, while the latter is an easy setup for preprocessing all the training data. There is no need to retrain DiffDock.

I will have a look at your protein and ligand.

HannesStark commented 1 year ago

I am not sure where the issue is. I had no trouble running inference with the .pdb file and SMILES you provided.

Please try following the instructions in the README with the PDB file that you provided, and feel free to reopen the issue if it still does not work. The results are attached.

GithubIssueAFresults.zip

FeiLiuEM commented 1 year ago

@HannesStark

Dear Prof. Hannes,

Thanks a million!

DiffDock is an excellent automated AI tool for ligand evaluation. Its computational efficiency and success rate are both high, which is very special.

I'm a clinician in China. I'm not good at programming.

I have run into many problems. For example, DiffDock describes at least three different conda environments in different places: environment.yml, readme.md, and the Google Colab notebook. Among them, the line !pip install torch==1.12.1+cu113 --quiet reports an error in the Google Colab notebook. I finally solved the environment problem, but as a clinician I don't have enough time to solve the other problems. It's past 00:00 in China now, and I have to get up in 6 hours to go to the hospital.

Could you please write down the steps you used to produce GithubIssueAFresults.zip? We don't need the steps for setting up the conda environment, just the whole process of calculating the result.

Please

Best wishes!

Fei Liu, Nanjing University School of Medicine