gcorso / DiffDock

Implementation of DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking
https://arxiv.org/abs/2210.01776
MIT License

Question about esm_embeddings #59

Open Alue111 opened 1 year ago

Alue111 commented 1 year ago

When I run python datasets/esm_embedding_preparation.py --protein_ligand_csv data/protein_ligand_example_csv.csv --out_file data/prepared_for_esm.fasta, I get this notice:

encountered unknown AA:  TPO  in the complex  /raid/ligl/data/data/PDBBind_processed/4yo6/4yo6_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/4yo6/4yo6_protein_processed.pdb . Replacing it with a dash - .
  0%|                                        | 39/16379 [00:03<20:34, 13.24it/s]encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  TPO  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
  0%|▏                                       | 79/16379 [00:07<30:56,  8.78it/s]encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/3f2a/3f2a_protein_processed.pdb . Replacing it with a dash - .
  1%|▏                                       | 97/16379 [00:09<25:34, 10.61it/s]encountered unknown AA:  PCA  in the complex  /raid/ligl/data/data/PDBBind_processed/5t1k/5t1k_protein_processed.pdb . Replacing it with a dash - .
  1%|▏                                       | 99/16379 [00:09<24:53, 10.90it/s]encountered unknown AA:  SEP  in the complex  

When I build the dataset with esm2_3billion_embeddings.pt, I get this notice:

loading complexes 9/17:  52%|████████▎       | 523/1000 [03:39<04:09,  1.91it/s]Skipping 2pcp because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  53%|████████▍       | 526/1000 [03:40<02:35,  3.04it/s]Skipping 4do4 because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  53%|████████▍       | 528/1000 [03:40<02:18,  3.40it/s]Skipping 1rri because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  53%|████████▍       | 530/1000 [03:41<03:11,  2.45it/s]Skipping 4k9g because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  54%|████████▌       | 539/1000 [03:44<02:45,  2.78it/s]Skipping 2qwd because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  54%|████████▋       | 540/1000 [03:45<02:50,  2.70it/s]Skipping 1hqh because of the error:
Encountered valid chain id that was not present in the LM embeddings

This happens for both the train set and the test set, and many items are skipped. Is this normal? How can I fix this error?

gaylong9 commented 1 year ago

Same problem. Have you found a solution to this issue yet? 😞

RJ3 commented 1 year ago

Probably a typo of some sort. TPO and SEP are not standard amino acid codes, so they got skipped. Then, when the FASTA files are processed later, the discrepancy is found.

JuLieAlgebra commented 1 year ago

Same problem. It says it's skipping every single complex.

HannesStark commented 1 year ago

Are you sure that you are using the same .fasta file for both steps? @JuLieAlgebra Would you be able to describe the specific setting and procedure you run for one of the proteins where the issue occurs?

JacekKedzierski commented 7 months ago

It seems that the error occurs when one tries to retrain the model with one's own complexes that contain an '_' (underscore) in the name of the structure. In such a case, key_name = key.split('_')[0] assigns the wrong value to key_name.
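A minimal sketch of the failure mode described above, assuming the intended expression is key.split('_')[0] and that embedding keys look like "<complex name>_chain_<index>". The key format, the function names, and the '_chain_' marker are illustrative assumptions, not confirmed against the repository.

```python
def complex_name_naive(key: str) -> str:
    # Keep everything before the FIRST underscore, as in the quoted expression.
    return key.split('_')[0]


def complex_name_safe(key: str, chain_marker: str = '_chain_') -> str:
    # Hypothetical alternative: strip only the trailing "<chain_marker><index>"
    # suffix, so underscores inside the complex name survive.
    return key.rsplit(chain_marker, 1)[0]


if __name__ == '__main__':
    # '4yo6' has no underscore in its name; 'my_complex' does.
    for key in ['4yo6_chain_0', 'my_complex_chain_0']:
        print(key, '->', complex_name_naive(key), '|', complex_name_safe(key))
```

With a name such as my_complex, the naive version yields "my", which would not match any complex name, so every chain of that complex would look like it is missing from the embeddings. Renaming the structures to avoid underscores would sidestep the problem without touching the code.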

kirmedvedev commented 1 month ago

SEP is phosphorylated SER and TPO is phosphorylated THR. I am wondering if anyone can explain how these kinds of amino acids are handled by DiffDock. Does DiffDock just skip them? What will happen if SEP is renamed to SER? Will it be handled just as SER?
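To make the question concrete at the sequence level only, here is a rough sketch. It assumes a three-letter to one-letter lookup that falls back to a dash for unknown residue names, as the log messages in the first post describe. The table, the PARENT mapping, and to_sequence are hypothetical names for illustration, not DiffDock's actual code, and the sketch says nothing about how the structure itself is featurized.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
}

# Parent residues for the two phosphorylated residues named in this thread.
PARENT = {'SEP': 'SER', 'TPO': 'THR'}


def to_sequence(resnames, rename_modified=False):
    """Build a one-letter sequence; unknown residue names become '-'."""
    letters = []
    for name in resnames:
        if rename_modified:
            # Optionally map a modified residue back to its parent first.
            name = PARENT.get(name, name)
        letters.append(THREE_TO_ONE.get(name, '-'))
    return ''.join(letters)


if __name__ == '__main__':
    residues = ['GLY', 'SEP', 'TPO', 'ALA']
    print(to_sequence(residues))                        # dashes for SEP and TPO
    print(to_sequence(residues, rename_modified=True))  # SEP as S, TPO as T
```

Under these assumptions, renaming SEP to SER would give the language model an ordinary serine at that position instead of a dash. Whether the remaining phosphate atoms in the PDB file cause trouble elsewhere is a separate question that this sketch does not address.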