gcorso / DiffDock

Implementation of DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking
https://arxiv.org/abs/2210.01776
MIT License

Mismatches between lm_embeddings order and structure chains order #58

Open vtarasv opened 1 year ago

vtarasv commented 1 year ago

While trying to reproduce the training process, I found that at some point during dataset preparation the order of the lm_embeddings stops matching the order of the corresponding chains (for some proteins with multiple chains in the structure). The example I found is the protein from the 3doz complex, where the lm_embeddings are concatenated in the chain order [D, B, A] while all other protein graph features are in the order [A, B, D]. I believe this happens because of this part: https://github.com/gcorso/DiffDock/blob/8e853d6b14fb57baf90fa8529349117439f06819/datasets/pdbbind.py#L133-L141, which does not guarantee the same order as the order of chains in the .pdb file.
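To illustrate the kind of mismatch described above, here is a minimal sketch, not the repository's actual code. It assumes embeddings are stored in a dict keyed as `<complex>_<chain>` (a hypothetical key format), uses strings in place of embedding tensors, and introduces two hypothetical helpers, `group_in_dict_order` and `align_to_chain_order`. Grouping by dict iteration order reproduces whatever order the embeddings were saved in, while an explicit lookup by chain ID follows the order of the structure.

```python
from collections import defaultdict


def group_in_dict_order(id_to_embedding):
    """Group (chain, embedding) pairs per complex, in dict iteration order.

    Assumes hypothetical keys of the form '<complex>_<chain>'.
    """
    grouped = defaultdict(list)
    for key, embedding in id_to_embedding.items():
        name, chain = key.rsplit("_", 1)
        grouped[name].append((chain, embedding))
    return grouped


def align_to_chain_order(chain_embedding_pairs, structure_chain_order):
    """Return embeddings reordered to follow the chain order of the structure."""
    by_chain = dict(chain_embedding_pairs)
    missing = [c for c in structure_chain_order if c not in by_chain]
    if missing:
        raise KeyError(f"no embedding found for chains: {missing}")
    return [by_chain[c] for c in structure_chain_order]


# Embeddings happen to be stored in the order D, B, A (strings stand in for tensors).
id_to_embedding = {"3doz_D": "emb_D", "3doz_B": "emb_B", "3doz_A": "emb_A"}
grouped = group_in_dict_order(id_to_embedding)

# Naive use: the storage order leaks through and disagrees with the structure.
naive = [embedding for _, embedding in grouped["3doz"]]

# Explicit alignment against the chain order read from the .pdb file.
aligned = align_to_chain_order(grouped["3doz"], ["A", "B", "D"])

print(naive)    # ['emb_D', 'emb_B', 'emb_A']
print(aligned)  # ['emb_A', 'emb_B', 'emb_D']
```

Looking embeddings up by chain ID, rather than relying on the order in which they were saved, makes the alignment independent of how the embedding file was produced.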

HannesStark commented 1 year ago

Thanks a lot for finding this and letting us know! We will push a fix soon, together with some other improvements, rather than immediately, so as not to break compatibility with the currently provided weights.

JacekKedzierski commented 9 months ago

It seems that the error occurs when one tries to retrain the model on one's own complexes whose structure names contain an '_' (underscore). In such a case, key_name = key.split('_')[0] assigns the wrong value to key_name.
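A small sketch of the underscore problem, assuming (hypothetically) that embedding keys have the form `<complex name>_<chain suffix>`. Both helper names are made up for illustration. Taking the first field of `split('_')` truncates any complex name that itself contains an underscore, whereas splitting from the right removes only the suffix.

```python
def name_from_first_field(key):
    """Mirrors key.split('_')[0]: keeps only the text before the first underscore."""
    return key.split("_")[0]


def name_without_suffix(key, n_suffix_fields=1):
    """Strip only the trailing suffix fields, keeping underscores in the name.

    n_suffix_fields is an assumption about how many '_'-separated fields
    the suffix has; adjust it to the real key format.
    """
    return key.rsplit("_", n_suffix_fields)[0]


# A PDB-style name without underscores: both approaches agree.
print(name_from_first_field("3doz_A"))        # 3doz
print(name_without_suffix("3doz_A"))          # 3doz

# A custom name containing an underscore: the first approach truncates it.
print(name_from_first_field("my_complex_A"))  # my
print(name_without_suffix("my_complex_A"))    # my_complex
```

With truncated names, embeddings from different custom complexes sharing a prefix could end up grouped under the same key, or not matched to any complex at all.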