facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License
3.26k stars 643 forks

calculate the log likelihood of 2 chains vs a 3rd? #195

Closed avilella closed 2 years ago

avilella commented 2 years ago

Is it possible to use the log-likelihood script to calculate the joint log likelihood of two input FASTA chains against a third chain in a PDB file?

E.g., if the PDB has chains H, L, and A, could we feed the script the FASTA sequences of a query H and a query L (each the same length as the corresponding chain in the PDB), and then get their likelihood against chain A in the PDB?

If the answer is "not easily", could we instead rewrite the PDB so that H and L appear to be a single chain, for example by renaming both to "HL" and adding a link between the two in the PDB file, and then run the log-likelihood script on "HL" against the edited PDB?

Thanks

tomsercu commented 2 years ago

That seems like a totally reasonable thing to do; it's just a matter of setting it up right. For one way to do this, see the subsection on protein complexes in the paper (Hsu et al. 2022): the chains are concatenated with 10 mask tokens between them. Other creative approaches may be possible; feel free to share and discuss them in the Discussions tab of this repo! Also note that the model would be predicting in a regime it has not been trained in; see the comment in Table 4 of the paper and the associated section.
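As a rough illustration of the concatenation idea, here is a minimal pure-Python sketch. It does not call the esm library: `concat_chains` and `joint_log_likelihood` are hypothetical helper names, the `"<mask>"` string is an assumed spelling of the mask token, and the per-token log likelihoods are assumed to come from whatever scoring script you run on the concatenated input. The backbone coordinates would need matching padding between chains, which is not shown here.

```python
from typing import Dict, List, Sequence, Tuple

MASK = "<mask>"   # assumed spelling of the mask token (check the model's alphabet)
SPACER_LEN = 10   # number of mask tokens between chains, as described above


def concat_chains(
    chains: Dict[str, str],
    order: List[str],
    spacer_len: int = SPACER_LEN,
) -> Tuple[List[str], Dict[str, range]]:
    """Join chains into one token list, separated by mask-token spacers.

    Returns the token list and, for each chain ID, the range of token
    positions it occupies in the concatenated list.
    """
    tokens: List[str] = []
    spans: Dict[str, range] = {}
    for i, chain_id in enumerate(order):
        if i > 0:
            tokens.extend([MASK] * spacer_len)  # spacer between consecutive chains
        start = len(tokens)
        tokens.extend(chains[chain_id])         # one token per residue letter
        spans[chain_id] = range(start, len(tokens))
    return tokens, spans


def joint_log_likelihood(
    per_token_ll: Sequence[float],
    spans: Dict[str, range],
    score_chains: List[str],
) -> float:
    """Sum per-token log likelihoods over the chosen chains only.

    Spacer positions and the remaining chains are left out of the sum.
    """
    return sum(per_token_ll[i] for c in score_chains for i in spans[c])


# Toy sequences, for illustration only (not real antibody/antigen chains).
chains = {"H": "EVQ", "L": "DI", "A": "MKT"}
tokens, spans = concat_chains(chains, order=["H", "L", "A"])

# Placeholder scores; in practice these would be the model's per-position
# log likelihoods for the concatenated sequence.
fake_ll = [-1.0] * len(tokens)
hl_score = joint_log_likelihood(fake_ll, spans, ["H", "L"])
```

Summing only over the H and L positions gives a joint score for the two query chains conditioned on the full complex, while chain A and the spacers contribute context but no terms to the sum.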