facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Predicting over multiple chains/restricting design on part of the chain #188

Closed. gezmi closed this issue 2 years ago

gezmi commented 2 years ago

Hi,

I would like to be able to predict interfaces between proteins with the model. Is it possible to load multiple chains and/or only run predictions on part of them?

Thank you!

avilella commented 2 years ago

I am also interested in restricting to a segment or segments of the input, e.g. if --chain C, then restrict to certain coordinates of chain C. Is this what the <mask> notation is for?

tomsercu commented 2 years ago

Let me make sure I understand the two questions:

  1. "load multiple chains and/or only run predictions on part of them?" @gezmi
    • IIUC: no coordinate masking, but you want to sample sequence variations for only part of the sequence while keeping the rest fixed.
    • Seems reasonable to do.
    • Keep in mind that the sampling is autoregressive, so the residues you sample are not conditioned on the fixed suffix that follows them. You could use the total log-likelihood (including the suffix, or part of it) to judge the quality of your sampled part.
  2. @avilella "restricting to certain coordinates of chain C"
    • Sounds like you want to mask out the majority of the chain and predict for those positions?
    • Could technically be done; see the inverse folding readme.
    • However, see Fig. 4 of the paper for the performance deterioration with longer masked-out regions. Not sure the model will do anything reasonable if a lot of coordinates are masked out.
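
To illustrate the suggestion in point 1 (judging a partial redesign by the total log-likelihood rather than by the sampled region alone), here is a minimal sketch. It does not call the esm library. It assumes you have already obtained per-residue log-probabilities for each full candidate sequence from whatever scoring routine you use; the numbers, the names `design_a` / `design_b`, and the helper functions are hypothetical and for illustration only.

```python
import numpy as np

def log_likelihood(per_position_logprobs, start=0, end=None):
    """Sum per-residue log-probabilities over positions [start, end)."""
    return float(np.sum(np.asarray(per_position_logprobs)[start:end]))

def rank_candidates(candidates, start=0, end=None):
    """Return candidate names sorted from highest to lowest log-likelihood."""
    return sorted(
        candidates,
        key=lambda name: log_likelihood(candidates[name], start, end),
        reverse=True,
    )

# Hypothetical per-residue log-probs for two full-length candidates.
# Positions 0-1 are the redesigned region; positions 2-3 are the fixed suffix.
candidates = {
    "design_a": np.log([0.5, 0.5, 0.1, 0.1]),
    "design_b": np.log([0.4, 0.4, 0.3, 0.3]),
}

# Ranking by the redesigned region only (ignores the suffix).
region_ranking = rank_candidates(candidates, start=0, end=2)

# Ranking by the total log-likelihood (redesigned region plus suffix).
total_ranking = rank_candidates(candidates)
```

In this toy example the two rankings disagree: `design_a` looks better on the redesigned region alone, but `design_b` is more compatible with the fixed suffix and wins on the total score. That is the situation the autoregressive caveat warns about.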

* Paper Hsu et al. 2022.
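
For point 2, a rough sketch of masking out a span of backbone coordinates with NumPy before passing them to a model. This is an assumption-laden illustration, not the library's API: it assumes coordinates are stored as an array of shape `(L, 3, 3)` (residues × backbone atoms × xyz) and that masked positions are marked with `np.inf`. Check the inverse folding readme for the fill value and shape the model actually expects. The helper names `mask_span` and `masked_fraction` are hypothetical.

```python
import numpy as np

def mask_span(coords, start, end, fill=np.inf):
    """Return a copy of coords with residues [start, end) replaced by `fill`.

    `fill=np.inf` is an assumed mask marker; verify it against the
    inverse folding readme before use.
    """
    masked = np.array(coords, dtype=np.float32, copy=True)
    masked[start:end] = fill
    return masked

def masked_fraction(coords):
    """Fraction of residues with at least one non-finite coordinate."""
    residue_ok = np.isfinite(coords).all(axis=(1, 2))
    return float(np.mean(~residue_ok))

# Hypothetical 10-residue backbone (all zeros, just to show the shapes).
coords = np.zeros((10, 3, 3), dtype=np.float32)

# Mask the first 7 residues and keep the last 3 as structural context.
masked = mask_span(coords, 0, 7)
fraction = masked_fraction(masked)
```

Checking `fraction` before running predictions gives a quick sense of how much of the chain is masked, which matters given the reported deterioration with longer masked-out regions.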