dauparas / ProteinMPNN

Code for the ProteinMPNN paper
MIT License

Why does sequence design take protein sequences as inputs? #37

Closed yyou1996 closed 1 year ago

yyou1996 commented 1 year ago

Dear ProteinMPNN authors,

Thank you very much for your great efforts. I am having some trouble understanding the implementation conceptually. From my reading of the paper, the design task should take only the backbone structure as input and return the designed sequences.

However, the provided example (e.g. submit_example_3.sh, which runs directly from the .pdb path) actually takes both backbones and sequences as inputs and returns re-sampled sequences, if I am not misreading the code: https://github.com/dauparas/ProteinMPNN/blob/e61ecb7e3c32e630ff7a34d16c3a43fcf8f8a8bd/protein_mpnn_run.py#L302 https://github.com/dauparas/ProteinMPNN/blob/e61ecb7e3c32e630ff7a34d16c3a43fcf8f8a8bd/protein_mpnn_utils.py#L1057

I would appreciate it if you could confirm whether my understanding is correct. If it is, which script should I use to input only the backbone structure and get the designed sequences back?

data2code commented 1 year ago

Could the authors please comment on the above question? We are also curious why ProteinMPNN needs to peek at the sequence. Was it just to compute the sequence recovery rate at the end? Sequence prediction itself should not be influenced by the input sequence, right? Thanks!

dauparas commented 1 year ago

Hello! The model.forward function takes the input sequence because the model is autoregressive, i.e. it models p(AA_n | backbone, AA_1, AA_2, ..., AA_{n-1}). It uses the input sequence as context when the model is being trained or used to score a sequence. This is called teacher forcing. On the other hand, if you look at the model.sample function, in that case the model can generate new sequences without relying on the input sequence.
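For readers following along, here is a minimal sketch of the distinction between scoring with teacher forcing and sampling. It is not ProteinMPNN's code: `toy_conditional`, `score`, `sample`, the backbone features, and the example sequence are all hypothetical stand-ins, and the "model" is just a fixed formula that plays the role of p(AA_n | backbone, AA_1, ..., AA_{n-1}).

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_conditional(backbone, prefix):
    """Hypothetical stand-in for a trained network.

    Returns a distribution over ALPHABET for position n = len(prefix),
    conditioned on the backbone features and the residues in `prefix`.
    """
    n = len(prefix)
    prev = ALPHABET.index(prefix[-1]) if prefix else 0
    # Made-up logits that depend on the backbone and the previous residue.
    logits = [math.sin(backbone[n] * (k + 1) + 0.3 * prev)
              for k in range(len(ALPHABET))]
    norm = sum(math.exp(x) for x in logits)
    return [math.exp(x) / norm for x in logits]


def score(backbone, sequence):
    """Teacher forcing: every step is conditioned on the GIVEN prefix.

    This is why a forward/scoring pass needs a sequence as input.
    """
    assert len(sequence) == len(backbone)
    return sum(
        math.log(toy_conditional(backbone, sequence[:n])[ALPHABET.index(aa)])
        for n, aa in enumerate(sequence)
    )


def sample(backbone, rng):
    """Sampling: every step is conditioned on residues generated so far.

    No input sequence is needed, only the backbone.
    """
    designed = ""
    for _ in backbone:
        probs = toy_conditional(backbone, designed)
        designed += rng.choices(ALPHABET, weights=probs, k=1)[0]
    return designed


backbone = [0.1, 0.7, 1.3, 2.1, 0.4]  # made-up per-residue features
native = "MKVLA"                      # made-up input sequence

designed = sample(backbone, random.Random(0))
print("designed sequence:", designed)
print("log p(designed | backbone):", score(backbone, designed))
print("log p(native | backbone):", score(backbone, native))
```

In this sketch `sample` never sees `native`, while `score` cannot run without a sequence to condition on, which mirrors the forward-versus-sample distinction described above.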

data2code commented 1 year ago

Thank you so much!