dauparas / ProteinMPNN

Code for the ProteinMPNN paper
MIT License

Criteria for selecting top-ranked sequence? #44

Closed: mlnance closed this issue 1 year ago

mlnance commented 1 year ago

Hello!

Context: I used ProteinMPNN to generate 500 designed sequences for a loop region in my structure. The info for the first three sequences is shown below. After the first five or so sequences, score and global_score no longer seem to be in any rank order, so I assume the sequences in the output .fa file are not ranked or reordered?

>T=0.1, sample=1, score=1.1318, global_score=1.6505, seq_recovery=0.1250
>T=0.1, sample=2, score=1.1756, global_score=1.6445, seq_recovery=0.1250
>T=0.1, sample=3, score=1.1776, global_score=1.6387, seq_recovery=0.1875

Question: Which metric should I be using to rank these sequences? score? global_score? Is a higher number better, or a lower number? Or should I be using an external metric for ranking? I do not have access to a GPU for accelerated AlphaFold modeling, so I need to narrow down the sequences to model beforehand.

Many thanks for your help!! Morgan

dauparas commented 1 year ago

Hello!

Yes, sequences are not ranked in the output.

The score is defined as the negative average log probability of the amino acids given the protein backbone, so lower is better. global_score reports this value over the whole protein, including any fixed residues and fixed chains (i.e., amino acid positions that are not designed), whereas score reports it over the designed residues only.
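To make the difference between the two numbers concrete, here is a minimal sketch of that definition. It is not the ProteinMPNN code itself: the function name `mean_nll`, the mask lists, and the per-residue log probabilities are hypothetical values chosen for illustration.

```python
import math


def mean_nll(log_probs, mask):
    """Negative average log probability over the positions where mask is 1."""
    selected = [lp for lp, m in zip(log_probs, mask) if m]
    if not selected:
        raise ValueError("mask selects no positions")
    return -sum(selected) / len(selected)


# Hypothetical log probabilities of the chosen amino acid at four positions.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]

designed_mask = [0, 1, 0, 1]  # 1 = position was redesigned (assumed layout)
all_mask = [1, 1, 1, 1]       # every position, fixed or designed

score = mean_nll(log_probs, designed_mask)   # designed residues only
global_score = mean_nll(log_probs, all_mask)  # whole protein

print(f"score={score:.4f}, global_score={global_score:.4f}")
```

In this toy case the designed positions are the least confident ones, so score comes out higher (worse) than global_score. In real outputs the relationship can go either way.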

I suggest ranking sequences using structure prediction methods. You could try running ESM-Fold (https://esmatlas.com/resources?action=fold). Alternatively, use global_score from the ProteinMPNN outputs.
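Since the output file is not sorted, ranking by global_score means parsing the headers yourself. A rough sketch follows, assuming each design is a `>`-prefixed header with `key=value` fields followed by a single sequence line, as in the headers quoted above. The helpers `parse_header` and `rank_designs` are hypothetical names, not part of ProteinMPNN, and the sequences are placeholders.

```python
import re

# Matches numeric key=value pairs such as "global_score=1.6505".
HEADER_FIELD = re.compile(r"(\w+)=([-+]?\d*\.?\d+)")


def parse_header(header):
    """Return the numeric fields of a FASTA header as a dict of floats."""
    return {key: float(value) for key, value in HEADER_FIELD.findall(header)}


def rank_designs(fasta_text, key="global_score"):
    """Return (fields, sequence) pairs sorted ascending by `key` (lower is better)."""
    records = []
    fields = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = parse_header(line[1:])
        elif fields is not None:
            # Keep only sampled designs; records without a sample number
            # (for example a reference sequence entry) are skipped.
            if "sample" in fields and key in fields:
                records.append((fields, line))
            fields = None
    return sorted(records, key=lambda record: record[0][key])


# Headers copied from the question; the sequences are placeholders, not real designs.
example = """\
>T=0.1, sample=1, score=1.1318, global_score=1.6505, seq_recovery=0.1250
AAAA
>T=0.1, sample=2, score=1.1756, global_score=1.6445, seq_recovery=0.1250
CCCC
>T=0.1, sample=3, score=1.1776, global_score=1.6387, seq_recovery=0.1875
DDDD
"""

ranked = rank_designs(example)
for fields, sequence in ranked:
    print(int(fields["sample"]), fields["global_score"], sequence)
```

Passing `key="score"` ranks by the designed residues only. Taking the first few entries of the ranked list gives a shortlist to send on to structure prediction.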

Best wishes, Justas

mlnance commented 1 year ago

Thanks for the reply and information, Justas! Cheers