OATML-Markslab / ProteinNPT

Official code repository for the paper "ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers"
MIT License

fine-tuning with more protein sequences #18

Open avilella opened 3 months ago

avilella commented 3 months ago

Hi, I have a corpus of about 500,000 protein sequences and would like to use them with existing models, such as ESM2 or this one, to predict the fitness effect of substituting one amino acid for another. How could I add my sequences to the models referenced in this repo and then use the modified model for this task? Thanks.

pascalnotin commented 2 months ago

Hi @avilella -- do you have property annotations for these 500k sequences, or just the amino acid sequences with no annotations? ProteinNPT is first and foremost a model that learns a joint distribution over sequences and their corresponding labels, so it is not well suited to your setting if there are no such labels. If you have no labels, you may be interested in the various zero-shot baselines we have integrated into the ProteinGym benchmark. Best, Pascal
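
For readers in the unlabeled setting: one common zero-shot approach with masked language models such as ESM2 is to score a substitution as the log-likelihood ratio between the mutant and wild-type residue at the mutated position. The sketch below only illustrates that arithmetic. It assumes the per-position log-probabilities have already been obtained from some model; `uniform_log_probs`, `zero_shot_score`, and `parse_mutation` are hypothetical names for illustration and are not part of ProteinNPT or ProteinGym.

```python
import math

# The 20 standard amino acids; the column order of `log_probs` is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_mutation(mutation):
    """Split a string such as 'A2G' into (wild type, 0-based position, mutant)."""
    wild_type = mutation[0]
    mutant = mutation[-1]
    position = int(mutation[1:-1]) - 1  # mutation strings are 1-indexed
    return wild_type, position, mutant


def zero_shot_score(sequence, mutation, log_probs):
    """Log-likelihood ratio log p(mutant) - log p(wild type) at the mutated site.

    `log_probs[i][j]` is assumed to hold the model's log-probability of
    amino acid AMINO_ACIDS[j] at position i of `sequence`.
    """
    wild_type, position, mutant = parse_mutation(mutation)
    if sequence[position] != wild_type:
        raise ValueError(
            f"{mutation}: expected {wild_type} at position {position + 1}, "
            f"found {sequence[position]}"
        )
    row = log_probs[position]
    return row[AMINO_ACIDS.index(mutant)] - row[AMINO_ACIDS.index(wild_type)]


def uniform_log_probs(length):
    """Hypothetical stand-in for a real model: uniform over residues everywhere."""
    return [[math.log(1.0 / len(AMINO_ACIDS))] * len(AMINO_ACIDS) for _ in range(length)]


sequence = "MAG"
log_probs = uniform_log_probs(len(sequence))

# Toy values (not renormalized): pretend the model prefers G over A at position 2.
log_probs[1][AMINO_ACIDS.index("G")] = math.log(0.5)
log_probs[1][AMINO_ACIDS.index("A")] = math.log(0.1)

score = zero_shot_score(sequence, "A2G", log_probs)
print(f"A2G score: {score:.4f}")  # a positive score means the model favors the mutant
```

In practice the log-probabilities would come from a pretrained model rather than a hand-filled table, and no labels are needed to compute the score, which is why this family of methods suits a corpus without annotations.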