Create basic eval script for ProteinGym benchmark

OpenBioML / protein-lm-scaling


Create basic eval script for ProteinGym benchmark #4

Closed pascalnotin closed 12 months ago

Muedi commented 1 year ago

What do we want as a 'basic' eval script? Should I try to make a script that downloads and preprocesses the CSVs, then uses ESM-2 as a placeholder, takes its final hidden states, and feeds them to one row of logits? Would this work as a zero-shot test? Any suggestions in this direction would be welcome; so far I have only used my own models or pretrained ones as is :)

If the script runs, we can later swap in our own models and have a comparison with ESM already built in :)
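To make the "final hidden states into one row of logits" idea concrete, here is a minimal sketch, assuming mean pooling over residues and a ridge-regularized linear head. The random arrays stand in for ESM-2 hidden states, and `mean_pool`, `fit_ridge_head`, and `predict` are hypothetical names, not functions from this repo or from ESM. Note that fitting the head needs labeled assay scores.

```python
import numpy as np


def mean_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Average per-residue hidden states (length x dim) into one vector (dim,)."""
    return hidden_states.mean(axis=0)


def fit_ridge_head(features: np.ndarray, targets: np.ndarray, l2: float = 1.0):
    """Closed-form ridge regression: one weight row plus a bias, one scalar per sequence."""
    x_mean, y_mean = features.mean(axis=0), targets.mean()
    xc, yc = features - x_mean, targets - y_mean
    weights = np.linalg.solve(xc.T @ xc + l2 * np.eye(xc.shape[1]), xc.T @ yc)
    bias = y_mean - x_mean @ weights
    return weights, bias


def predict(features: np.ndarray, weights: np.ndarray, bias: float) -> np.ndarray:
    """Apply the linear head to pooled features."""
    return features @ weights + bias


# Stand-in data (assumption): random "hidden states" for 20 variants, hidden size 8.
rng = np.random.default_rng(0)
lengths = rng.integers(10, 30, size=20)
hidden = [rng.normal(size=(int(n), 8)) for n in lengths]
features = np.stack([mean_pool(h) for h in hidden])

# Synthetic assay scores that are exactly linear in the pooled features.
true_weights = rng.normal(size=8)
targets = features @ true_weights + 0.5

weights, bias = fit_ridge_head(features, targets, l2=1e-8)
```

With real embeddings, `features` would be built from the model's last-layer outputs for each mutated sequence, and the head would be fit on a training split of the assay and evaluated on held-out variants.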

pascalnotin commented 1 year ago

Thanks @Muedi! What you suggest is great for the semi-supervised property prediction setting.

For the pure zero-shot setting with ESM models, we could use the ESM-1v masked-marginal approach as described here: https://github.com/facebookresearch/esm/blob/main/examples/variant-prediction/predict.py (that's what we've been using when reporting the corresponding model performance in ProteinGym). It works nearly out of the box for the single-sequence-only ESM models (e.g., ESM1b, ESM1v, ESM2), except for sequences that are longer than the context length (i.e., longer than 1023 AAs, if we count the BOS token).
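As a rough illustration of masked-marginal scoring: for each substitution, the position is masked in the wild-type sequence and the score is log p(mutant residue) minus log p(wild-type residue), summed over substitutions. The sketch below is not the linked ESM script. `log_probs_fn` is a hypothetical interface to a masked language model, `toy_log_probs` is a made-up stand-in so the example runs without any model, and the colon-separated mutant format (e.g. `K2R:A4G`) is an assumption.

```python
import math
from typing import Callable, Dict, Tuple

# Hypothetical interface: given the wild-type sequence and a 0-based position,
# return log p(amino acid | sequence with that position masked).
MaskedLogProbs = Callable[[str, int], Dict[str, float]]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_mutation(mutation: str, offset: int = 1) -> Tuple[str, int, str]:
    """Split e.g. 'K2R' into (wild-type residue, 0-based index, mutant residue)."""
    return mutation[0], int(mutation[1:-1]) - offset, mutation[-1]


def masked_marginal_score(
    wt_seq: str, mutant: str, log_probs_fn: MaskedLogProbs, offset: int = 1
) -> float:
    """Sum of log p(mutant) - log p(wild type) over all substitutions in `mutant`."""
    score = 0.0
    for mutation in mutant.split(":"):
        wt, idx, mt = parse_mutation(mutation, offset)
        if wt_seq[idx] != wt:
            raise ValueError(
                f"{mutation}: expected {wt} at position {idx + offset}, found {wt_seq[idx]}"
            )
        log_probs = log_probs_fn(wt_seq, idx)  # the model masks position idx
        score += log_probs[mt] - log_probs[wt]
    return score


def toy_log_probs(sequence: str, position: int) -> Dict[str, float]:
    """Made-up stand-in for a real model: 0.62 on the wild-type residue, 0.02 on each other one."""
    wt = sequence[position]
    return {aa: math.log(0.62 if aa == wt else 0.02) for aa in AMINO_ACIDS}


single = masked_marginal_score("MKTAY", "K2R", toy_log_probs)
double = masked_marginal_score("MKTAY", "K2R:A4G", toy_log_probs)
```

With a real model, `log_probs_fn` would run one forward pass per masked position and apply a log-softmax over the vocabulary at that position. Higher scores mean the model finds the variant more plausible relative to wild type.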

Muedi commented 1 year ago

Is the code that adapts it to ProteinGym available too?

Then I'll gladly take that and put the script together, perhaps with a true/false flag for whether we want zero-shot or fine-tuned results? Otherwise I'll adapt what you linked already :)

pascalnotin commented 1 year ago

Sounds good regarding the binary flag. Our code is not yet open source, but the ESM script should be good enough for testing things out (95% of assay sequences are below the 1023 threshold). Note that this script will only be relevant for masked language modeling models, not the autoregressive (AR) models we intend to train. So we will need another script parameter to choose the model type (ESM vs. AR), and the code will then use the relevant zero-shot scoring function/utils.
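One possible shape for the two script parameters discussed here, as a sketch only: a binary flag for fine-tuned vs. zero-shot, and a model-type choice that selects the zero-shot scoring function. All flag names (`--model-type`, `--fine-tuned`, `--assay`) and the placeholder scorers are hypothetical and do not reflect the script that was eventually merged.

```python
import argparse
from typing import Callable, Dict, List, Optional


def score_masked_lm(assay: str) -> str:
    """Placeholder for masked-marginal scoring (ESM-style models)."""
    return f"{assay}: masked-marginal zero-shot scoring"


def score_autoregressive(assay: str) -> str:
    """Placeholder for the AR zero-shot scoring function."""
    return f"{assay}: AR zero-shot scoring"


# Model type -> zero-shot scoring function (hypothetical registry).
ZERO_SHOT_SCORERS: Dict[str, Callable[[str], str]] = {
    "esm": score_masked_lm,
    "ar": score_autoregressive,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="ProteinGym eval (sketch)")
    parser.add_argument("--model-type", choices=sorted(ZERO_SHOT_SCORERS), default="esm")
    parser.add_argument(
        "--fine-tuned",
        action="store_true",
        help="evaluate a fine-tuned model instead of zero-shot scoring",
    )
    parser.add_argument("--assay", default="example_assay")
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    args = build_parser().parse_args(argv)
    if args.fine_tuned:
        return f"{args.assay}: fine-tuned evaluation with {args.model_type} (not sketched here)"
    return ZERO_SHOT_SCORERS[args.model_type](args.assay)
```

For example, `main(["--model-type", "ar"])` dispatches to the AR placeholder, while `main([])` falls back to masked-marginal scoring. Keeping the scorers in a dictionary means adding a new model family only needs a new entry.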

Muedi commented 1 year ago

@pascalnotin Hi, to implement the script you provided we need the base sequence. Do some experiments have multiple base sequences? If not, I have already written a function that returns the base sequences for all experiments during preprocessing :)

pascalnotin commented 1 year ago

Hi @Muedi, each assay mutates a single reference sequence. We have a reference file with all these reference sequences in the repo (they can't be inferred from the mutants, as the mutated range is sometimes just a subset of the full protein). Link to reference file: https://github.com/OATML-Markslab/ProteinGym/blob/main/ProteinGym_reference_file_substitutions.csv
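A minimal sketch of reading such a reference file into an assay-to-sequence lookup, assuming it is a CSV with an assay identifier column and a reference sequence column. The column names `DMS_id` and `target_seq` are assumptions to check against the actual file header, the sample rows are made up, and both helper functions are hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO


def load_reference_sequences(
    handle: TextIO, id_col: str = "DMS_id", seq_col: str = "target_seq"
) -> Dict[str, str]:
    """Map each assay id to its single reference sequence."""
    reader = csv.DictReader(handle)
    missing = {id_col, seq_col} - set(reader.fieldnames or [])
    if missing:
        raise KeyError(f"reference file is missing columns: {sorted(missing)}")
    return {row[id_col]: row[seq_col] for row in reader}


def too_long(references: Dict[str, str], max_len: int = 1023) -> List[str]:
    """Assay ids whose reference sequence exceeds the length threshold quoted in this thread."""
    return [dms_id for dms_id, seq in references.items() if len(seq) > max_len]


# Made-up rows, only to show the expected layout.
SAMPLE = """DMS_id,target_seq,seq_len
toy_assay_A,MKTAYIAKQR,10
toy_assay_B,MSLLTEVETP,10
"""

references = load_reference_sequences(io.StringIO(SAMPLE))
```

For a downloaded copy of the file, the same function can be called on `open(path, newline="")`. The `too_long` helper is one way to flag assays that would need special handling for the context-length limit mentioned earlier.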

Muedi commented 1 year ago

Ok, thanks, will revert the respective function then and instead download this file too :)

Muedi commented 1 year ago

/take

pascalnotin commented 12 months ago

Closing this issue as it was addressed by PR #50. Thank you, @Muedi!