Eval Script update

Muedi commented 9 months ago

Hi, as discussed before, I rolled back to the state before all the ESM code was added and just used the package, which worked perfectly without further changes.

The wt- and masked-marginals modes work fine. I added a split and a loop over all mutants to be compatible with multi-mutants. The score given for a multi-mutant is averaged over all of its mutants.
I changed pseudo-ppl to work with the input mutated_sequence directly, instead of modifying the base sequence as in the original script.
I added evaluate and fair-esm to the yaml as dependencies in the pip section.
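A minimal sketch of the split-and-average logic described above, not the code from the PR. It assumes multi-mutants are written as single substitutions joined by ":" (e.g. "A2G:T5S") with 1-based positions; `parse_mutant`, `score_multi_mutant` and the `score_single` callback are hypothetical names standing in for whatever the script actually uses.

```python
from statistics import mean
from typing import Callable, List, Tuple


def parse_mutant(mutant: str, sep: str = ":") -> List[Tuple[str, int, str]]:
    """Split e.g. 'A2G:T5S' into [(wt, position, mt), ...] with 1-based positions."""
    parsed = []
    for single in mutant.split(sep):
        wt, pos, mt = single[0], int(single[1:-1]), single[-1]
        parsed.append((wt, pos, mt))
    return parsed


def score_multi_mutant(
    mutant: str,
    sequence: str,
    score_single: Callable[[str, int, str], float],
    sep: str = ":",
) -> float:
    """Score every single substitution and return the average.

    `score_single` is a placeholder for the model-specific scoring
    (e.g. wt-marginals or masked-marginals); it is an assumption here.
    """
    scores = []
    for wt, pos, mt in parse_mutant(mutant, sep):
        # Sanity check: the wild-type residue must match the base sequence.
        if sequence[pos - 1] != wt:
            raise ValueError(f"Expected {wt} at position {pos}, found {sequence[pos - 1]}")
        scores.append(score_single(wt, pos, mt))
    return mean(scores)
```

A single mutant such as "A2G" goes through the same path, since splitting on ":" simply yields one element.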

Points to discuss:

The pseudo-PPL runs forever: I have a 4090 running and tqdm tells me the first dataframe alone will take 1600 hours. These estimates normally go down during run time, but it will undoubtedly be a long run. Any ideas to reduce this time? I don't think we'll be able to reduce it much, as we need to compute the likelihood for each AA in each sequence... I'll ask @NZ99 for cluster access perhaps?
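To make the cost concrete, here is a rough sketch of pseudo-perplexity, assuming it is computed the usual way for MLMs: mask each position in turn and take the log-probability of the true residue. A sequence of length L then needs L masked evaluations, which is why the run time scales with the total number of residues in the dataframe. Grouping the masked copies into batches does not reduce that count, it only cuts the number of separate model calls, so any speed-up depends on GPU memory. `log_prob_batch` is a hypothetical stand-in for the real model call, and `uniform_log_prob_batch` is a toy model used only to check the arithmetic.

```python
import math
from typing import Callable, List, Sequence

MASK = "<mask>"  # placeholder mask token; the real tokenizer's token may differ


def masked_variants(sequence: str) -> List[List[str]]:
    """Return one tokenised copy of the sequence per position, with that position masked."""
    tokens = list(sequence)
    return [tokens[:i] + [MASK] + tokens[i + 1:] for i in range(len(tokens))]


def pseudo_perplexity(
    sequence: str,
    log_prob_batch: Callable[[Sequence[List[str]], Sequence[int], Sequence[str]], List[float]],
    batch_size: int = 32,
) -> float:
    """exp(-mean log p(true residue | sequence with that residue masked))."""
    variants = masked_variants(sequence)
    total = 0.0
    for start in range(0, len(variants), batch_size):
        batch = variants[start:start + batch_size]
        positions = list(range(start, start + len(batch)))
        targets = [sequence[i] for i in positions]
        # One model call scores a whole batch of masked copies.
        total += sum(log_prob_batch(batch, positions, targets))
    return math.exp(-total / len(sequence))


def uniform_log_prob_batch(batch, positions, targets) -> List[float]:
    """Toy model: every one of 20 amino acids is equally likely."""
    return [math.log(1.0 / 20.0) for _ in batch]
```

With the toy uniform model the pseudo-perplexity comes out at 20 (up to floating-point error), the size of the alphabet, which is a quick sanity check for the real implementation.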

Remaining tasks:

Include separate logic for sequences that exceed the context size. @pascalnotin told me he'll be able to add this from already established code of another project :)
Add the eval case for APT/other autoregressive models. I'll start this today; Pascal pointed me at this: https://github.com/lightonai/RITA/blob/master/compute_fitness.py
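For the autoregressive case, one common way to score a variant is to sum the next-token log-probabilities of the mutated sequence and compare that with the wild type. The sketch below shows only that generic idea; it is not taken from the linked RITA script, which may differ in its details. `next_token_log_prob` is a hypothetical callback standing in for the model, and `toy_log_prob` is a made-up distribution used only to check the arithmetic.

```python
import math
from typing import Callable


def sequence_log_likelihood(
    sequence: str,
    next_token_log_prob: Callable[[str, str], float],
) -> float:
    """Sum of log p(residue_i | residues before i) over the whole sequence."""
    return sum(
        next_token_log_prob(sequence[:i], sequence[i])
        for i in range(len(sequence))
    )


def zero_shot_score(
    mutated_sequence: str,
    wild_type_sequence: str,
    next_token_log_prob: Callable[[str, str], float],
) -> float:
    """Log-likelihood of the mutant relative to the wild type (higher = more likely)."""
    return (
        sequence_log_likelihood(mutated_sequence, next_token_log_prob)
        - sequence_log_likelihood(wild_type_sequence, next_token_log_prob)
    )


def toy_log_prob(prefix: str, token: str) -> float:
    """Toy model that ignores the prefix: 'A' has probability 0.5, the other 19 share the rest."""
    return math.log(0.5) if token == "A" else math.log(0.5 / 19)
```

Under the toy model, changing a K to an A gives a score of log(19), since only the mutated position contributes to the difference.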

Best regards, Max

Muedi commented 9 months ago

I just added the changes to run the eval for APT with Pascal's code from RITA.

As is, we can choose between supervised or not and between AUTOREG or not; both options only work with the matching model type, either an autoregressive model or an MLM model.

This is not very elegant but works :D

Down the line we should either add command-line args and use it as a non-interactive script, or I convert it to a Jupyter notebook and add more comprehensive comments/text there.

pascalnotin commented 9 months ago

Thank you so much @Muedi ! I finally had the chance to have a look :) What you wrote works and covers the main use cases. I have a few suggestions to make it easier to use and maintain down the line:

Let's rename Protein-gym.py to evaluate_fitness_proteingym.py
Let's use argparse for the script parameters instead of setting them manually within the script. Here is an example we used for Tranception: https://github.com/OATML-Markslab/Tranception/blob/main/score_tranception_proteingym.py#L18
Let's break down the code into a few separate modules that are called from the main evaluate_fitness_proteingym.py script:
- one script called "download_proteingym_data.py" which does exactly what you do on lines 85-124
- one script called "fitness_supervised.py" which replicates lines 161-230. This mode is a second priority I would say -- we will primarily rely on zero-shot eval, but sounds good to keep it since you have already coded it! (would be interesting to see results there towards the end of the project)
- one script called "fitness_zero_shot_AR.py" which replicates the autoregressive eval
- one script called "fitness_zero_shot_ESM.py" which replicates the MLM eval from ESM
Re: model loading:
- for autoregressive, we should adapt to load a model from the APT class by default (based on checkpoint location passed as argument)
- for MLM, same thing but ESM2 by default
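A rough sketch of how the argparse entry point and the module split above could fit together. The flag names (`--mode`, `--model_type`, `--checkpoint`, `--data_dir`) and the `select_module` helper are assumptions for illustration, not the Tranception script's arguments; only the module names come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for evaluate_fitness_proteingym.py (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Evaluate fitness prediction on ProteinGym")
    parser.add_argument("--mode", choices=["zero_shot", "supervised"], default="zero_shot",
                        help="Zero-shot eval is the primary use case")
    parser.add_argument("--model_type", choices=["AR", "MLM"], default="AR",
                        help="AR = autoregressive (e.g. APT), MLM = masked LM (e.g. ESM2)")
    parser.add_argument("--checkpoint", type=str, default=None,
                        help="Model checkpoint location; None means the default for the model type")
    parser.add_argument("--data_dir", type=str, default="data",
                        help="Where download_proteingym_data.py stores the data")
    return parser


def select_module(args: argparse.Namespace) -> str:
    """Map the parsed arguments to the module that should handle the run."""
    if args.mode == "supervised":
        return "fitness_supervised"
    if args.model_type == "AR":
        return "fitness_zero_shot_AR"
    return "fitness_zero_shot_ESM"
```

The main script would then call something like `args = build_parser().parse_args()` and import the module returned by `select_module(args)`, which keeps the AUTOREG/supervised switches out of the script body.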

Let me know if that makes sense!

Muedi commented 9 months ago

all sensible to me, I'll add the changes in the coming days and just push them here :)

pascalnotin commented 9 months ago

Great - thank you, Max!

Muedi commented 9 months ago

Hi,

I added the files as requested, they run when called from the base directory. I also added the yaml file and readme changes for GPU use.

best, Max

OpenBioML / protein-lm-scaling

Eval Script update #50