OpenBioML / protein-lm-scaling

Eval Script update #50

Closed Muedi closed 9 months ago

Muedi commented 9 months ago

Hi, as discussed before, I rolled back to the state before all the ESM code was added and just used the package, which worked perfectly without further changes.

Points to discuss:

remaining tasks:

Best regards, Max

Muedi commented 9 months ago

I just added the changes to run the eval for APT with Pascal's code from RITA.

As is, we can choose between supervised or not, and between AUTOREG or not; each of these settings only works with the matching model type, i.e. either an autoregressive model or an MLM model.

This is not very elegant but works :D

Down the line, we should either add command-line args and use it as a non-interactive script, or I convert it to a Jupyter notebook and add more comprehensive comments/text in the notebook.

pascalnotin commented 9 months ago

Thank you so much @Muedi ! I finally had the chance to have a look :) What you wrote works and covers the main use cases. I have a few suggestions to make it easier to use and maintain down the line:

  1. Let's rename Protein-gym.py to evaluate_fitness_proteingym.py
  2. Let's use argparse for the script parameters instead of setting these manually within the script. Here is an example we used for Tranception: https://github.com/OATML-Markslab/Tranception/blob/main/score_tranception_proteingym.py#L18
  3. Let's break down the code into a few separate modules that are called from the main evaluate_fitness_proteingym.py script:
    • one script called "download_proteingym_data.py" which does exactly what you do on lines 85-124
    • one script called "fitness_supervised.py" which replicates lines 161-230. This mode is a second priority I would say -- we will primarily rely on zero-shot eval, but sounds good to keep it since you have already coded it! (would be interesting to see results there towards the end of the project)
    • one script called "fitness_zero_shot_AR.py" which replicates the autoregressive eval
    • one script called "fitness_zero_shot_ESM.py" which replicates the MLM eval from ESM
  4. Re: model loading:
    • for autoregressive, we should adapt the code to load a model from the APT class by default (based on the checkpoint location passed as an argument)
    • for MLM, same thing but ESM2 by default
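
To make points 2 and 3 concrete, here is a rough sketch of what the argparse entry point could look like. The flag names (`--mode`, `--checkpoint`, `--data_dir`, `--skip_download`), the mode strings, and the default data path are all placeholders I made up for illustration; only the module file names come from the list above. The actual imports of those modules are left out, so this only shows the argument parsing and the dispatch.

```python
import argparse

# Hypothetical mode names (assumption); each maps to one of the modules proposed above.
MODE_TO_MODULE = {
    "zero_shot_AR": "fitness_zero_shot_AR",
    "zero_shot_ESM": "fitness_zero_shot_ESM",
    "supervised": "fitness_supervised",
}


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line parser (all flag names are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Evaluate fitness prediction on ProteinGym (sketch)."
    )
    parser.add_argument(
        "--mode",
        choices=sorted(MODE_TO_MODULE),
        default="zero_shot_AR",
        help="Which evaluation to run; zero-shot is the primary use case.",
    )
    parser.add_argument(
        "--checkpoint",
        type=str,
        default=None,
        help="Checkpoint location of the model to load (APT or ESM2).",
    )
    parser.add_argument(
        "--data_dir",
        type=str,
        default="data/ProteinGym",  # placeholder path, not taken from the repo
        help="Where the ProteinGym data is (or will be) stored.",
    )
    parser.add_argument(
        "--skip_download",
        action="store_true",
        help="Do not call download_proteingym_data.py before evaluating.",
    )
    return parser


def main(argv=None):
    """Parse arguments and pick the module that would handle the chosen mode."""
    args = build_parser().parse_args(argv)
    module_name = MODE_TO_MODULE[args.mode]
    # The real script would import `module_name` here and call its entry point,
    # passing args.checkpoint and args.data_dir along.
    print(f"mode={args.mode} -> {module_name}.py, checkpoint={args.checkpoint}")
    return args, module_name


if __name__ == "__main__":
    main()
```

With something like this, a run would look like `python evaluate_fitness_proteingym.py --mode zero_shot_ESM --checkpoint <path>`, and each mode stays in its own file so it can be tested separately.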

Let me know if that makes sense!

Muedi commented 9 months ago

all sensible to me, I'll add the changes in the coming days and just push them here :)

pascalnotin commented 9 months ago

Great - thank you, Max!

Muedi commented 9 months ago

Hi,

I added the files as requested; they run when called from the base directory. I also added the YAML file and README changes for GPU use.

best, Max