Add contact predictions to evaluation framework

csjackson0 commented 9 months ago

Adding two contact prediction methods to the evaluation framework, per issue #19.

Categorical jacobian
- Used https://github.com/sokrypton/algosb_2021/blob/main/BERT_esm1b.ipynb to compute the categorical jacobian.
- Created a utils.py file under evaluation/scripts using functions from https://github.com/sokrypton/algosb_2021/blob/main/utils.py to compute contacts.
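
For readers unfamiliar with the method, here is a minimal NumPy sketch of the categorical Jacobian idea. It is not the code in this PR or in the linked notebook: `logits_fn`, `toy_logits`, and the toy coupling tensor `W` are hypothetical stand-ins for a real language model, and the post-processing (centering, Frobenius norm, average product correction, symmetrization) is one reasonable choice rather than a confirmed copy of utils.py.

```python
import numpy as np

def categorical_jacobian(logits_fn, tokens, vocab_size):
    """J[i, a, j, b]: change in logit b at position j when position i is set to token a."""
    tokens = np.asarray(tokens)
    length = len(tokens)
    base = logits_fn(tokens)  # (length, vocab_size) logits of the unmodified sequence
    jac = np.zeros((length, vocab_size, length, vocab_size))
    for i in range(length):
        for a in range(vocab_size):
            mutated = tokens.copy()
            mutated[i] = a  # substitute token a at position i
            jac[i, a] = logits_fn(mutated) - base
    return jac

def apc(scores):
    """Average product correction on an (L, L) score matrix."""
    col_sums = scores.sum(axis=0, keepdims=True)  # shape (1, L)
    row_sums = scores.sum(axis=1, keepdims=True)  # shape (L, 1)
    return scores - row_sums * col_sums / scores.sum()

def jacobian_to_contacts(jac):
    """Reduce an (L, A, L, A) Jacobian to a symmetric (L, L) contact score map."""
    for axis in range(4):  # center along every axis
        jac = jac - jac.mean(axis=axis, keepdims=True)
    scores = np.sqrt((jac ** 2).sum(axis=(1, 3)))  # Frobenius norm over token axes
    np.fill_diagonal(scores, 0.0)  # ignore self-coupling
    corrected = apc(scores)
    return 0.5 * (corrected + corrected.T)

# Hypothetical toy "model": pairwise couplings, strong only between positions 0 and 3.
rng = np.random.default_rng(0)
L, A = 5, 4
W = 0.01 * rng.normal(size=(L, A, L, A))
W[0, :, 3, :] += rng.normal(size=(A, A))
W[3, :, 0, :] += rng.normal(size=(A, A))

def toy_logits(tokens):
    # logits[j, b] = sum over i of W[i, tokens[i], j, b]
    return W[np.arange(L), tokens].sum(axis=0)

tokens = np.array([0, 1, 2, 3, 1])
contacts = jacobian_to_contacts(categorical_jacobian(toy_logits, tokens, A))
top_pair = np.unravel_index(np.argmax(contacts), contacts.shape)
```

With a real model, `logits_fn` would wrap a forward pass; the double loop costs one forward pass per (position, token) pair, so batching the substitutions is the obvious optimization.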

ESM contact prediction
- Added a modules.py file under modeling/utils that implements the ContactPredictionHead from https://github.com/facebookresearch/esm/blob/main/esm/modules.py#L317
- Updated the APTLMHeadModel class with a predict_contacts function.
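
As a rough illustration of what an attention-based contact head does, the sketch below applies a logistic regression to symmetrized, APC-corrected attention maps. It is a simplified NumPy stand-in, not the modules.py added here: the function name `predict_contacts_from_attention`, its arguments, and the assumption that the BOS/EOS tokens sit at the first and last positions are all illustrative.

```python
import numpy as np

def symmetrize(x):
    """Make the last two dimensions symmetric."""
    return x + np.swapaxes(x, -1, -2)

def apc(x):
    """Average product correction over the last two dimensions."""
    a1 = x.sum(axis=-1, keepdims=True)
    a2 = x.sum(axis=-2, keepdims=True)
    a12 = x.sum(axis=(-1, -2), keepdims=True)
    return x - a1 * a2 / a12

def predict_contacts_from_attention(attn, weights, bias,
                                    prepend_bos=True, append_eos=True):
    """attn: (layers, heads, T, T) attention maps for one sequence.
    weights: (layers * heads,) regression coefficients; bias: scalar.
    Returns an (L, L) matrix of contact probabilities."""
    if append_eos:   # drop the trailing EOS row and column (assumed position)
        attn = attn[..., :-1, :-1]
    if prepend_bos:  # drop the leading BOS row and column (assumed position)
        attn = attn[..., 1:, 1:]
    layers, heads, length, _ = attn.shape
    feats = apc(symmetrize(attn.reshape(layers * heads, length, length)))
    logits = np.tensordot(weights, feats, axes=1) + bias  # weighted sum over heads
    return 1.0 / (1.0 + np.exp(-logits))                  # sigmoid
```

In practice the weights and bias come from fitting the regression on structures with known contacts; here they are plain inputs so the sketch stays independent of any particular model.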

Both methods are implemented in the contact_prediction.ipynb file under evaluation/scripts, which integrates functions from the ESM contact_prediction.ipynb notebook: https://github.com/facebookresearch/esm/blob/main/examples/contact_prediction.ipynb

pascalnotin commented 9 months ago

Thank you @csjackson0!

It would be more practical for us down the line to perform contact prediction in a Python script rather than in a notebook, so that we can more easily kick off a full evaluation routine (including fitness prediction and design) for each model variant we train. Would it be possible to adjust the PR accordingly?

Regarding testing, perhaps we could have a unit test that loads ESM2 and the same datasets as in the notebook and confirms that we match its results?

csjackson0 commented 9 months ago

Sounds good! I will adjust the PR to perform contact prediction in a Python script.

I will also add the unit test you suggested.

csjackson0 commented 9 months ago

@pascalnotin The PR has been adjusted so that contact prediction is performed in a Python script.

- Created "contact_prediction.py" under evaluation/scripts.
- Added an "Evaluation" section to the README.md that explains how to run the script, describes the expected outputs, and references the methods.
- Added the "biotite" dependency to the "protein_lm.yml" file.
- Created "test_contact_prediction.py" and added unit tests for both methods.

For testing, I ran the ESM2 notebook using the provided .a3m files and confirmed that our results match.
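
For anyone who wants to reproduce this kind of comparison, one common summary metric is the precision of the top-k predicted pairs. The sketch below is illustrative only and is not taken from test_contact_prediction.py: the helper name `precision_at_k`, the `min_sep` default, and the toy arrays are assumptions.

```python
import numpy as np

def precision_at_k(pred, true_contacts, k, min_sep=6):
    """Fraction of the k highest-scoring pairs (i, j), j - i >= min_sep, that are true contacts."""
    length = pred.shape[0]
    i, j = np.triu_indices(length, k=min_sep)  # upper-triangle pairs with enough separation
    order = np.argsort(-pred[i, j])[:k]        # indices of the k highest scores
    return float(true_contacts[i[order], j[order]].mean())

# Toy example: three true long-range contacts in a length-12 sequence.
true_contacts = np.zeros((12, 12), dtype=bool)
for a, b in [(0, 8), (1, 9), (2, 10)]:
    true_contacts[a, b] = true_contacts[b, a] = True

perfect = precision_at_k(true_contacts.astype(float), true_contacts, k=3)
inverted = precision_at_k(1.0 - true_contacts, true_contacts, k=3)
```

A regression test could then assert that the precision obtained from the script stays within a small tolerance of the value obtained from the reference notebook on the same inputs.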

OpenBioML / protein-lm-scaling

Add contact predictions to evaluation framework #54