ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Finetuned ImageMol Model for GPCR assays from ChEMBL datasets #571

Closed DhanshreeA closed 1 year ago

DhanshreeA commented 1 year ago

Model Name

imagemol-gpcr

Model Description

A representation learning framework that uses molecule images to encode molecular inputs as machine-readable vectors for downstream tasks such as bioactivity prediction, drug metabolism analysis, or drug toxicity prediction. The approach uses transfer learning: the model is pre-trained on massive unlabeled datasets so that it learns general feature extraction, and is then fine-tuned on specific tasks. This is a regression model and will be fine-tuned on the top 10 GPCR datasets with the largest number of reported ligands in ChEMBL.

Slug

image-mol-gpcr

Tag

GPCR,regression

Publication

https://www.nature.com/articles/s42256-022-00557-6

Source code

https://github.com/HongxinXiang/ImageMol

License

MIT License

DhanshreeA commented 1 year ago

The dataset consists of 10 assays. A pre-trained ImageMol model will be fine-tuned per assay, and together the fine-tuned models will be incorporated into Ersilia as a unified model.
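One way to picture the per-assay setup is a thin wrapper that calls each fine-tuned model and collects the outputs into one row per molecule. The sketch below is only illustrative: `predict_all`, the `Predictor` type, the assay labels, and the dummy predictors are hypothetical stand-ins and are not taken from ImageMol or Ersilia.

```python
from typing import Callable, Dict, List

# Hypothetical type: a fine-tuned model is any callable mapping a SMILES string to a float.
Predictor = Callable[[str], float]


def predict_all(smiles_list: List[str], models: Dict[str, Predictor]) -> List[Dict[str, object]]:
    """Run every per-assay model on every molecule; return one row per molecule."""
    rows: List[Dict[str, object]] = []
    for smiles in smiles_list:
        row: Dict[str, object] = {"smiles": smiles}
        for assay, model in models.items():
            # One output column per assay, keyed by the assay label.
            row[assay] = model(smiles)
        rows.append(row)
    return rows


# Dummy predictors standing in for the real fine-tuned checkpoints (assumption).
models: Dict[str, Predictor] = {
    "assay_1": lambda s: float(len(s)),
    "assay_2": lambda s: float(s.count("C")),
}
rows = predict_all(["CCO", "c1ccccc1"], models)
```

With ten real models in `models`, each row would carry ten regression outputs, one per GPCR assay.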

DhanshreeA commented 1 year ago

@GemmaTuron @miquelduranfrigola I understand what GPCRs basically are: they're proteins that exist on cell membranes and act as a kind of API for cells, i.e. based on the input, some response is activated within the cell. These inputs are "ligands", basically small molecules (which come from a drug). What I don't understand is why this is a regression task and not a classification task, i.e. which question does this answer:

Does the molecule bind to the protein? Or: to what degree does this molecule bind to the protein? (If it's this, then what is the unit of measurement for these values?)

GemmaTuron commented 1 year ago

/approve

github-actions[bot] commented 1 year ago

New Model Repository Created! 🎉

@DhanshreeA ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos93h2

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

DhanshreeA commented 1 year ago

Waiting for the authors to get back with fine-tuned models for these assays. Reproducing the results from the paper has been challenging because the exact hyperparameters are not known, and finding them takes some experimentation across the ranges of values the authors provide in their supplementary materials.

DhanshreeA commented 1 year ago

Have pinged the authors again and they have replied by email (@miquelduranfrigola is CC'd). Hopefully they will share the model checkpoints soon.

Same for #572

DhanshreeA commented 1 year ago

Hi @GemmaTuron, could you please add me to this repository? The authors have finally shared the models, and since this repo is old, it doesn't have mock.csv tracked in LFS. Alternatively, you could just update mock.csv in this repo.

DhanshreeA commented 1 year ago

> @GemmaTuron @miquelduranfrigola I understand what GPCRs basically are: they're proteins that exist on cell membranes and act as a kind of API for cells, i.e. based on the input, some response is activated within the cell. These inputs are "ligands", basically small molecules (which come from a drug). What I don't understand is why this is a regression task and not a classification task, i.e. which question does this answer:
>
> Does the molecule bind to the protein? Or: to what degree does this molecule bind to the protein? (If it's this, then what is the unit of measurement for these values?)

@GemmaTuron could you please answer this? It will help me update the Inference section better as well.

GemmaTuron commented 1 year ago

Hi @DhanshreeA !

The model predicts drug-protein binding activity. Drug-protein binding can be measured in different ways, and indeed I don't see much explanation in the paper. What are the headers of the raw datasets? They might provide some more information. For the interpretation, I'd say: "Binding activity of the input molecule to the following GPCRs: ...", replacing the ... with the GPCR acronyms.

DhanshreeA commented 1 year ago

> Hi @DhanshreeA !
>
> The model predicts drug-protein binding activity. Drug-protein binding can be measured in different ways, and indeed I don't see much explanation in the paper. What are the headers of the raw datasets? They might provide some more information. For the interpretation, I'd say: "Binding activity of the input molecule to the following GPCRs: ...", replacing the ... with the GPCR acronyms.

Thanks for the suggestion @GemmaTuron. The headers of the raw datasets (at least those the authors have provided) are the same as here. From the supplementary materials, they mention:

The top 10 G protein coupled receptors (GPCRs) datasets with the largest number of reported ligands from ChEMBL database (https://www.ebi.ac.uk/chembl/) are used to predict drug-protein binding affinity (regression task)

I looked up one of the receptors (AA1R, Adenosine Receptor A1) on ChEMBL and it seems like the values may be some sort of binding efficiency index. I could be wrong though.

DhanshreeA commented 1 year ago

However @GemmaTuron, I have another question - what is an appropriate tag for this? :sweat_smile: In my understanding this would come under Target identification, what do you think?

GemmaTuron commented 1 year ago

The ChEMBL link is very helpful, thanks @DhanshreeA! It seems the measure they are using is the BEI (BEI = (pKi, pKd, or pIC50) / (molecular weight in kDa)) or the SEI (or a combination of both?). To make sure we are indicating the right measurement, could you check whether the output values for a few molecules make sense, i.e. whether they are in the same range as the ones indicated in ChEMBL? Maybe take molecules with high ligand efficiency and check what they score.
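As a rough illustration of the two indices, the sketch below computes BEI as given above. The SEI formula is not spelled out in this thread, so the code assumes the commonly cited definition, p-activity divided by (polar surface area / 100 Å²). The function names and the example numbers are hypothetical and not taken from the datasets.

```python
def bei(p_activity: float, mol_weight_da: float) -> float:
    """Binding efficiency index: p-activity (pKi, pKd, or pIC50) per kDa of molecular weight."""
    return p_activity / (mol_weight_da / 1000.0)


def sei(p_activity: float, psa_a2: float) -> float:
    """Surface efficiency index, assuming SEI = p-activity / (polar surface area / 100 A^2)."""
    return p_activity / (psa_a2 / 100.0)


# Illustrative inputs only: a compound with pKi 8.0, MW 400 Da, PSA 80 A^2.
example_bei = bei(p_activity=8.0, mol_weight_da=400.0)
example_sei = sei(p_activity=8.0, psa_a2=80.0)
```

Because the two indices use different denominators, they usually fall in different numeric ranges for the same compound, which is why comparing model outputs against the ChEMBL values can hint at which one was used.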

Tags:
- GPCR (this should be added in the "Targets" file, under hub/content/metadata; can you make a new PR to include it?)
- Target identification

DhanshreeA commented 1 year ago

Hi @GemmaTuron, that's a good idea. I tested the model for AA1R against a few compounds from the ChEMBL page and here are the results:

I think it's also worth noting that the model's RMSE on this assay is reported as 0.711±0.012. Across the five compounds I've tested (a small set, admittedly), the RMSE comes to ~0.62, which seems consistent with the reported results and also suggests that the measure being used is likely SEI.

Should I update the interpretation in the README to reflect that this is the measure being used (or likely being used)?
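For anyone repeating this sanity check, the RMSE between reference values and model outputs can be computed as in the minimal sketch below. The `rmse` helper and the numbers are placeholders for illustration; they are not the five compounds or values from this test.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between reference values and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Placeholder values (assumption): reference efficiency indices vs. model outputs.
reference = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
value = rmse(reference, predicted)
```

A value close to the reported RMSE only suggests that the outputs are on the same scale as the reference measure; with so few compounds it is a plausibility check rather than a validation.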

DhanshreeA commented 1 year ago

Associated PR: https://github.com/ersilia-os/eos93h2/pull/1 This model is ready to be tested by others.

GemmaTuron commented 1 year ago

Sounds good, thanks Dhanshree! Please update the interpretation accordingly. I'll close this issue.