Closed DhanshreeA closed 1 year ago
The dataset consists of 10 assays. A pre-trained ImageMol model will be fine-tuned per assay, and the resulting models will be incorporated into Ersilia as a unified model.
@GemmaTuron @miquelduranfrigola I understand what GPCRs basically are: they're proteins which exist on cell membranes and act as a kind of API for cells, i.e. based on the input, some response is activated within the cell. These inputs are "ligands", basically small molecules (which may come from a drug). What I don't understand is why this is a regression task and not a classification task, i.e. which question does this answer:
Does the molecule bind to the protein? Or: to what degree does this molecule bind to the protein? (If it's the latter, what is the unit of measurement for these values?)
/approve
@DhanshreeA Ersilia model repository has been successfully created and is available at:
Now that your new model repository has been created, you are ready to start contributing to it!
Here are some brief starter steps for contributing to your new model repository:
Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository
README.md file to accurately describe your model
If you have any questions, please feel free to open an issue and get support from the community!
Waiting for the authors to get back with fine-tuned models for these assays. Reproducing the results from the paper has been challenging because the exact hyperparameters are not known; the authors only provide ranges of values in their supplementary materials, so finding a working combination takes some experimentation.
I have pinged the authors again and they have replied by email (@miquelduranfrigola is CC'd). Hopefully they will share the model checkpoints soon.
Same for #572
Hi @GemmaTuron, could you please add me to this repository? The authors have finally shared the models, and since this repo is old, it doesn't have the mock.csv tracked in LFS. Alternatively, you could just update the mock.csv in this repo.
@GemmaTuron could you please answer my earlier question on why this is a regression task rather than a classification task, i.e. whether the model answers "does the molecule bind to the protein?" or "to what degree does it bind?", and in the latter case what the unit of measurement is? It will help me update the Inference section better as well.
Hi @DhanshreeA !
The model predicts drug-protein binding activity. Drug-protein binding can be measured in different ways, and indeed I don't see much explanation in the paper. What are the headers of the raw datasets? They might provide some more information.
For the interpretation, I'd say:
Binding activity of the input molecule to the following GPCRs: ...
changing the ... for the GPCR acronyms
Thanks for the suggestion @GemmaTuron. The headers of the raw datasets (at least the ones the authors have provided) are the same as here. In the supplementary materials, they mention:
The top 10 G protein coupled receptors (GPCRs) datasets with the largest number of reported ligands from ChEMBL database (https://www.ebi.ac.uk/chembl/) are used to predict drug-protein binding affinity (regression task)
I looked up one of the receptors (AA1R - Adenosine Receptor A1) on ChEMBL and it seems like the values may be some sort of binding efficiency index. I could be wrong though.
However @GemmaTuron, I have another question - what is an appropriate tag for this? :sweat_smile: In my understanding this would come under Target identification, what do you think?
The ChEMBL link is very helpful, thanks @DhanshreeA! It seems the measure they are using is the BEI (BEI = (pKi, pKd, or pIC50) / (molecular weight, kDa)) or the SEI (or a combination of both?). To make sure we are indicating the right measurement, could you check whether the output values for a few molecules make sense, i.e. whether they are in the same range as the ones indicated in ChEMBL? Maybe take molecules with high ligand efficiency and check what they score.
Tags: GPCR (this should be added in the "Targets" file, under hub/content/metadata; can you make a new PR to include it?) and Target identification
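To make the BEI formula quoted above concrete, here is a minimal Python sketch. The BEI definition is the one given in the comment; the SEI definition (p-activity divided by polar surface area in units of 100 Å²) is the commonly used ligand-efficiency definition and is an assumption here, since the thread does not spell it out. Function names and the example numbers are hypothetical, not taken from the ImageMol datasets.

```python
def binding_efficiency_index(p_activity: float, mol_weight_da: float) -> float:
    """BEI = p-activity (pKi, pKd, or pIC50) / molecular weight in kDa."""
    if mol_weight_da <= 0:
        raise ValueError("molecular weight must be positive")
    return p_activity / (mol_weight_da / 1000.0)


def surface_efficiency_index(p_activity: float, psa_sq_angstrom: float) -> float:
    """SEI = p-activity / (polar surface area / 100 A^2). Assumed definition."""
    if psa_sq_angstrom <= 0:
        raise ValueError("polar surface area must be positive")
    return p_activity / (psa_sq_angstrom / 100.0)


if __name__ == "__main__":
    # Hypothetical compound: pKi = 8.0, MW = 400 Da, PSA = 80 A^2
    print(binding_efficiency_index(8.0, 400.0))
    print(surface_efficiency_index(8.0, 80.0))
```

Comparing the range of the model outputs against both indices for a few known ligands is one way to narrow down which measure the assay values correspond to.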
Hi @GemmaTuron that's a good idea and I tested the model for AA1R against a few compounds from the ChEMBL page and here are the results:
I think it's also worth noting that the model's RMSE on this assay has been reported as 0.711±0.012. Across the five compounds I've tested (a small set, admittedly), the RMSE comes to ~0.62, which seems consistent with the reported results and is also an indicator that the measure being used is likely SEI.
Should I update the interpretation in the readme to reflect that this is the measure being used (or likely being used)?
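For anyone repeating this sanity check, the RMSE comparison can be sketched as below. This is only an illustration of the calculation: the function name is hypothetical, and the values in the usage example are placeholders, not the five AA1R compounds tested above.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean squared error between model outputs and reference values."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one pair of values is required")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Placeholder values for illustration only
    model_outputs = [7.1, 6.4, 8.0]
    chembl_values = [7.5, 6.0, 8.3]
    print(round(rmse(model_outputs, chembl_values), 3))
```

With only a handful of compounds the estimate is noisy, so agreement with the reported RMSE should be read as a plausibility check rather than a confirmation.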
Associated PR: https://github.com/ersilia-os/eos93h2/pull/1 This model is ready to be tested by others.
Sounds good, thanks Dhanshree! Please update the interpretation accordingly. I'll close this issue.
Model Name
imagemol-gpcr
Model Description
Representation learning framework that uses molecule images to encode molecular inputs as machine-readable vectors for downstream tasks such as bioactivity prediction, drug metabolism analysis, or drug toxicity prediction. The approach relies on transfer learning: the model is pre-trained on massive unlabeled datasets so that it learns general feature extraction, and is then fine-tuned on specific tasks. This is a regression model and will be fine-tuned on the top 10 GPCR datasets with the largest number of reported ligands in the ChEMBL database.
Slug
image-mol-gpcr
Tag
GPCR,regression
Publication
https://www.nature.com/articles/s42256-022-00557-6
Source code
https://github.com/HongxinXiang/ImageMol
License
MIT License