[Initiative]: Annotate Ersilia's models following BioModels standards

miquelduranfrigola commented 4 months ago

Summary

We have partnered with BioModels at EMBL-EBI (Hinxton) to explore potential ways to incorporate Ersilia's models into the well-established BioModels resource.

Of note, BioModels model annotation is based on ontologies as reported in the Ontology Lookup Service. We expect to reach similar standards thanks to the current project.

Scope

Initiative 🐋

Objective(s)

The objectives of the project are the following:

Incorporate Ersilia's models into BioModels (metadata only).
Adopt an ontology-based model annotation procedure for Ersilia that is harmonized with that of BioModels.
Set the basis for a more ambitious incorporation of models based on ONNX format.

Team

Role & Responsibility	Username(s)
DRI / Lead Developer	@Zainab-ik
Project Manager	@miquelduranfrigola

@Zainab-ik is currently doing an internship at EMBL-EBI in the BioModels team.

Importantly, @Zainab-ik will meet with @miquelduranfrigola twice a week to report progress and decide next steps. Before each meeting, @Zainab-ik will update the corresponding model issues; after the meeting, action items will be recorded in the issues.

Timeline

The project timeline is still up for discussion. These are some tentative milestones:

[x] Incorporate metadata only of a simple model into BioModels (e.g. antimalarial activity prediction).
[x] Incorporate metadata only of a more complex model into BioModels, potentially involving multiple outputs (e.g. H3D models).
[ ] Define ontology-based rules to improve Ersilia's metadata, harmonizing it with BioModels standards.
[ ] Incorporate metadata only for a substantial number of models.
[ ] Incorporate at least one model in ONNX format.

Documentation

A backlog of models can be found in the Ersilia BioModels Spreadsheet. This spreadsheet should act as a centralized resource to keep track of progress.

The shared folder in Google Drive can be accessed here.

Zainab-ik commented 2 months ago

New Models

eos4zfy - issue
eos6hy3 - issue. This publication is the same as for eos4cxk, and the same rules that apply to eos4e40 and eos9f6t can apply here, right?
eos42ez - issue. This publication is the same as for eos18ie, and the same rule applies.
eos31ve - issue. This is also the same publication as for eos9yy1.

Zainab-ik commented 2 months ago

Antimicrobial and COVID models uploaded to BioModels

eos3804 - https://www.ebi.ac.uk/biomodels/MODEL2405080001
eos18ie - https://www.ebi.ac.uk/biomodels/MODEL2405080002
eos5cl7 - https://www.ebi.ac.uk/biomodels/MODEL2405080003
eos24jm - https://www.ebi.ac.uk/biomodels/MODEL2405080004
eos4cxk - https://www.ebi.ac.uk/biomodels/MODEL2405080005
eos9f6t - https://www.ebi.ac.uk/biomodels/MODEL2405080006

GemmaTuron commented 2 months ago

Hey @Zainab-ik

Before starting with new models, can you have a look at the existing ones and make sure they all comply with the latest decisions we have made? Note down here any changes that had to be made in the annotations.

thanks!

Zainab-ik commented 2 months ago

Yes, working on that.

Zainab-ik commented 2 months ago

Previous model review summary - removed general metadata and confirmed experimental validation

eos46ev - removed unnecessary metadata (e.g., molecular representation); confirmed there's no experimental validation.
eos5xng - edited the metadata. Removed: hit selection, chemical library, compound, validation dataset, in-silico approach (it's also a general term). Added: in-vitro experimental validation.
eos4e40 - the model was validated experimentally in vivo and in vitro, so both metadata terms were added. QSAR added, data source confirmed, and non-specific metadata removed (e.g., chemical library, molecular representation).
NCATS CYP models (eos44zp, eos5jz9, eos7nno, eos3ev6). Metadata removed: chemical library, hit, molecular representation.
NCATS permeability models (eos81ew, eos9tyg). Metadata removed: molecular representation, permeability assay (there's already a PAMPA metadata term).
NCATS stability models (eos5505, eos9yy1). Metadata removed: in-silico model, molecular representation, chemical library, CYP metabolism (doesn't fit the context of the model), compound stability.
NCATS solubility model (eos74bo). Metadata revised: organic molecule, hit.
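
The kind of clean-up described above (dropping general terms and duplicates from a model's metadata) could be scripted roughly as follows. This is only a sketch: `GENERIC_TERMS` and `drop_generic_terms` are hypothetical names, and the term list is taken from the review notes in this comment, not from any official BioModels or Ersilia vocabulary.

```python
# Hypothetical helper: the generic-term list comes from the review notes in
# this thread, not from an official BioModels or Ersilia vocabulary.
GENERIC_TERMS = {
    "chemical library",
    "molecular representation",
    "hit",
    "hit selection",
    "compound",
    "in-silico approach",
}


def _normalise(term):
    """Lower-case a term and collapse whitespace so comparisons are stable."""
    return " ".join(term.lower().split())


def drop_generic_terms(terms, generic=GENERIC_TERMS):
    """Return the terms in their original order, without generic or repeated entries."""
    seen = set()
    kept = []
    for term in terms:
        key = _normalise(term)
        if key in generic or key in seen:
            continue
        seen.add(key)
        kept.append(term)
    return kept


example = [
    "QSAR",
    "Chemical library",
    "molecular representation",
    "in-vitro experimental validation",
    "QSAR",
]
print(drop_generic_terms(example))
```

Whether a term counts as "general" is still a curation decision made per model, so the set would need to be reviewed rather than applied blindly.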

Zainab-ik commented 2 months ago

Regarding the first two models, eos7kpb and eos80ch:

eos7kpb: physicochemical assays, clearance, solubility assay, cytotoxicity, aqueous solubility, permeability assay, microsomal metabolic stability. These metadata aren't integral to the Zairachem model; I wanted to run this by you first.
eos80ch: removed the following metadata: compound screening, phenotype, molecular representation, parasites.

GemmaTuron commented 2 months ago

Hi @Zainab-ik

Good work on the corrections. As we discussed, let's leave all the biological endpoints on eos7kpb.

Zainab-ik commented 2 months ago

Update: eos4zfy ready for review.

BioModels upload:

All revised models have been re-uploaded.
New model uploads:
- eos6hy3 - https://www.ebi.ac.uk/biomodels/MODEL2405130001
- eos42ez - https://www.ebi.ac.uk/biomodels/MODEL2405130002
- eos31ve - https://www.ebi.ac.uk/biomodels/MODEL2405130005

To-do's

[x] Create a google form for the upcoming hackathon - https://forms.gle/4xadZBjvP2SfgY1b9
[x] Design a flier - here
[x] Prepare a slide deck

Zainab-ik commented 2 months ago

Automating metadata annotation using Zooma

This process maps metadata terms to the right ontology automatically, to speed up annotation. I'll be starting with these two models.

Steps:

Extract relevant metadata manually.
Paste the metadata into Zooma to annotate.
Compare annotation accuracy with manual annotation.

Comments/observations:

Biological component mapping for organism has high accuracy.
Biological component mapping for property has average accuracy.
Computational component mapping for property has low accuracy.
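
The last step above (comparing Zooma's output with manual annotation) could be made quantitative with something like the sketch below. It assumes both annotation sets have already been exported as dictionaries keyed by `(category, term)`; the function name, the category labels and the `ID:…` values are placeholders for illustration, not real ontology identifiers or Zooma output.

```python
from collections import defaultdict


def agreement_by_category(manual, automated):
    """Fraction of manually annotated terms whose automated mapping matches.

    Both arguments map (category, term) -> ontology ID, with None meaning
    that no mapping was found. Only terms present in `manual` are scored.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for (category, term), manual_id in manual.items():
        totals[category] += 1
        if automated.get((category, term)) == manual_id:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Placeholder data: the IDs are made up and only illustrate the data shape.
manual = {
    ("biological/organism", "Plasmodium falciparum"): "ID:A",
    ("biological/organism", "Homo sapiens"): "ID:B",
    ("computational/property", "random forest"): "ID:C",
    ("computational/property", "Morgan fingerprint"): "ID:D",
}
automated = {
    ("biological/organism", "Plasmodium falciparum"): "ID:A",
    ("biological/organism", "Homo sapiens"): "ID:B",
    ("computational/property", "random forest"): "ID:C",
    ("computational/property", "Morgan fingerprint"): None,
}
print(agreement_by_category(manual, automated))
```

Treating the manual annotation as the reference is itself an assumption; disagreements would still need a manual look to decide which mapping is better.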

Zainab-ik commented 2 months ago

Coloring molecules model annotation

eos6ao8 - issue
eos1af5 - issue
eos43at - issue

All ready for review.

More permeability model annotation

eos97yu - issue
eos2hbd - issue

Ready for review.

Zainab-ik commented 1 month ago

Models uploaded to BioModels

eos4zfy - https://www.ebi.ac.uk/biomodels/MODEL2405210002
eos96ia - https://www.ebi.ac.uk/biomodels/MODEL2405210003
eos8d8a - https://www.ebi.ac.uk/biomodels/MODEL2405210004
eos6ao8 - https://www.ebi.ac.uk/biomodels/MODEL2405210005
eos1af5 - https://www.ebi.ac.uk/biomodels/MODEL2405210006
eos43at - https://www.ebi.ac.uk/biomodels/MODEL2405210007

Zainab-ik commented 1 month ago

New model Annotation - In Progress

eos2lqb - issue
eos6oli - issue
eos7d58 - issue
eos8lok - issue

Note: I've been working with a lot of regression models recently, which is quite exciting. One of the evaluation metrics is root-mean-square error (RMSE), which, from my reading, I believe is also known as root-mean-square deviation (RMSD). RMSE doesn't exist on OLS but RMSD does, so I've been using that in my annotations.

GemmaTuron commented 1 month ago

Hi @Zainab-ik !

I'm having a look at the models you are annotating; let me know when the Excel files are ready. RMSE and RMSD are the same ;)
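
For reference, a minimal sketch of the quantity both names refer to: the square root of the mean squared difference between observed and predicted values. The function name `rmse` is just illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error (also called root-mean-square deviation)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two predictions that are each off by 2 give an RMSE of 2.
print(rmse([0.0, 0.0], [2.0, 2.0]))
```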

Zainab-ik commented 1 month ago

Alright, Thanks @GemmaTuron

Zainab-ik commented 1 month ago

Hi @GemmaTuron, all models are ready for review except eos7d58. It has a broad output, and I'd like to confirm whether all the outputs are incorporated into the Ersilia version.

Zainab-ik commented 1 month ago

Grover Models

Is Grover a framework/code base like Chemprop that's fine-tuned and trained on different datasets for different outputs?
What are labelled and unlabelled molecular data?
What's the difference between pre-training and training an ML/DL model?
There's no clear mention of how the models were evaluated, except for comparison with other models based on the mean and standard deviation. There's also a mention of % relative improvement - can that be classified as accuracy? Are these regarded as the model evaluation metrics? (In the author-feedback section, AUC-ROC was mentioned as the metric for comparison.) - This is the metric for Grover.
How's the fine-tuning task evaluated? Let's say Grover was trained on predicting water solubility, as is the case for grover-esol (eos8451): how is the model's performance judged to be good or not? - In the supplementary file, ROC-AUC is the metric for the classification tasks, RMSE is the metric for the physical chemistry regression tasks, and MAE is the metric for the quantum mechanics regression tasks (it feels like I'm answering myself 🙂).
Can you kindly clarify validation loss and training loss? Thanks.
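
To make the metric names above concrete, here is a rough, dependency-free sketch of MAE and ROC-AUC. It is not how the Grover authors computed them; the function names are illustrative, and ROC-AUC is written in its pairwise form (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half).

```python
def mae(y_true, y_pred):
    """Mean absolute error for a regression task."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """ROC-AUC for binary labels (1 = positive, 0 = negative)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(mae([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration but not for large datasets.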

General comments about the Grover model

The metadata is determined by what task the Grover model is fine-tuned on.
Grover was leveraged for molecular property prediction tasks and task-specific fine-tuning. We'd be annotating the task-specific fine-tuning, taking note of the specific dataset, the type of task (classification/regression), and the predictions.
In the context of data-splitting for fine-tuning, active and inactive suits...

Zainab-ik commented 1 month ago

eos7w6n - This is the base model (GROVER) that was fine-tuned on task-specific datasets.

Grover Models - Annotation in Progress (Metadata extraction and curation done)

eos3xip - issue
eos6o0z - issue
eos85a3 - issue
eos8451 - issue
eos157v - issue
eos481p - issue
eos2mhp - issue
eos6fza - issue
eos5smc - issue
eos7w6n - issue
eos77w8 - issue
eos1amr - issue

All models ready for review.

GemmaTuron commented 1 month ago

Hi @Zainab-ik

Those look good, just a comment on QSAR - I would not annotate the general model as a QSAR. Grover is applied to different datasets as a molecular representation for QSAR or QSPR (structure-activity and structure-property)

GemmaTuron commented 1 month ago

Let's pause model annotation here for the week and focus on the documentation of the process, which will also be needed for the Hackathon. Let's use this document to create the information and then we will move it to GitBook.

Tasks:

[ ] Create Documentation
[ ] Incorporate it into GitBook
[ ] Prepare the intro for the Hackathon
[ ] convert one model to ONNX to try out

Zainab-ik commented 1 month ago

Update - All Grover models incorporated into BioModels.

eos3xip - https://www.ebi.ac.uk/biomodels/MODEL2406040002
eos6o0z - https://www.ebi.ac.uk/biomodels/MODEL2406040003
eos85a3 - https://www.ebi.ac.uk/biomodels/MODEL2406040004
eos8451 - https://www.ebi.ac.uk/biomodels/MODEL2406040005
eos157v - https://www.ebi.ac.uk/biomodels/MODEL2406050001
eos481p - https://www.ebi.ac.uk/biomodels/MODEL2406050002
eos2mhp - https://www.ebi.ac.uk/biomodels/MODEL2406050003
eos6fza - https://www.ebi.ac.uk/biomodels/MODEL2406050004
eos5smc - https://www.ebi.ac.uk/biomodels/MODEL2406050005
eos7w6n - https://www.ebi.ac.uk/biomodels/MODEL2406050006
eos77w8 - https://www.ebi.ac.uk/biomodels/MODEL2406050007
eos1amr - https://www.ebi.ac.uk/biomodels/MODEL2406050008

Non-Grover models uploaded

eos2lqb - https://www.ebi.ac.uk/biomodels/MODEL2406030001
eos6oli - https://www.ebi.ac.uk/biomodels/MODEL2406030002
eos97yu - https://www.ebi.ac.uk/biomodels/MODEL2406030003
eos8lok - https://www.ebi.ac.uk/biomodels/MODEL2406040001

Zainab-ik commented 1 month ago

Currently working on Documentation.

GemmaTuron commented 1 month ago

Good job @Zainab-ik

Let me know if you need help/review in the documentation process

Zainab-ik commented 3 weeks ago

Let's pause model annotation here for the week and focus on the documentation of the process, which will also be needed for the Hackathon. Let's use this document to create the information and then we will move it to GitBook.

Tasks:

[x] Create Documentation
[x] Incorporate it into GitBook
[x] Prepare the intro for the Hackathon
[ ] convert one model to ONNX to try out

All tasks are done except the ONNX conversion.

Zainab-ik commented 3 weeks ago

Current Tasks:

For the Hackathon, there are 5 open models for annotation.

[ ] Finish up the Hackathon open models in this order

eos1n4b
eos92sw
eos2ta5
eos9sa2
eos7pw8

[ ] Annotate the endpoint of the last 2 models

Zainab-ik commented 3 weeks ago

Model 1 - eos1n4b - issue

The drug target "Histone deacetylase 3" is related to different diseases such as cancer and diabetes. Those aren't related metadata.
This model was built using five algorithms and three descriptors. Algorithms: k-Nearest Neighbour (KNN), Support Vector Machine (SVM), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Deep Neural Network (DNN). Descriptors: Mordred descriptors, MACCS keys, Morgan fingerprint. The best-performing model is XGBoost with the Morgan fingerprint (that's the model deployed to the GUI application). For our annotation, we'd only be including the best-performing model and its features.
We have ROC enrichment as an evaluation metric between the validation and training datasets. I'm not sure it fits into the metadata. What do you think?

Model 2 - eos92sw - issue

Can I confirm if this is a neural network model? There are mentions of nodes, layers, and the type of algorithm.
It's difficult to identify the exact training dataset. It's a combination of data from several databases. Should we list them all, or how do we consolidate that?
All these algorithms were mentioned: Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), and Extremely Randomized Trees or Extra Trees (ET). However, there was more emphasis on the ET algorithm.
Both classification and regression task evaluations were done. In the Ersilia repository, only regression is mentioned as the task; should we stick to that for the evaluation metric?

Model 3 - eos2ta5 - issue

This is quite clear. Just to clarify, negative predictive value (NPV) and positive predictive value (PPV) are True positive and False positive, right?

GemmaTuron commented 2 weeks ago

Hi @Zainab-ik, good job on those with the Hackathon team. See my comments below:

Model 1 - eos1n4b - https://github.com/ersilia-os/eos1n4b/issues/8

The drug target "Histone deacetylase 3" is related to different diseases such as cancer and diabetes. Those aren't related metadata. - Indeed, you are right.

This model was built using five algorithms and three descriptors. Algorithms: k-Nearest Neighbour (KNN), Support Vector Machine (SVM), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Deep Neural Network (DNN). Descriptors: Mordred descriptors, MACCS keys, Morgan fingerprint. The best-performing model is XGBoost with the Morgan fingerprint (that's the model deployed to the GUI application). For our annotation, we'd only be including the best-performing model and its features. - Yes, that is correct, good.

We have ROC enrichment as an evaluation metric between the validation and training datasets. I'm not sure it fits into the metadata. What do you think? - We can add ROC curve as the evaluation metric and that's it?

Model 2 - eos92sw - https://github.com/ersilia-os/eos92sw/issues/12

Can I confirm if this is a neural network model? There are mentions of nodes, layers, and the type of algorithm. - As stated in the publication: "In this study, we utilize a DBN", so that is the type of network they chose.

It's difficult to identify the exact training dataset. It's a combination of data from several databases. Should we list them all, or how do we consolidate that? - Yes, list the databases in Table 1.

All these algorithms were mentioned: Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), and Extremely Randomized Trees or Extra Trees (ET). However, there was more emphasis on the ET algorithm. - Wasn't it a DBN?

Both classification and regression task evaluations were done. In the Ersilia repository, only regression is mentioned as the task; should we stick to that for the evaluation metric? - Yes.

Model 3 - eos2ta5 - https://github.com/ersilia-os/eos2ta5/issues/6

This is quite clear. Just to clarify, negative predictive value (NPV) and positive predictive value (PPV) are True positive and False positive, right? - They are the proportions of negative and positive predictions that are true negatives and true positives, respectively, if I am not wrong.
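
As a quick check on the PPV/NPV definitions, here is a small sketch computed from confusion-matrix counts. The function name and the example counts are made up for illustration and do not come from the eos2ta5 publication.

```python
def ppv_npv(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts.

    PPV = TP / (TP + FP): share of positive predictions that are correct.
    NPV = TN / (TN + FN): share of negative predictions that are correct.
    A value is NaN when the model made no predictions of that class.
    """
    ppv = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    npv = tn / (tn + fn) if (tn + fn) > 0 else float("nan")
    return ppv, npv


# Made-up counts: 8 true positives, 2 false positives,
# 9 true negatives, 1 false negative.
print(ppv_npv(tp=8, fp=2, tn=9, fn=1))
```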
