ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

[Initiative]: Annotate Ersilia's models following BioModels standards #1059

Open miquelduranfrigola opened 4 months ago

miquelduranfrigola commented 4 months ago

Summary

We have partnered with BioModels at EMBL-EBI (Hinxton) to explore ways to incorporate Ersilia's models into the well-established BioModels resource.

Of note, BioModels model annotation is based on ontologies as reported in the Ontology Lookup Service. We expect to reach similar standards thanks to the current project.

Scope

Initiative 🐋

Objective(s)

The objectives of the project are the following:

  1. Incorporate Ersilia's models into BioModels (metadata only).
  2. Adopt an ontology-based model annotation procedure for Ersilia that is harmonized with that of BioModels.
  3. Set the basis for a more ambitious incorporation of models based on ONNX format.

Team

Role & Responsibility Username(s)
DRI / Lead Developer @Zainab-ik
Project Manager @miquelduranfrigola

@Zainab-ik is currently doing an internship at EMBL-EBI in the BioModels team.

Importantly, @Zainab-ik will meet with @miquelduranfrigola twice a week to report progress and decide next steps. Before each meeting, @Zainab-ik will update the corresponding model issues and, after the meeting, action items will be reflected in the issues.

Timeline

The project timeline is still up for discussion. These are some tentative milestones:

Documentation

A backlog of models can be found in the Ersilia BioModels Spreadsheet. This spreadsheet should act as a centralized resource to keep track of progress.

The shared folder in Google Drive can be accessed here.

Zainab-ik commented 2 months ago

New Models

Zainab-ik commented 2 months ago

Antimicrobial and COVID models uploaded to BioModels

GemmaTuron commented 2 months ago

Hey @Zainab-ik

Before starting with new models, can you have a look at the existing ones and make sure they all comply with the latest decisions we have made? Note down here any changes that had to be made in the annotations.

thanks!

Zainab-ik commented 2 months ago

Yes, working on that.

Zainab-ik commented 2 months ago

Previous Model review Summary - Removed general metadata, and confirmed experimental validation

Zainab-ik commented 2 months ago

Regarding the first 2 models: eos7kpb, eos80ch

GemmaTuron commented 2 months ago

Hi @Zainab-ik

Good work on the corrections. As we discussed, let's leave all the biological endpoints on eos7kpb.

Zainab-ik commented 2 months ago

Update: eos4zfy ready for review.

BioModels Upload:

  1. All revised models have been re-uploaded
  2. New model upload

To-do's

Zainab-ik commented 2 months ago

Automating Metadata Annotation using Zooma

This process maps the metadata to the right ontology terms automatically, to speed up annotation. I'll be starting with these two models.

Steps:

  1. Extract the relevant metadata manually.
  2. Paste the metadata into Zooma to annotate it.
  3. Compare the accuracy of the automated annotation with the manual annotation.
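
Step 3 can be sketched in a few lines of Python. This is only an illustration of the comparison, not the procedure used in the project: the function name `annotation_agreement`, the metadata field names and the `TERM:...` identifiers are all hypothetical placeholders, and the automated annotation is assumed to have already been exported from Zooma into a plain dictionary.

```python
def annotation_agreement(manual, automated):
    """Compare automated ontology annotations against manual ones.

    Both arguments map a metadata field to an ontology term ID.
    Returns the fraction of manually annotated fields for which the
    automated annotation gives the same term, plus the fields that differ.
    """
    if not manual:
        raise ValueError("manual annotation is empty")
    # A field counts as a mismatch if the term differs or is missing.
    mismatches = [
        field for field, term in manual.items()
        if automated.get(field) != term
    ]
    agreement = 1 - len(mismatches) / len(manual)
    return agreement, mismatches


# Hypothetical placeholder data: these are not real ontology IDs.
manual = {
    "algorithm": "TERM:0001",
    "task": "TERM:0002",
    "metric": "TERM:0003",
    "descriptor": "TERM:0004",
}
automated = {
    "algorithm": "TERM:0001",
    "task": "TERM:0002",
    "metric": "TERM:0009",  # different term suggested
    # "descriptor" has no automated suggestion
}

agreement, mismatches = annotation_agreement(manual, automated)
print(f"agreement: {agreement:.2f}, to review: {mismatches}")
```

Treating a missing suggestion as a mismatch keeps the score conservative: a field that the automated step could not annotate still needs manual work.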

Comments/Observation

Zainab-ik commented 2 months ago

Coloring molecules model annotation

All ready for review.

More permeability model annotation

Zainab-ik commented 1 month ago

Models uploaded to BioModels

Zainab-ik commented 1 month ago

New model Annotation - In Progress

  • eos2lqb - issue
  • eos6oli - issue
  • eos7d58 - issue
  • eos8lok - issue

Note: I've been working with a lot of regression models recently, which is quite exciting. One of the evaluation metrics is root-mean-square error (RMSE), which, from my reading, I believe is also known as root-mean-square deviation (RMSD). On OLS, RMSE doesn't exist but RMSD does, so I've been using that in my annotation.

GemmaTuron commented 1 month ago

Hi @Zainab-ik !

I'm having a look at the models you are annotating; let me know when the Excel files are ready. RMSE and RMSD are the same ;)
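
For reference, both names refer to the same quantity: the square root of the mean squared difference between observed and predicted values. A minimal sketch in plain Python (the function name `rmse` and the example numbers are illustrative only, not taken from any of the models):

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error (also called root-mean-square deviation)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values: errors of 3 and 4 give sqrt((9 + 16) / 2).
print(rmse([0.0, 0.0], [3.0, 4.0]))
```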

Zainab-ik commented 1 month ago

Alright, Thanks @GemmaTuron

Zainab-ik commented 1 month ago

Hi @GemmaTuron, all models are ready for review except eos7d58. It has a broad output and I'd like to confirm whether all the outputs are incorporated into the Ersilia version.

Zainab-ik commented 1 month ago

Grover Models

General comments about the Grover model

Zainab-ik commented 1 month ago

eos7w6n - This is the base model (GROVER) that was fine-tuned on task-specific datasets.

Grover Models - Annotation in Progress (Metadata extraction and curation done)

Zainab-ik commented 1 month ago

All models ready for review.

GemmaTuron commented 1 month ago

Hi @Zainab-ik

Those look good, just a comment on QSAR - I would not annotate the general model as a QSAR. Grover is applied to different datasets as a molecular representation for QSAR or QSPR (structure-activity and structure-property)

GemmaTuron commented 1 month ago

Let's pause model annotation here for the week and focus on documenting the process, which will also be needed for the Hackathon. Let's use this document to draft the information, and then we will move it to GitBook.

Tasks:

Zainab-ik commented 1 month ago

Update - All Grover models incorporated into BioModels.

Non-Grover models uploaded

Zainab-ik commented 1 month ago

Currently working on Documentation.

GemmaTuron commented 1 month ago

Good job @Zainab-ik

Let me know if you need help/review in the documentation process

Zainab-ik commented 3 weeks ago

Tasks:

  • [x] Create Documentation
  • [x] Incorporate it into GitBook
  • [x] Prepare the intro for the Hackathon
  • [ ] Convert one model to ONNX to try it out

All tasks done except the ONNX conversion.

Zainab-ik commented 3 weeks ago

Current Tasks:

For the Hackathon, there are 5 models open for annotation.

  1. eos1n4b
  2. eos92sw
  3. eos2ta5
  4. eos9sa2
  5. eos7pw8

Zainab-ik commented 3 weeks ago

Model 1 - eos1n4b - issue

Model 2 - eos92sw - issue

Model 3 - eos2ta5 - issue

This is quite clear. Just to clarify: are negative predictive value (NPV) and positive predictive value (PPV) the same as true positive and false positive?

GemmaTuron commented 2 weeks ago

Hi @Zainab-ik, good job on those with the Hackathon team. See my comments below, marked as replies:

Model 1 - eos1n4b - https://github.com/ersilia-os/eos1n4b/issues/8

  • The drug target "Histone deacetylase 3" is related to different diseases, such as cancer and diabetes. Those aren't related metadata.
    Reply: Indeed, you are right.
  • This model was built using 5 algorithms and 3 descriptors. Algorithms: k-Nearest Neighbour (KNN), Support Vector Machine (SVM), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Deep Neural Network (DNN). Descriptors: Mordred descriptors, MACCS keys, Morgan fingerprint. The best-performing model is XGBoost with the Morgan fingerprint (that's the model deployed to the GUI application). For our annotation, we'd only include the best-performing model and its features.
    Reply: Yes, that is correct, good.
  • We have an ROC enrichment as an evaluation metric between the validation and training datasets. I'm not sure it fits into the metadata. What do you think?
    Reply: We can add ROC curve as the evaluation metric and that's it?

Model 2 - eos92sw - https://github.com/ersilia-os/eos92sw/issues/12

  • Can I confirm if this is a neural network model? There are mentions of nodes, layers, and the type of algorithm.
    Reply: As stated in the publication, "In this study, we utilize a DBN", so that is the type of network they chose.
  • It's difficult to identify the exact training dataset. It's a combination of data from several databases. Should we list them all, or how do we consolidate that?
    Reply: Yes, list the databases in Table 1.
  • All these algorithms were mentioned: Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), and Extremely Randomized Trees or Extra Trees (ET). However, there was more emphasis on the ET algorithm.
    Reply: Wasn't it a DBN?
  • Both classification and regression task evaluations were done. In the Ersilia repository, regression is the only task mentioned. Should we stick to that for the evaluation metric?
    Reply: Yes.

Model 3 - eos2ta5 - https://github.com/ersilia-os/eos2ta5/issues/6

  • This is quite clear. Just to clarify: are negative predictive value (NPV) and positive predictive value (PPV) the same as true positive and false positive?
    Reply: No. If I am not wrong, they are the proportions of negative and positive predictions that are true negatives and true positives, respectively.
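
To make the NPV/PPV definitions concrete, here is a small Python sketch using the standard formulas PPV = TP / (TP + FP) and NPV = TN / (TN + FN). The function name `predictive_values` and the confusion-matrix counts are made up for illustration and do not come from eos2ta5.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts.

    PPV: fraction of positive predictions that are true positives.
    NPV: fraction of negative predictions that are true negatives.
    """
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("need at least one positive and one negative prediction")
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Made-up counts: 50 positive predictions (40 correct),
# 50 negative predictions (45 correct).
ppv, npv = predictive_values(tp=40, fp=10, tn=45, fn=5)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

So true positives and false positives are counts that go into PPV; they are not the metrics themselves.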