ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

[Initiative]: Annotate Ersilia's models following BioModels standards #1059

Open miquelduranfrigola opened 3 months ago

miquelduranfrigola commented 3 months ago

Summary

We have partnered with BioModels at EMBL-EBI (Hinxton) to explore potential ways to incorporate Ersilia's models into the well-established BioModels resource.

Of note, BioModels model annotation is based on ontologies as reported in the Ontology Lookup Service. We expect to reach similar standards thanks to the current project.

Scope

Initiative

Objective(s)

The objectives of the project are the following:

  1. Incorporate Ersilia's models into BioModels (metadata only).
  2. Adopt an ontology-based model annotation procedure for Ersilia that is harmonized with that of BioModels.
  3. Set the basis for a more ambitious incorporation of models based on ONNX format.

Team

Role & Responsibility | Username(s)
DRI / Lead Developer | @Zainab-ik
Project Manager | @miquelduranfrigola

@Zainab-ik is currently doing an internship at EMBL-EBI in the BioModels team.

Importantly, @Zainab-ik will meet with @miquelduranfrigola twice a week to report progress and decide next steps. Before each meeting, @Zainab-ik will update the corresponding model issues and, after the meeting, action items will be reflected in the issues.

Timeline

The project timeline is still up for discussion. These are some tentative milestones:

Documentation

A backlog of models can be found in the Ersilia BioModels Spreadsheet. This spreadsheet should act as a centralized resource to keep track of progress.

The shared folder in Google Drive can be accessed here.

miquelduranfrigola commented 3 months ago

Hello @Zainab-ik, as discussed, let's start by conceiving an issue template to prompt discussion about each model individually.

I suggest that we start by doing this in the current antimalarial model, then we can replicate the template to other models as we see fit. In my opinion, the template should not be too complex.

miquelduranfrigola commented 3 months ago

@Zainab-ik here are some questions in preparation for our meeting with Sheriff. Feel free to add more:

Zainab-ik commented 3 months ago

I'd be working on the issue template. Note: The Ersilia BioModel spreadsheet seems to be empty.

Zainab-ik commented 3 months ago

@Zainab-ik here are some questions in preparation for our meeting with Sheriff. Feel free to add more:

  • What is the minimum and maximum number of qualifiers in a model? How many are recommended?
  • Is there a convention for naming models? Is it the year & title of publication?
  • Is there a structure or guidelines for model descriptions?
  • Many papers have extra analysis not directly related to the model. For example, dimensionality reduction with UMAP, or clustering. Do we need to include these in the metadata?
  • Do you have any experience with the chemical information ontology?

miquelduranfrigola commented 3 months ago

I'd be working on the issue template. Note: The Ersilia BioModel spreadsheet seems to be empty.

Yes it is empty for now. Please add the two models that we are currently working on and then we will add more.

Zainab-ik commented 3 months ago

Update after meeting with Sheriff:

@miquelduranfrigola Am I missing anything?

miquelduranfrigola commented 3 months ago

Thanks @Zainab-ik - this is very useful. I don't think anything is missing. Perhaps just mention that BAO is also an important ontology to consider.

Zainab-ik commented 3 months ago

Update!!!

Zainab-ik commented 3 months ago

GitHub Issue Template

While discussing with @miquelduranfrigola, he suggested that I create an issue template, open an issue from it for each model I'm annotating, link those issues to this main issue to keep track of the work, and finally close them once the model is uploaded to the BioModels repository.

Using the Ersilia issue template as a sample, I came up with a draft, and I'd like a review before incorporating it into each model repository: BioModel Incorporation Issue

I'd also like to ask how the template would be used, considering we'd have to open the issues in each model repository and not in the general repository.

GemmaTuron commented 3 months ago

Hi @Zainab-ik

After our meeting today, please:

From my side, I'll prioritize some further models for annotation. And we have decided that, once we have completed the annotation of at least 10 models, we will start thinking about:

Zainab-ik commented 3 months ago

Hi @Zainab-ik

After our meeting today, please:

  • go ahead and open the issues in the two models we are working on, following your proposed template. We will try it out and, once we are happy with it, we will upload it to all repos as a template
  • Add the publications of the models in the folder
  • Finish the model annotations for both and add any questions / comments you might have on the issues, so we can initiate a discussion

From my side, I'll prioritize some further models for annotation. And we have decided that, once we have completed the annotation of at least 10 models, we will start thinking about:

  • validation of the models
  • automatically storing biomodel annotations in Ersilia

Following the meeting.

I'll work on completing the annotation. I've sorted out the compact identifiers with the EBI team. I'll also try uploading one model to BioModels with Sheriff, to give a sample of what the issue template information would look like.
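For readers unfamiliar with compact identifiers, here is a minimal sketch of how a `prefix:accession` string can be split and turned into an Identifiers.org resolver URL. The function names and the simplified pattern are assumptions made for illustration, and `taxonomy:9606` is only a generic example, not an annotation used in this project.

```python
import re
from typing import Tuple

# Simplified pattern for a compact identifier of the form "prefix:accession".
# This is an illustrative check, not the full Identifiers.org grammar.
_COMPACT_ID = re.compile(r"^(?P<prefix>[A-Za-z0-9_.]+):(?P<accession>\S+)$")


def split_compact_id(compact_id: str) -> Tuple[str, str]:
    """Split 'prefix:accession' into its two parts, or raise ValueError."""
    match = _COMPACT_ID.match(compact_id.strip())
    if match is None:
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    return match.group("prefix"), match.group("accession")


def resolver_url(compact_id: str) -> str:
    """Build an Identifiers.org resolver URL for a compact identifier."""
    prefix, accession = split_compact_id(compact_id)
    return f"https://identifiers.org/{prefix}:{accession}"


if __name__ == "__main__":
    # Generic example only (NCBI Taxonomy entry for Homo sapiens).
    print(resolver_url("taxonomy:9606"))
```

Keeping the split and the URL construction separate makes it easy to validate an annotation file before anything is uploaded.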

GemmaTuron commented 3 months ago

Hi @Zainab-ik

Thanks! This is looking good. As I stated in the model issues, I suggest we have two issues: one for discussion, and one that we will only open once we know which data from BioModels we want to store in Ersilia as well. If you agree, let's go ahead and use the open issues to create those "discussion" issues for models eos80ch and eos7kpb, so we can fully annotate these two and then proceed to the next ones. I'd say the second issue, to collect data from BioModels for storing in Ersilia, can be built once we have at least 10 models annotated and know better what kind of information we want to collect.

Zainab-ik commented 3 months ago

Hi @GemmaTuron

I've created the discussion issue around eos80ch and eos7kpb.

I've completed the annotation of eos80ch and I'd like your review before uploading. The annotation of eos7kpb should be completed before tomorrow. I'll make changes to the uploaded file, since it's not a Google Sheet.

GemmaTuron commented 3 months ago

Thanks @Zainab-ik ! I have a few suggestions on the discussion template, let me know your thoughts

Zainab-ik commented 3 months ago

Hi @GemmaTuron

I've worked through the suggestions: completed the annotation for the two models, updated the link, and added metadata information for eos7kpb. I'm clear on the eos80ch model, and it's been uploaded. I'll share it when it's available to the public, which should be by tomorrow.

Do I go ahead and start working on the priority models in the sheet?

Also, there's an option of opening an account on BioModels to review submissions. BioModels offers a few ways to collaborate on, review, or access models:

  1. Invite your team/colleagues/contributors to open an account on BioModels, and then add them as model contributors. You can grant write or read permission to these contributors.
  2. Request and open a reviewer account. This option is meant for when a scientific manuscript is in the middle of the review process and reviewers ask to look into your model. Of course, this type of account only gives read-access permission.

I think option 1 applies to us. I could share my submission for review. Either @GemmaTuron or @miquelduranfrigola, or both, can have an account. What do you think?

miquelduranfrigola commented 3 months ago

@GemmaTuron feel free to take the lead here. Thanks @Zainab-ik for a very clear update.

GemmaTuron commented 3 months ago

Hi @Zainab-ik !

Thanks, good start! Feedback from today's meeting:

If you are done with all the tasks before our next meeting, I suggest you have a look at the model incorporation that is still midway, but this is lower priority.

Zainab-ik commented 3 months ago

Feedback from BioModels (Sheriff) !!

I've incorporated all the feedback into the two models. I believe both models are fully annotated.

Based on the feedback

The following are/would be standard metadata in all models:

Zainab-ik commented 3 months ago

Update!!!

DOME annotation completed and both models are up on BioModels.

  • eos7kpb - https://www.ebi.ac.uk/biomodels/MODEL2403270001
  • eos80ch - https://www.ebi.ac.uk/biomodels/MODEL2403270002

These have been linked in the respective repositories.

Zainab-ik commented 3 months ago

eos46ev !!!

  1. For this model, 4 ML algorithms were used in building the models. I added all of them to the metadata, considering that the final model (deployed to the web server) is a combination of all.
  2. Although XGBoost is stated to be the best, the final model is a fusion of 4 algorithms: Random Forest, Deep Neural Network, Support Vector Machine, and XGBoost.
  3. Looking at the repository, I realised Ersilia only implemented the XGBoost model. Does that make the rest of the algorithms unimportant as metadata?

More detailed comments/questions are in the issue here. The curation/annotation is completed and can be accessed here.

Zainab-ik commented 3 months ago

eos4e40 !!!

  1. Halicin was discovered with the DNN model. As an important part of the paper, would it be important metadata? And in which category (property or output)?
  2. Halicin has bactericidal activity against Mycobacterium tuberculosis and carbapenem-resistant Enterobacteriaceae. Do these classify as biological properties of the model?

A quick question

I realized that the use of the terms active, inactive, hit, and non-hit when describing data binarization depends on the paper. How do we pick a standard, then? They are all mapped to ontology terms except non-hit.

The curation/annotation can be accessed here.
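To make the binarization-label question concrete, here is a minimal sketch of normalizing paper-specific labels to one pair before annotation. Choosing "active"/"inactive" as the standard pair, and treating "hit"/"non-hit" as their synonyms, is an assumption made purely for illustration; the actual convention is still to be agreed in this thread. All names are hypothetical.

```python
# Hypothetical mapping from paper-specific binarization labels to one
# standard pair. Using "active"/"inactive" as the standard is an assumption
# for illustration only, not a decision taken in this project.
LABEL_SYNONYMS = {
    "active": "active",
    "hit": "active",
    "inactive": "inactive",
    "non-hit": "inactive",
}


def standardize_label(label: str) -> str:
    """Return the standard label for a paper-specific binarization label."""
    key = label.strip().lower()
    if key not in LABEL_SYNONYMS:
        raise KeyError(f"unknown binarization label: {label!r}")
    return LABEL_SYNONYMS[key]


if __name__ == "__main__":
    for raw in ["Hit", "non-hit", "Active", "inactive"]:
        print(raw, "->", standardize_label(raw))
```

With a table like this, only the standard labels would need an ontology term, which would sidestep the missing term for non-hit.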

Zainab-ik commented 3 months ago

eos5xng !!!

  1. ESKAPE pathogen inhibition is the experimental validation of the AI model, if I'm right? If yes, then those pathogens do not classify as a taxonomy in the metadata.
  2. For the model training and prediction, both classification and regression tasks were performed. The Ersilia model only performed classification, and that should be the only task included in the metadata, right?
  3. Both RMSE and MAE scores are evaluation metrics for regression tasks. If 2 is yes, then both methods would apply.

The curation/annotation is completed and linked here.
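As a reminder of why RMSE and MAE belong to regression rather than classification, here is a minimal sketch of both metrics on continuous predictions. The function names are illustrative and the numbers are made up; nothing here is taken from the eos5xng publication.

```python
import math
from typing import Sequence


def _check_lengths(y_true: Sequence[float], y_pred: Sequence[float]) -> None:
    """Reject empty or mismatched inputs."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average of |true - predicted|."""
    _check_lengths(y_true, y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: square root of the average squared error."""
    _check_lengths(y_true, y_pred)
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


if __name__ == "__main__":
    # Made-up continuous values, for illustration only.
    observed = [1.0, 2.0, 3.0]
    predicted = [2.0, 2.0, 5.0]
    print("MAE:", mae(observed, predicted))
    print("RMSE:", rmse(observed, predicted))
```

Both metrics compare continuous values, so they only make sense as metadata when the regression task itself is part of the annotated model.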

Zainab-ik commented 3 months ago

An open-ended Question

"How much of the model properties, i.e., core model properties (e.g., packages, libraries, open-source software), should be curated and annotated?" Examples below:

GemmaTuron commented 3 months ago

Hi @Zainab-ik,

Good job, thanks for the updates, please find below some comments:

Zainab-ik commented 3 months ago

Hi @Zainab-ik,

Good job, thanks for the updates, please find below some comments:

Thank you @GemmaTuron

  • I do not understand this sentence: "For Proprietary data, URL should be added if it's available. If not, it should be included in the metadata for transparency. For eos7kpb, I added it and annotated it with a suitable ontology since there's no URL available." As it is proprietary data, it will never have an available URL, as the data is not shared. What do you mean you have added it?

For this, I added the term "H3D Proprietary" as metadata and annotated it with a suitable ontology term and the ontology link. I didn't mean that I added a link to the proprietary data. Sheriff mentioned that the term should be added for transparency.

  • Regarding the updated models, please do not update them on BioModels until I have revised them and given the final OK. Remember to use this Excel sheet to track progress: if the model is still marked "To review", it has not yet been approved. This way we can be sure all the information in BioModels is 100% correct.

Noted @GemmaTuron. That was uploaded as a sample, to give an insight into how the overview would look and to see if the Ersilia team would like any changes or has any comments. I'd appreciate feedback on that. The upload can always be updated.

  • Some of the links in the BioModels website seem broken, could you check that?

I'll inform the BioModels team. Could you please specify which ones, so I can mention them exactly?

  • Let's consolidate the tags for all models. Can you share with me the list of available tags?

[Screenshot 2024-04-03 at 10 04 32]

This is the list of available tags. A new one can be proposed if that'd be more suitable for Ersilia models.

  • Are Active / Inactive properties or Outputs?

They are properties. More like data properties very relevant to the model.

Zainab-ik commented 3 months ago

eos5xng !!!

  • I opened an issue here, and added a comment below:
  1. ESKAPE pathogen inhibition is the experimental validation of the AI model, if I'm right? If yes, then those pathogens do not classify as a taxonomy in the metadata.
  2. For the model training and prediction, both classification and regression tasks were performed. The Ersilia model only performed classification, and that should be the only task included in the metadata, right?
  3. Both RMSE and MAE scores are evaluation metrics for regression tasks. If 2 is yes, then both methods would apply.

The curation/annotation is in progress...

This can be attended to.

Zainab-ik commented 3 months ago

Update !!!

  1. eos4e40 - https://www.ebi.ac.uk/biomodels/MODEL2404080001
  2. eos5xng - https://www.ebi.ac.uk/biomodels/MODEL2404080002
  3. eos46ev - https://www.ebi.ac.uk/biomodels/MODEL2404080003

Next Point of Action - Annotate NCATS models.

Zainab-ik commented 2 months ago

NCATS Metabolism Models !!!

Models | Specifics | BioModel Title | Annotation File
eos3ev6 | CYP3A4 | Gonzalez2021 - QSAR Prediction Model for CYP3A4 Inhibitor and Substrate | here
eos7nno | CYP2D6 | Gonzalez2021 - QSAR Prediction Model for CYP2D6 Inhibitor and Substrate | here
eos5jz9 | CYP2C9 | Gonzalez2021 - QSAR Prediction Model for CYP2C9 Inhibitor and Substrate | here
eos44zp | CYP450 | Gonzalez2021 - QSAR Prediction Model for CYP450 enzyme Inhibitor and Substrate | here

Comments

Suggestions

GemmaTuron commented 2 months ago

Hi @Zainab-ik

Thanks for the update. A few pointers:

Zainab-ik commented 2 months ago

eos3804 !!!

  1. In the publication, they mention E. coli K12 as a model organism. Does that mean it is the host of A. baumannii? If yes, that'd be an in vitro model host and can't classify as a taxonomy in the metadata, right?

Metadata curation and annotation can be accessed here

Zainab-ik commented 2 months ago

Permeability Models

eos9tyg and eos81ew; PAMPA 7.4 & PAMPA 5 !!!

Here are a few comments (from the PAMPA 5 publication):

From the original PAMPA publication:

Errors

I noticed some errors in the eos81ew repository while looking through the model checkpoints and frameworks.

  1. In the eos81ew repo, there's a README error in this folder: a mention of kinetic aqueous solubility, which belongs to eos74bo. [Screenshot 2024-04-15 at 14 46 31]

  2. In the framework folder for eos81ew, there's also a README description about eos74bo and a GitHub link to eos8ykt, which doesn't seem to exist in the Ersilia Model Hub. [Screenshot 2024-04-15 at 14 51 12]

Zainab-ik commented 2 months ago

Update !!!

NCATS Metabolism Models Uploaded on BioModels.

  1. eos3ev6 - https://www.ebi.ac.uk/biomodels/MODEL2404160001
  2. eos7nno - https://www.ebi.ac.uk/biomodels/MODEL2404160003
  3. eos5jz9 - https://www.ebi.ac.uk/biomodels/MODEL2404160002
  4. eos44zp - https://www.ebi.ac.uk/biomodels/MODEL2404160004
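The BioModels links posted so far in this thread can be collected into one lookup table, which may help when linking each Ersilia repository to its BioModels entry. This is a minimal sketch: the identifiers are copied from the updates above, while the dictionary and function names are hypothetical.

```python
# Ersilia model ID -> BioModels ID, copied from the updates in this thread.
ERSILIA_TO_BIOMODELS = {
    "eos7kpb": "MODEL2403270001",
    "eos80ch": "MODEL2403270002",
    "eos4e40": "MODEL2404080001",
    "eos5xng": "MODEL2404080002",
    "eos46ev": "MODEL2404080003",
    "eos3ev6": "MODEL2404160001",
    "eos5jz9": "MODEL2404160002",
    "eos7nno": "MODEL2404160003",
    "eos44zp": "MODEL2404160004",
}

# URL prefix as it appears in the links above.
BIOMODELS_BASE_URL = "https://www.ebi.ac.uk/biomodels/"


def biomodels_url(ersilia_id: str) -> str:
    """Return the BioModels URL for an Ersilia model, or raise KeyError."""
    return BIOMODELS_BASE_URL + ERSILIA_TO_BIOMODELS[ersilia_id]


if __name__ == "__main__":
    for model_id in sorted(ERSILIA_TO_BIOMODELS):
        print(model_id, "->", biomodels_url(model_id))
```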

GemmaTuron commented 2 months ago

@Zainab-ik

Please follow the guidelines we drafted. When you start working on a new model, you should:

Please move the above comments to where they belong, and I will answer there, thanks!

When you do so, please clarify what you refer to with this: "The PubChem bioassay dataset indicated in the NCATS website is also different." Different from what, and for which model?

To which model are you referring here? "The publication made mention of 2 models; a classifier and a neural network." The fact that the model is a neural network does not prevent it from being a classifier at the same time.

Zainab-ik commented 2 months ago

@Zainab-ik

Please follow the guidelines we drafted. When you start working on a new model, you should:

  • Mark it as ongoing on the shared Excel
  • Open an issue on the specific model repository
  • Create a file for the annotation in the shared folder
  • Add the publication in the drive

All done

Please move the above comments to where they belong, and I will answer there, thanks!

When you do so, please clarify what you refer to with this: "The PubChem bioassay dataset indicated in the NCATS website is also different." Different from what, and for which model?

While going through the NCATS website, I noticed that the bioassay datasets for the two PAMPA models are different:

  • PAMPA 5.0 - eos81ew - https://pubchem.ncbi.nlm.nih.gov/bioassay/1645871
  • PAMPA 7.4 - eos9tyg - https://pubchem.ncbi.nlm.nih.gov/bioassay/1508612

To which model are you referring here? "The publication made mention of 2 models; a classifier and a neural network." The fact that the model is a neural network does not prevent it from being a classifier at the same time.

Thanks for clarifying this.

Zainab-ik commented 2 months ago

NCATS Permeability Models

@GemmaTuron Kindly review.

Zainab-ik commented 2 months ago

NCATS Stability models;

NCATS solubility model;

ready for review

Zainab-ik commented 2 months ago

Update !!! I opened a couple of PRs

Other NCATS models uploaded to BioModels

Zainab-ik commented 2 months ago

Antimicrobial models annotation

  1. eos24jm - issue

  2. eos5cl7 - issue

  3. eos18ie - issue

Questions

Zainab-ik commented 2 months ago

SARS-COV2 model annotation

  1. eos8fth - issue
  2. eos4cxk - issue
  3. eos9f6t - issue

Regarding eos9f6t - The publication here is the same as for eos4e40, but this model is for SARS-COV2 inhibition. The paper discusses antibiotics, but a SARS-COV2 model should be antiviral. Can you clarify, please?

GemmaTuron commented 2 months ago

Hi @Zainab-ik !

Good job thanks for keeping it up! I have answered your questions in the respective models and below the general ones:

Zainab-ik commented 2 months ago

Hi @Zainab-ik !

Good job thanks for keeping it up! I have answered your questions in the respective models and below the general ones:

Thank you @GemmaTuron

  • I created a new tag in BioModels called Ersilia and that'd be attached to all models. - Fantastic!

Questions

  • Can all the drug discovery models be referred to as QSAR models? - Mmm, at the moment most of the models we have are QSAR, yes, but that might not be true in the future. @miquelduranfrigola what do you say here?

That's great. That'd mean the QSAR metadata should be a constant one, right? Just a thought: can a generative model classify as QSAR too?

  • If an animal model is used to perform experimental validation of the model, should that be added as a biological property of the model, i.e., taxonomy? - I don't think so; this is related to the validation, not to how the dataset for the model was built.

Okay, that's clarified. What if an experimental method (in vivo, precisely) is used to generate the dataset? Should the experimental method and the in vivo model then be added as metadata?

  • The publication here is the same as for eos4e40, but this model is for SARS-COV2 inhibition. The paper discusses antibiotics, but a SARS-COV2 model should be antiviral. Can you clarify, please? - The antiviral model does not have a publication per se, but they developed it in parallel with the antibiotic predictor, using ChemProp. Since the antibiotic prediction paper is the one that describes the original ChemProp development, it is the most appropriate citation.

The metadata would be the same except for the organism and output, plus an added antiviral metadata term.

Zainab-ik commented 2 months ago

SARS-COV2 model annotation

  1. eos8fth - issue
  2. eos4cxk - issue
  3. eos9f6t - issue

Regarding eos9f6t - The publication here is the same as for eos4e40, but this model is for SARS-COV2 inhibition. The paper discusses antibiotics, but a SARS-COV2 model should be antiviral. Can you clarify, please?

@GemmaTuron All models ready for review.

Zainab-ik commented 2 months ago

Hi @GemmaTuron

A few clarifications from the meeting;

GemmaTuron commented 2 months ago

Hi @Zainab-ik ! I have reviewed the models, please amend them and then upload to BioModels. A few general comments from our meeting:

After redoing the current models to review, let's get back to the old ones before we move onto the new ones. Feel free to reopen the issues and note the changes that should be made

Zainab-ik commented 2 months ago

A clarification regarding in vivo and in vitro: if it's used for data generation, it's not to be added, right @GemmaTuron?

GemmaTuron commented 2 months ago

A clarification regarding in vivo and in vitro: if it's used for data generation, it's not to be added, right @GemmaTuron?

Exactly. All data has eventually been generated experimentally, so it is not that relevant to collect this information.

Zainab-ik commented 2 months ago

General fields that do not add information:

GemmaTuron commented 2 months ago

Hi @Zainab-ik

I agree with most of them, but MACCS keys are a different type of descriptor. If the model is using RDKit descriptors, we should annotate that; if it is using MACCS, we should annotate that instead. Maybe we should also think about whether we want to annotate all the different descriptors used.

Zainab-ik commented 2 months ago

That's right. The only challenge is that MACCS and RDKit are the only descriptors present in OLS that can be annotated.
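One way to keep track of this limitation during curation is to separate the descriptors that can be annotated today from those still waiting for an ontology term. The sketch below is only an illustration: the set of annotatable descriptors comes from the comment above, the function name is hypothetical, the third descriptor in the usage example is a made-up placeholder, and ontology term identifiers are deliberately left out because they must be looked up in OLS.

```python
from typing import Iterable, List, Tuple

# Descriptor types that, per the discussion above, currently have a term in
# OLS. Term identifiers are intentionally omitted; look them up in OLS.
ANNOTATABLE_DESCRIPTORS = {"MACCS keys", "RDKit descriptors"}


def split_by_annotatability(descriptors: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split descriptors into (annotatable now, pending an ontology term)."""
    annotatable: List[str] = []
    pending: List[str] = []
    for name in descriptors:
        if name in ANNOTATABLE_DESCRIPTORS:
            annotatable.append(name)
        else:
            pending.append(name)
    return annotatable, pending


if __name__ == "__main__":
    # "some other descriptor" is a placeholder, not a claim about OLS content.
    used = ["MACCS keys", "RDKit descriptors", "some other descriptor"]
    ready, waiting = split_by_annotatability(used)
    print("annotate now:", ready)
    print("needs an ontology term:", waiting)
```

The pending list could double as the list of new terms to propose if the team decides to annotate every descriptor type.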