ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: DRKG_COVID19 #752

Closed russelljeffrey closed 4 months ago

russelljeffrey commented 11 months ago

Model Name

DRKG_COVID19

Model Description

Drug-Repurposing for COVID-19

Slug

COVID-19-Drug-Repurposing

Tag

COVID-19, Drug Repurposing Knowledge Graph

Publication

https://arxiv.org/abs/2007.10261v1

Source Code

https://github.com/gnn4dr/DRKG

License

Apache

GemmaTuron commented 11 months ago

Hi @russelljeffrey

This is a very interesting model, thanks. It is a complex network, and we cannot integrate it as is; we rather need to select specific parts of the network, for example the molecular embeddings it uses, or the specific repurposing model with the pretrained parameters. Can you clarify which part you referred to?

russelljeffrey commented 11 months ago

Hi @GemmaTuron, when I looked at the article and the code, the emphasis was mainly on discovering drugs for COVID-19 using neural networks in PyTorch. So I'm guessing that part takes precedence over the other parts of the network. There is also a comprehensive analysis of entity embedding similarity using the t-SNE algorithm. Please tell me if you think there are other, more important parts of the network that must be included.

GemmaTuron commented 11 months ago

Could you clarify, @russelljeffrey, whether you suggest this model because you'd like to see it in the Hub for your work, or because you want to incorporate it yourself? I would not recommend selecting this as a starter model, since it is not the simplest model to work on.

russelljeffrey commented 11 months ago

Hi again @GemmaTuron. As you have already realized, its usefulness and complexity have drawn my attention. That is why I would like to implement it myself based on its repository. It may not be easy, but I am certain I can manage.

GemmaTuron commented 11 months ago

Hi @russelljeffrey

Thanks for your interest, but Ersilia maintainers are stretched very thin and we cannot provide much support at this time. I'll keep this model on hold while you work on a single model first, and we'll take it from there. I hope you understand.

Inyrkz commented 7 months ago

@GemmaTuron, I've successfully set up the latest version of Ersilia. I want to test the original code of this model to make sure it works on my system.

Inyrkz commented 7 months ago

@GemmaTuron The network is complex, but I focused on the COVID-19 drug repurposing part of it. I had to make some adjustments to the jupyter notebook to make the code run. I've gotten the original model to run successfully on my local system.

GemmaTuron commented 7 months ago

fantastic @Inyrkz

I'll approve the model and if you want you can attempt the incorporation, make sure to follow the (multiple) steps from the Documentation

GemmaTuron commented 7 months ago

/approve

github-actions[bot] commented 7 months ago

New Model Repository Created! 🎉

@russelljeffrey, your Ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos3nl8

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

Inyrkz commented 7 months ago

@GemmaTuron, I studied the model flow. I have finished writing the main.py script to run the model, and I've been able to successfully run the Ersilia model locally using the bash script run.sh.

GemmaTuron commented 7 months ago

Hi @Inyrkz !

This model is quite complex. I've had a close look at the network embedding with Miquel, and we have decided the following:

Hopefully we can manage to build a surrogate model that predicts the anti-COVID19 potential of new drugs (not the ones already present in the embedding!)

GemmaTuron commented 7 months ago

FIRST STEP: We want to be able to pass ANY molecule as a SMILES and get its anti-COVID19 potential. Right now the network only accepts drugs that are already in the general DRKG graph, as we have just discussed. The first step will then be to get a csv file where each SMILES used in the pretrained notebook is associated with its embedding.

  1. The drugs are specified in drug_repurpose/infer_drug.tsv, but identified only by their DrugBank ID. You first need to parse the DrugBank ID and get the SMILES, using the file drugbank_info/drugbank_smiles.txt. So, basically, a csv file with two columns, DrugBankID and SMILES, for the 8100 drugs that appear in infer_drug.tsv (keeping the infer_drug.tsv order).
  2. Get the embedding corresponding to each compound. The embeddings are precalculated in the numpy array embed/DRKG_TransE_l2_entity.npy. This has almost 100k embeddings, so you need to identify which ones correspond to the drugs. Use the code in cells 6 and 7 from the notebook drug_repurpose/COVID-19_drug_repurposing.ipynb for that.
  3. Get a .csv file that has as columns DrugBankID - SMILES - Embedding (400 columns) for the 8100 SMILES in infer_drugs.tsv.

How to work collaboratively? Use the same model repository (eos3nl8), create a notebook there under code for example and add a /data folder for the .csv files. Once you think you got to step 3, push to your fork and I'll review it!

Let me know if this is clear or too much info!
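The three steps above can be sketched in pandas. The file names are the ones from the thread, but the column layouts, the `Compound::` entity prefix, and the entity-to-index map are assumptions here, with tiny in-memory stand-ins in place of the real files:

```python
import io
import numpy as np
import pandas as pd

# Toy stand-ins for drug_repurpose/infer_drug.tsv and
# drugbank_info/drugbank_smiles.txt -- the real column layouts may differ.
infer_drug_tsv = "Compound::DB00001\nCompound::DB00002\nCompound::DB00003\n"
drugbank_smiles_txt = "DB00001\tCC(=O)O\nDB00003\tc1ccccc1\n"

# Step 1: parse the DrugBank IDs (keeping infer_drug.tsv order) and map to SMILES.
infer = pd.read_csv(io.StringIO(infer_drug_tsv), sep="\t", header=None, names=["entity"])
infer["DrugBankID"] = infer["entity"].str.replace("Compound::", "", regex=False)
smiles = pd.read_csv(io.StringIO(drugbank_smiles_txt), sep="\t", header=None,
                     names=["DrugBankID", "SMILES"])
merged = infer.merge(smiles, on="DrugBankID", how="inner")

# Step 2: look up each drug's row in the pretrained entity-embedding matrix.
# In DRKG this is embed/DRKG_TransE_l2_entity.npy plus an entity->index map
# (built in notebook cells 6-7); both are synthetic stand-ins here.
entity_emb = np.random.rand(5, 4)  # real shape is roughly 100k x 400
entity_to_id = {"Compound::DB00001": 0, "Compound::DB00003": 2}
rows = [entity_emb[entity_to_id[e]] for e in merged["entity"]]
emb_cols = pd.DataFrame(rows, columns=[f"emb_{i}" for i in range(entity_emb.shape[1])])

# Step 3: one CSV with DrugBankID, SMILES, and the embedding columns.
out = pd.concat([merged[["DrugBankID", "SMILES"]].reset_index(drop=True), emb_cols], axis=1)
print(out.shape)  # (2, 6) on this toy data; (~8100, 402) on the real files
```

Note the inner merge silently drops drugs without a SMILES, which is exactly the gap discussed below in the thread.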

Inyrkz commented 7 months ago

To make sure I understand this:

Should I adjust the input parameter so that the end user only has to input the SMILES, instead of the DrugBank ID? Then, I take the SMILES passed by the user and get the corresponding DrugBank ID based on the drugbank_info/drugbank_smiles.txt file.

Also, should I account for edge cases, where I let the user know if the SMILES they input are not in the DRKG database?

For no. 3 ("Get a .csv file that has as columns: DrugBankID - SMILES - Embedding (400 columns) for the 8100 SMILES in infer_drugs.tsv"): after creating this CSV file, will I use it for anything else?

GemmaTuron commented 7 months ago

Yes; the end goal is that the user can input any smiles (not only the smiles of compounds in Drug Bank). This is what we are aiming for but it will require several steps. The first is to get this file I am suggesting in step 3, because we need to understand how the compounds available in the DRKG graph (ie. the DrugBank compounds) are being represented (embedded) in the model. Once we have that, we will be able to train a model that reproduces this embedding so we can pass a new SMILES which is not currently present in the graph. Does this make sense?

Inyrkz commented 7 months ago

Yes, this makes sense. I'll work on it.

Inyrkz commented 7 months ago

I've been able to create the CSV file. Not all the drugs in the infer_drugs.tsv file are in the drugbank_info/drugbank_smiles.txt file. There are 8104 drugs in the infer_drugs.tsv file; only 6521 drugs are common to the two files. The CSV file contains the embeddings of these drugs.

For the missing data, is there another source where I can get the SMILES from?

I also worked on another notebook where I used all the 8807 drugs in the drugbank_info/drugbank_smiles.txt file, ignoring the infer_drugs.tsv file, just in case.
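The overlap counts above come down to a plain set comparison of the parsed ID lists; a minimal sketch with toy IDs standing in for the real files:

```python
# Toy ID sets standing in for the DrugBank IDs parsed from
# infer_drugs.tsv and drugbank_smiles.txt.
infer_ids = {"DB00001", "DB00002", "DB00003", "DB00004"}
drugbank_ids = {"DB00002", "DB00004", "DB00005"}

common = infer_ids & drugbank_ids          # have both an embedding row and a SMILES
missing_smiles = infer_ids - drugbank_ids  # in infer_drugs but no SMILES available
extra_smiles = drugbank_ids - infer_ids    # have a SMILES but not in infer_drugs

print(len(common), len(missing_smiles), len(extra_smiles))  # 2 2 1
```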

Inyrkz commented 7 months ago

smiles embeddings preview

This is what it looks like.

GemmaTuron commented 7 months ago

Hi @Inyrkz !

Fantastic, a few questions:

GemmaTuron commented 7 months ago

Then we can move on to STEP 2 (this is where things get complicated). We need a surrogate model that allows us, given any SMILES, to obtain the 400-dimensional embedding available in DRKG_TransE_l2_entity.npy. This means a multi-regressor where the X is a SMILES and the Y is the embedding. Let me break it down into pieces:

  1. Featurise the SMILES: the SMILES string itself is not useful for ML tasks, so we need to convert it. We can start by using a simple Morgan fingerprint (calculated with RDKit; see model eos4wt0 from Ersilia).
  2. Train a Keras Tuner model that, given a Morgan fingerprint, can predict the 400-dimensional embedding. For inspiration, see this piece of code. Use a train/test split to evaluate the model performance.
  3. Save the final model and try to get an embedding for a new SMILES!

Feel free to reuse as much of my code as needed, and to try different keras-tuner parameters.

I know this is quite complex, so take it slow. We can meet on Monday to see how far you've gotten and discuss next steps.
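A dependency-light sketch of the fingerprint-to-embedding multi-regressor setup described above. Random binary vectors stand in for real Morgan fingerprints, a closed-form ridge regression stands in for the Keras Tuner network, and all sizes are scaled down from the real ~8800 molecules, 2048-bit fingerprints, and 400-dimensional targets:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: n "molecules" with sparse binary Morgan-like fingerprints (X)
# and dense DRKG-like embeddings as the regression targets (Y).
n, fp_bits, emb_dim = 1000, 256, 16
X = (rng.random((n, fp_bits)) < 0.1).astype(float)
Y = rng.standard_normal((n, emb_dim))

# Train/test split, as suggested in the thread.
idx = rng.permutation(n)
tr, te = idx[:800], idx[800:]

# Multi-output ridge regression in closed form: W = (X'X + aI)^-1 X'Y.
# This linear baseline only illustrates the X (fingerprint) -> Y (embedding)
# shape of the problem; the actual work uses a tuned Keras network.
alpha = 1.0
W = np.linalg.solve(X[tr].T @ X[tr] + alpha * np.eye(fp_bits), X[tr].T @ Y[tr])

mse_train = np.mean((X[tr] @ W - Y[tr]) ** 2)
mse_test = np.mean((X[te] @ W - Y[te]) ** 2)
print(round(mse_train, 3), round(mse_test, 3))
```

With purely random targets the test error stays near the target variance while the train error dips below it, which is a useful sanity check that the reported train/test MSE gap later in the thread is of the expected kind.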

Inyrkz commented 7 months ago
  • The 8800 SMILES are all in the DRKG_TransE_l2_entity.npy?

Yes, the 8800 SMILES are all in the DRKG_TransE_l2_entity.npy. I created another CSV file for them. The notebook is here. There are 8807 SMILES. I can't push the CSV file to GitHub because of the file size.

Inyrkz commented 7 months ago
  • The SMILES come from DrugBank (https://go.drugbank.com/) Unfortunately there is no straightforward API to interact with it. Could you manually check a few SMILES that: a) are in drugbank_smiles.txt but not in infer_drugs.txt, b) are in infer_drugs.txt but do not have a SMILES associated? To see if we pick something (for example, they relate to antibodies not small molecules)

I'll work on this.

There are 2286 SMILES in the drugbank_smiles that are not in the infer_drugs.txt. And there are 1583 SMILES in the infer_drugs.txt that are not in drugbank_smiles.txt.

Inyrkz commented 7 months ago

  • Then we can move onto STEP 2 (this is where things get complicated) [...] we can meet on Monday to see how far you've gotten and discuss next steps

Sounds interesting. I'll give it a try.

GemmaTuron commented 7 months ago
  • The SMILES come from DrugBank (https://go.drugbank.com/) Unfortunately there is no straightforward API to interact with it. Could you manually check a few SMILES that: a) are in drugbank_smiles.txt but not in infer_drugs.txt, b) are in infer_drugs.txt but do not have a SMILES associated? To see if we pick something (for example, they relate to antibodies not small molecules)

I'll work on this.

There are 2286 SMILES in the drugbank_smiles that are not in the infer_drugs.txt. And there are 1583 SMILES in the infer_drugs.txt that are not in drugbank_smiles.txt.

Yes, I understood that. I am asking if you could go to the drugbank website and manually compare a few smiles that are not present in the drugbank smiles for example to understand if they might not be small molecules, maybe antibodies or combination therapies. Can you create a list of the id's for which we don't have smiles?

Thanks!

Inyrkz commented 7 months ago

Yes, that's what I'm doing.


Inyrkz commented 7 months ago

Here are my observations.

What we need is on this page (https://go.drugbank.com/data_packages/drug_repurposing/data_modules/drugs). The two tables we need are Accession Numbers and Drug Calculated Properties. The Accession Numbers table, with 915910 rows, has the columns ID (e.g. 0) and Number (e.g. DB00001). The Drug Calculated Properties table, with 12688 rows, has the columns drug_id (e.g. 0) and SMILES. We can merge the two tables on ID and drug_id to extract the SMILES.

Only a preview is available.
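The merge described above can be sketched in pandas with toy previews of the two tables; since only a preview of the real tables is visible, the exact schemas and values here are assumptions:

```python
import pandas as pd

# Toy previews of the two DrugBank data-package tables described above.
accession = pd.DataFrame({"ID": [0, 1, 2],
                          "Number": ["DB00001", "DB00002", "DB00003"]})
properties = pd.DataFrame({"drug_id": [0, 2],
                           "SMILES": ["CC(=O)O", "c1ccccc1"]})

# Join accession numbers to calculated properties on ID == drug_id;
# drugs without a calculated-properties row drop out of the inner join.
merged = accession.merge(properties, left_on="ID", right_on="drug_id", how="inner")
print(merged[["Number", "SMILES"]].to_dict("list"))
```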

Inyrkz commented 7 months ago

While doing a manual search for the drug DB13633 using the link, I discovered that it is an experimental drug. I also saw this: “This drug entry is a stub and has not been fully annotated. It is scheduled to be annotated soon.” It is a small molecule.

Some of the drugs without SMILES are in either the experimental or the investigational group category. The ones in the approved group simply have missing SMILES information.

Inyrkz commented 7 months ago

I also noticed that if you reference other sites like ChEMBL, you will find the canonical SMILES for the compound. For example, for drug ID DB00109, if you check the External Links section of the page, you'll see other references to the drug; if you go to the ChEMBL link, you will find the SMILES of the drug. You can also find the SMILES in the Wikipedia source. But the SMILES on the two pages do not match. The type isn't a small molecule; the type is Biotech.

Inyrkz commented 7 months ago

The SMILES of some drugs are available. For example, the SMILES for drug ID DB03518 is available, and it matches the one on ChEMBL. It is a small molecule.

Inyrkz commented 7 months ago

This CSV file contains the Drug IDs that don't have SMILES.

GemmaTuron commented 6 months ago

Thanks @Inyrkz !

I will check the drug IDs and share more about them in the meeting tomorrow for everyone. For the part about building a Keras Tuner model, let's meet to discuss!

Inyrkz commented 6 months ago

@GemmaTuron, I've finished training a model to predict the SMILES embeddings of a drug, using Morgan Fingerprint as a descriptor. The notebook is here.

I used the CSV file with all the drugs (8807) from the drugbank_smiles.txt file, so we have more data for training.

There are 7045 records in the training set and 1762 in the test set.

The train mean squared error is: 0.1887
The test mean squared error is: 0.2246

GemmaTuron commented 6 months ago

Hi @Inyrkz

That looks quite good, did you have to do any modification to the keras tuner class for the multi regressor?

Inyrkz commented 6 months ago

Yes, I only changed the directory names. Every other part worked fine.

GemmaTuron commented 6 months ago

Fantastic. Now, we used a very simple fingerprint. It would be good to evaluate how other kinds of fingerprints work. We must remember this is just one of the steps of the final model, so we cannot choose a very large fingerprint or it will maybe take too long to calculate. Could you evaluate & compare the performances of the models if we use:

Inyrkz commented 6 months ago

Sounds good

Inyrkz commented 6 months ago

@GemmaTuron Using the Morgan fingerprint counts, here are the mean squared errors.

Train Loss: 0.20003868639469147
Test Loss: 0.23136216402053833

It's not as good as the Morgan Fingerprint descriptor.

I'm working on using the Mordred descriptors.

GemmaTuron commented 6 months ago

Hi @Inyrkz !

How are you doing with the ersilia embeddings? Once you have the three models (morgan, morgan_counts and eosce) we should compare their performance in the Knowledge graph. For that, we should take the code in the notebook COVID-19_drug_repurposing.ipynb and, instead of passing the original embeddings, calculate the embeddings using your model and pass them. We then can compare the scores they get with the original embeddings and the scores we get with our embeddings. Does this make sense? Let me know once you are working on that if you need further help!

Inyrkz commented 6 months ago

Hi @GemmaTuron,

For the Ersilia embeddings, this is the result I got:

Train Loss (Mean Squared Error): 0.21272526681423187
Test Loss (Mean Squared Error): 0.22930985689163208

Inyrkz commented 6 months ago

  • Once you have the three models (morgan, morgan_counts and eosce) we should compare their performance in the Knowledge graph. [...] Does this make sense?

Yes, this makes sense.

GemmaTuron commented 6 months ago

Perfect, so let's see how these three models do when we use them in the network. Let's start with only one model (the one you prefer) and try to modify the Colab notebook to, given a SMILES, convert it to the embedding and calculate its distance to the COVID-related terms specified in the notebook. Once we can do that, we will think of a few comparisons (for example, training a model on the lowest-scoring molecules and predicting the values for the top-scoring ones, to make things a bit more "difficult" and really assess how well the model for the embeddings is doing).

Inyrkz commented 6 months ago

Okay, I'll start with the Morgan Fingerprint model (it was the fastest).

For the distance to the COVID-related terms, is that the edge score (the metric they used in the notebook)?

GemmaTuron commented 6 months ago

Yes, that's it: the edge score is a measure of whether two nodes in the knowledge graph are close to one another.
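For intuition, the edge score in a TransE (L2) model like the pretrained DRKG one is essentially a margin gamma minus the L2 distance between head + relation and the tail embedding. A toy numpy sketch; the gamma value and the dimensionality here are placeholders, and the real notebook loads these from the trained model:

```python
import numpy as np

def transe_l2_edge_score(head, rel, tail, gamma=12.0):
    """TransE (L2) edge score: higher means the (head, relation, tail)
    triple is more plausible. gamma is an assumed margin here."""
    return gamma - np.linalg.norm(head + rel - tail)

rng = np.random.default_rng(1)
dim = 8  # the real DRKG embeddings are 400-dimensional

drug = rng.standard_normal(dim)    # e.g. a compound entity embedding
treats = rng.standard_normal(dim)  # e.g. a "treats" relation embedding
# A disease whose embedding the drug "translates" onto almost exactly...
disease = drug + treats + 0.01 * rng.standard_normal(dim)
# ...versus an unrelated entity.
unrelated = rng.standard_normal(dim)

print(transe_l2_edge_score(drug, treats, disease) >
      transe_l2_edge_score(drug, treats, unrelated))  # True
```

Ranking drugs by this score against the COVID-related disease nodes is what produces the top-100 lists discussed below.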

Inyrkz commented 6 months ago

@GemmaTuron I've made progress in using all three descriptors in predicting the edge scores. The notebook is here.

I could only get the scores as a list rather than a single value. A part of the authors' code is confusing, because they used the entity ID in the final step to get a single edge score (cells 10–12). I stopped at cell 9, using scores = th.cat(scores_per_disease) as the output. I'm not sure how to approach that part using only the SMILES as input.

For the next part, I want to use R squared to compare the original scores with the scores of each descriptor model to see how similar they are. My hunch is that the Morgan fingerprint counts model is the closest to the original scores. I want to be sure about this edge score issue before proceeding.

I'm trying to use only one SMILES as input to see what the scores output will look like. I'm getting some errors, but I'm working on them.

Inyrkz commented 6 months ago

@GemmaTuron, I have updated the notebook..

It outputs the top 100 drugs and their edge scores.

I get an error when running the Ersilia embedding section. The error is new: it seems it can only output the top 68 before raising an IndexError.

GemmaTuron commented 6 months ago

Hi @Inyrkz

I need to have a closer look but on a quick note, seems here:

# predict the embeddings of the Ersilia descriptor
ersilia_embeddings = model.predict(ersilia_descriptor[0])
print(ersilia_embeddings)

you are only generating embeddings for 68 molecules, so it does not find more than 68 and hence the error? I could be wrong though!

Inyrkz commented 6 months ago

I'll have a look at it again tomorrow.


Inyrkz commented 6 months ago

I've been able to fix the issue. It was the shape of the embeddings.

GemmaTuron commented 6 months ago

Hi @Inyrkz !

At which stage are you? After fixing the embedding size, how is the model working? Are we able to get results similar to the ones from the original model?

Inyrkz commented 6 months ago

Hi @GemmaTuron,

I've completed the models for all three embeddings. The models take the input drug data and give the top 100 drugs (with their edge scores) that could be used to fight COVID-19, like the original model.

I noticed that the different embeddings give different top-100 drugs and edge scores, and these differ from what the original model predicts.

I'm not sure how to go about comparing each embedding's results to the original embedding's results. I was thinking of making predictions for all the drugs instead of only the top 100, so it will be easier to see which embedding (Morgan fingerprint, Morgan fingerprint counts, or Ersilia) gives the closest result to the original model.
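One way to run that comparison, sketched on synthetic score vectors (the surrogate scores here are just the "original" scores plus noise, an assumption for illustration): compute R squared between the original and surrogate edge scores over all drugs, plus the overlap of their top-100 rankings:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins: edge scores for all drugs from the original embeddings,
# and from one surrogate model (original plus noise, for illustration).
original = rng.standard_normal(500)
surrogate = original + 0.3 * rng.standard_normal(500)

# R squared, treating the surrogate scores as predictions of the originals.
ss_res = np.sum((original - surrogate) ** 2)
ss_tot = np.sum((original - original.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Overlap of the two top-100 drug rankings (argsort is ascending,
# so the last 100 indices are the highest-scoring drugs).
top_orig = set(np.argsort(original)[-100:])
top_surr = set(np.argsort(surrogate)[-100:])
overlap = len(top_orig & top_surr)

print(round(r2, 3), overlap)
```

Running this per surrogate (Morgan fingerprint, Morgan fingerprint counts, Ersilia) would give one R squared and one top-100 overlap per model, making "closest to the original" a concrete number.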