ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: DRKG_COVID19 #752

Closed russelljeffrey closed 4 months ago

russelljeffrey commented 12 months ago

Model Name

DRKG_COVID19

Model Description

Drug-Repurposing for COVID-19

Slug

COVID-19-Drug-Repurposing

Tag

COVID-19, Drug Repurposing Knowledge Graph

Publication

https://arxiv.org/abs/2007.10261v1

Source Code

https://github.com/gnn4dr/DRKG

License

Apache

Inyrkz commented 7 months ago

Another potential issue I noticed is that the original model code depends on the entity ID (which in turn depends on the DrugBank ID) to make predictions at the final stage.

We want our users to be able to input a SMILES string or a list of SMILES and get the edge score for each. Based on the authors' code, we will need another step in the pipeline to look up the DrugBank ID (so we can get the entity ID).

Please correct me if I'm wrong.
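
As a rough illustration of that extra step, assuming we have a DrugBank-ID-to-SMILES mapping and the DRKG entity list available as files (file names and column names below are hypothetical placeholders, not the authors' code):

```python
import pandas as pd

# Hypothetical mapping of canonical SMILES to DrugBank IDs (placeholder file and columns)
smiles_to_dbid = (
    pd.read_csv("drugbank_smiles.csv")
      .set_index("smiles")["drugbank_id"]
      .to_dict()
)

# Hypothetical entity-name -> entity-ID mapping for DRKG (placeholder file)
entity_to_id = (
    pd.read_csv("entities.tsv", sep="\t", header=None, names=["entity", "id"])
      .set_index("entity")["id"]
      .to_dict()
)

def smiles_to_entity_id(smiles: str):
    """Map a SMILES string to a DRKG entity ID via its DrugBank ID, if one exists."""
    dbid = smiles_to_dbid.get(smiles)
    if dbid is None:
        return None  # SMILES not in DrugBank, so no entity ID -- the limitation described above
    return entity_to_id.get(f"Compound::{dbid}")
```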

GemmaTuron commented 6 months ago

Hi @Inyrkz

Let's start by deciding which of the three embeddings we will keep. I'll try to detail here, step by step, how I would go about it:

  1. From the DrugBank dataset for which we can get the real embeddings, create a train/test split (20% test). I suggest saving the split in a data folder for the moment, so the next steps are easier to follow.
  2. Train three models (with Morgan fingerprints, Morgan counts and Ersilia embeddings) that, given a SMILES, output an embedding (400-dimensional). Use the training set and get the predictions on the test set (see the sketch at the end of this comment).
  3. Validate the performance of each of the embeddings by calculating the edge scores of the molecules in the test set using the original embedding and each of the predicted ones -- again, I suggest saving the results as .csv files for convenience.
  4. Once you have the edge scores of the original and each of the surrogate models, create 3 scatter plots that correlate the original edge score with each of the surrogate scores (we should see that the scores align along the diagonal of the scatter plot, more or less). Metrics like R2 can be calculated, as we do for regression models.
  5. With the best method decided, train and save a final model that, given any SMILES, outputs the 400-dimensional embedding - use the full dataset (train+test) for this final model, so we can push model performance to its maximum.

I hope this is clear!
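
As a rough sketch of steps 1 and 2 for one of the descriptors (Morgan fingerprints; the `smiles_list` and `embeddings` variables, file names and regressor choice are placeholders, not a fixed recipe):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor  # any multi-output regressor would do

def morgan_fp(smiles, radius=2, n_bits=2048):
    """2048-bit Morgan fingerprint for a single SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# smiles_list: DrugBank SMILES; embeddings: the real DRKG embeddings, shape (n_drugs, 400)
X = np.array([morgan_fp(s) for s in smiles_list])
y = np.asarray(embeddings)

# Step 1: 80/20 train/test split, saved in a data folder for the next steps
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
np.savetxt("data/X_train_morgan.txt", X_train)
np.savetxt("data/y_train.txt", y_train)

# Step 2: fit a multi-output regressor mapping fingerprints -> 400-dimensional embeddings
model = MLPRegressor(hidden_layer_sizes=(512,), max_iter=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```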

Inyrkz commented 6 months ago

Yes, this is clear.

For no. 3, since the edge score (the `scores` variable) is an array for each molecule, should I take the average of the array values, so that we have a single edge score value for each drug molecule?

GemmaTuron commented 6 months ago

If I understood the code correctly, we get an edge score for each of the disease IDs in the list (first cell of the notebook) - can you check which edge score the authors report at the end? Are they averaging?

Inyrkz commented 6 months ago

I wasn't sure how to explain what I was doing, so I'll walk through my thought process.

I analyzed the variables. The scores_per_disease variable has a length of 68. We are looking at 34 COVID-19 diseases and 2 treatments; 2 x 34 gives 68. scores_per_disease is a list containing 68 lists, and each of those lists has 8104 elements. These are the scores for each of the drugs (there are 8104 drugs) for a specific disease & treatment combo, and there are 68 possible disease & treatment combos.

The scores variable from the line `scores = th.cat(scores_per_disease)` flattens scores_per_disease into one tensor with 551,072 elements. That's 68 x 8104.

I calculated the mean score for each drug (along axis 0). I couldn't compare the scores directly with what the authors got, because they sorted their result in descending order. The first drug on their list is Compound::DB00811, with an edge score of -0.214167.

I searched for the Compound::DB00811 drug in the drug_list to find its index. The index is 211. I used the index to look up its average_score: it was -3.3036. I don't think the authors used the mean.
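
For reference, a minimal sketch of that averaging (variable names follow the notebook; `scores_per_disease` and `drug_list` are assumed to be loaded as in the authors' code):

```python
import torch as th

# Stack the 68 disease & treatment combos into a (68, 8104) matrix
score_matrix = th.stack([th.as_tensor(s) for s in scores_per_disease])

# Mean edge score per drug across the 68 combos
average_score = score_matrix.mean(dim=0)  # shape: (8104,)

# Look up one drug by its position in drug_list, e.g. index 211 for Compound::DB00811
print(average_score[211].item())
```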

The notebook is here

Inyrkz commented 6 months ago

I'll write the code to do this (maybe using the average score), then I'll modify it when we figure out how the authors compiled the final edge score result.

GemmaTuron commented 6 months ago

I am looking at it @Inyrkz; meanwhile, let's prep the code as you suggest while we figure out the value the authors are using. Thanks!

GemmaTuron commented 6 months ago

Given these two lines:

proposed_dids = dids[topk_indices]
proposed_scores = scores[topk_indices]

Could it be that they are simply giving the best edge score associated with a drug? They take the top 100 indices from the tensor of length 551,072.

Inyrkz commented 6 months ago

I had an idea last night. I just tried it, and it works. The authors are using the best edge score for each drug across all 68 combos. That means they are using the maximum score instead of the average score. I tested this on the first two drugs, Compound::DB00811 and Compound::DB00993, and it gave the same output the authors got: -0.21416780352592468 and -0.8350887298583984.
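
A minimal sketch of that check, under the same assumptions as the averaging sketch above (`scores_per_disease` and `drug_list` from the authors' notebook):

```python
import torch as th

# Stack the 68 combos into a (68, 8104) matrix, then keep the best (maximum) score per drug
score_matrix = th.stack([th.as_tensor(s) for s in scores_per_disease])
best_score = score_matrix.max(dim=0).values  # shape: (8104,)

# Compound::DB00811 sits at index 211 of drug_list (see the earlier comment);
# its best score should match the authors' reported -0.21416780352592468
print(best_score[211].item())
```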

GemmaTuron commented 6 months ago

Indeed, that is what I was thinking as well. So, we need to get all the edge scores per compound and keep the best one as the final result. Now, to compare how well our embedding does against the real embedding, we need to make sure we are comparing the score for the same edge. So I would not flatten the tensor that we get, and would compare, for example, the results for only one of the combos, if that makes sense.

Inyrkz commented 6 months ago

Are you saying we should not take the best edge score, since that may come from a different row for each drug (out of the 68 combos), and that I should instead look at only one of the combos, perhaps the first one (Disease::SARS-CoV2 E and Hetionet::CtD::Compound:Disease), to make sure the result isn't misleading?

GemmaTuron commented 6 months ago

What we want to do now is compare the results of the original embedding with the results we get when we use our embeddings (any of the three of them), so we can check that we are indeed getting good results. To that end, we do not particularly care about the "best" molecules; what we want is to make sure the real vs predicted edge scores are similar. To simplify, I suggest taking only one combo and comparing the results of the test set on that single combo (around 1500 molecules, more or less?). If you are unsure about it, ping me and we can have a quick chat!

Inyrkz commented 6 months ago

This is the notebook for the Morgan fingerprint model.

GemmaTuron commented 6 months ago

@Inyrkz good job! The plot might be easier to visualise as follows:

`plt.scatter(x=real_edge_score, y=pred_edge_score)`

This way we will see whether x and y correlate, and we can plot all the data points in a single graph.

But it looks like there is quite a difference ... let's see if that is reproduced with the rest of the embeddings.

Inyrkz commented 6 months ago

Thank you, @GemmaTuron. I've added the scatter plot to the notebook. I'll reproduce the code for the other embeddings.

Inyrkz commented 6 months ago

These are the other notebooks: Morgan Fingerprint count, Ersilia descriptor.

GemmaTuron commented 6 months ago

We'll talk tomorrow about the results and what we can do next.

Inyrkz commented 6 months ago

Alright 👍🏼


Inyrkz commented 6 months ago

I've organised the repo by adding the data you need to reproduce the code.

I've updated the morgan_fingerprint notebook to show the Pearson coefficient and regression plot. I'll also make predictions on the training set to see what they look like.

Then I'll do the same for the Morgan fingerprint count embeddings and the Ersilia embeddings.
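
For reference, one way to compute a per-dimension Pearson coefficient between the real and predicted embeddings (a sketch; `y_test` and `y_pred` stand for the (n_samples, 400) arrays from the notebook):

```python
import numpy as np
from scipy.stats import pearsonr

# Pearson correlation for each of the 400 embedding dimensions,
# comparing the true test embeddings with the surrogate predictions
pearson_per_dim = np.array(
    [pearsonr(y_test[:, i], y_pred[:, i])[0] for i in range(y_test.shape[1])]
)
print(pearson_per_dim.mean(), pearson_per_dim.min(), pearson_per_dim.max())
```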

GemmaTuron commented 6 months ago

Hmm, we are not doing great with the embedding; perhaps it is worth a try to use the XGBoost multi-output regressor?

Inyrkz commented 6 months ago

Yup, I'll do that


Inyrkz commented 6 months ago

These are the other notebooks: Morgan Fingerprint count, Ersilia descriptor.

I've updated the other notebooks too.

The next step is to try the XGBoost algorithm.

Inyrkz commented 6 months ago

I'm using XGBRegressor as the base estimator. I found the MultiOutputRegressor wrapper from scikit-learn, which allows us to extend the XGBRegressor to handle multi-output problems. The wrapper creates a separate instance of the base regressor for each target in the y_train array. It allows us to extend single-output regression models to handle multiple targets without explicitly training separate models for each target (the 400 embedding outputs).
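
A minimal sketch of this setup (the hyperparameters are illustrative, and `X_train`/`y_train`/`X_test` are the arrays from the notebook):

```python
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

# MultiOutputRegressor clones the base estimator once per target column,
# so 400 independent XGBoost regressors are fitted, one per embedding dimension
base_regressor = XGBRegressor(n_estimators=300, max_depth=6)  # illustrative hyperparameters
model = MultiOutputRegressor(base_regressor)

model.fit(X_train, y_train)       # y_train shape: (n_samples, 400)
y_pred = model.predict(X_test)    # y_pred shape: (n_samples, 400)
```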

I've done the first experiment with this, and it looks promising. The test result improved a little. The training result is good (the Pearson coefficient ranges from ~79-83), compared to the keras-tuner model (Pearson coefficient ~40-50).

The average train R-squared score is 59.15%; the average test R-squared score is 6.17%.

Since the training data is much better, it seems the model is overfitting. I’ll try experimenting more to see if I can improve the results of the model. First, let me see how the Morgan fingerprint count and ersilia embedding features perform.

The notebook is here.

GemmaTuron commented 6 months ago

Good job Ini! When you say the training data is much better, what do you mean? The training data is the same regardless of the method used.

Inyrkz commented 6 months ago

Thank you, Gemma.

I meant to say the training result. That is, the model performs better on the training set compared to the keras-tuner model.


GemmaTuron commented 6 months ago

XGBoost is working better, so we will focus on that - maybe check Optuna for hyperparameter optimization. Let's see if we can get a good model with that. We also see that we get a bimodal distribution in many cases (two clouds), so there is the option to convert the embedding to a binary (0-1) classification for each embedding value and try to predict that, which is much easier.
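
A minimal Optuna sketch along these lines (the search ranges are illustrative; `X_train`, `y_train`, `X_test`, `y_test` as in the notebooks):

```python
import optuna
from sklearn.metrics import r2_score
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

def objective(trial):
    # Illustrative search space; tune the ranges as needed
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = MultiOutputRegressor(XGBRegressor(**params))
    model.fit(X_train, y_train)
    # Average R-squared over the 400 embedding dimensions
    return r2_score(y_test, model.predict(X_test))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```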

Inyrkz commented 6 months ago

The XGBoost model trained on the Ersilia embeddings gives the best training result, although it overfits.

The only challenge here is that it takes longer to convert the SMILES to embeddings (I saved them to a file to save time here), and it takes longer to train the model (~6 hrs) without hyperparameter tuning.

Inyrkz commented 6 months ago

I've written the code to perform model fine-tuning with Optuna. The fine-tuning takes time; I'll let you know how it goes.

Inyrkz commented 6 months ago

| Model Embedding | No. of records (fine-tuning) | Fine-tuning time per trial | No. of trials | Model training time (on all records) | Total training time | Average train R-squared score | Average test R-squared score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ersilia embedding (no fine-tuning) | 7045 | - | - | ~6 hrs | ~6 hrs | 99% | -5% |
| Morgan fingerprint | 1000 | 25 mins | 5 trials | 2 hrs | ~4 hrs | 16% | 8% |
| Ersilia embedding | 500 | ~2 hrs | 3 trials | ~6 hrs | ~12 hrs | 13% | 5% |
| Morgan fingerprint count (I didn't finish training) | 7045 | ~2 hours | 3 trials | - | - | - | - |
miquelduranfrigola commented 6 months ago

Thanks @Inyrkz, this is very useful. A few comments:

Inyrkz commented 6 months ago

@miquelduranfrigola

Yes, we need all 7045 samples. No, I didn't add the `tree_method='hist'` argument. I just added these arguments for training with GPU on Google Colab:

"tree_method": "gpu_hist",  # Enable GPU acceleration
"device": "cuda",  # Specify GPU as the device

Yes, from the article I read, the number of trials needs to be set to at least 30 to get a good result.

GemmaTuron commented 6 months ago

Thanks @Inyrkz Let me know if the training times improve on Google Colab with GPU

Inyrkz commented 6 months ago

Hi @GemmaTuron,

I've updated the code in the notebooks you'll need for the training workload. I'm training on Google Colab to see if the training time improves.

The data for the Ersilia notebook is in this Google Drive folder.

Inyrkz commented 6 months ago

@miquelduranfrigola

Okay, from the warning:

UserWarning: [12:08:29] WARNING: /workspace/src/common/error_msg.cc:27: The tree method `gpu_hist` is deprecated since 2.0.0. To use GPU training, set the `device` parameter to CUDA instead.

E.g. tree_method = "hist", device = "cuda"

I'll have to update the `tree_method` to `hist` instead of `gpu_hist`.
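
That would look something like this (a sketch, assuming XGBoost >= 2.0):

```python
from xgboost import XGBRegressor

# With XGBoost >= 2.0, GPU training uses tree_method="hist" together with device="cuda",
# replacing the deprecated tree_method="gpu_hist"
model = XGBRegressor(tree_method="hist", device="cuda")
```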

Inyrkz commented 6 months ago

@GemmaTuron, the training time is better on Colab with GPU. It takes ~25 mins to train each trial using all 7045 records on the Ersilia embeddings (compared to ~2 hours per trial for only 500 records).

It may take up to 10 hours to train for 30 trials. I'm currently on the 4th optimization trial.

GemmaTuron commented 6 months ago

Hi @Inyrkz !

That's good, but due to the limits on GPU time on Google Colab it would be a pain for you to run this, as you need to keep the computer active or it will disconnect the runtime. Can you list the packages I need to install to run the notebook (and make sure all the paths point to the right GitHub folders so I won't make a mess :) ), and I'll set it up on our workstation.

Inyrkz commented 6 months ago

Yes.

The paths point to the right GitHub folders. Only these files are unavailable (for the Ersilia embeddings):

import os
import numpy as np

# `directory` is defined earlier in the notebook and points to the data folder
X_train = np.loadtxt(os.path.join(directory, 'X_train_ersilia.txt'))
y_train = np.loadtxt(os.path.join(directory, 'y_train_ersilia.txt'))
X_test = np.loadtxt(os.path.join(directory, 'X_test_ersilia.txt'))
y_test = np.loadtxt(os.path.join(directory, 'y_test_ersilia.txt'))

You'll have to download them manually from the Google Drive folder.

GemmaTuron commented 6 months ago

Hi @Inyrkz

When working on a remote workstation I cannot download files from Google Drive, as I only have access to the computer via the CLI. Can you add them to the repository? If they are too large, use git-lfs.

Thanks!

Inyrkz commented 6 months ago

Alright, I'll do that.


Inyrkz commented 6 months ago

I keep getting this error while trying to push with git-lfs:

(base) affiah@affiah-Latitude-E7450:~/Desktop/eos3nl8$ git push origin main
Uploading LFS objects: 100% (15/15), 432 MB | 0 B/s, done.                                                                                          
Enumerating objects: 62, done.
Counting objects: 100% (62/62), done.
Delta compression using up to 4 threads
Compressing objects: 100% (43/43), done.
error: RPC failed; HTTP 400 curl 22 The requested URL returned error: 400
send-pack: unexpected disconnect while reading sideband packet
Writing objects: 100% (44/44), 119.12 MiB | 816.00 KiB/s, done.
Total 44 (delta 9), reused 0 (delta 0), pack-reused 0
fatal: the remote end hung up unexpectedly
Everything up-to-date

Here's what I've tried:

- Checking the git-lfs remote configuration:

  (base) affiah@affiah-Latitude-E7450:~/Desktop/eos3nl8$ git config --get-regexp '^remote..*.git-lfs$'
  remote.origin.git-lfs https://github.com/Inyrkz/eos3nl8.git/info/lfs

- Increasing the buffer size using the command `git config http.postBuffer 524288000`

But I'm still getting the error.

GemmaTuron commented 6 months ago

Hi @Inyrkz

How big are the files? Did you track them individually in the .gitattributes file?

Inyrkz commented 6 months ago

The data is about 300 MB.

I’ve been able to resolve the issue.

Initially, I was tracking all csv, txt, npy and tsv files with git-lfs.

I cloned the repo again and tracked only the txt files. I think the problem was that the large files were still in the git history, even after adding them to git-lfs.

GemmaTuron commented 6 months ago

Hi @Inyrkz,

Great, good news. I was thinking I'll need to make some modifications to the code, since the best option will be to leave things running using the nohup command (in case you have never used it: it lets me set up runs remotely and safely disconnect without the remote end hanging up). This means we need the training script in a .py file. To make things easier, we should probably merge your code into the ersilia-os repo, to which I can push directly. Can you open a PR? Also, if you want to give converting part of your code into a .py script a try, that is very good practice for data science, moving away from notebooks and into scripts, so let me know!

Inyrkz commented 6 months ago

Yes, I can open a pull request.

For the training script, should I focus only on the model training part of the code, up to the point where I save the best model to a file? (That is, ignoring the edge score part of the code.)

And do I do this for all three embeddings?

GemmaTuron commented 6 months ago

Hi @Inyrkz

That's it: a simple .py script that trains and saves the models. I'll then set it up for running on the workstation.
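
A rough skeleton for such a script (the file name, paths and hyperparameters below are placeholders, not the actual repo layout):

```python
# train_surrogate.py -- hypothetical name: trains a surrogate embedding model and saves it
import os

import joblib
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

DATA_DIR = "data"  # placeholder path to the training files in the repo

def main():
    X_train = np.loadtxt(os.path.join(DATA_DIR, "X_train.txt"))
    y_train = np.loadtxt(os.path.join(DATA_DIR, "y_train.txt"))  # shape: (n_samples, 400)

    model = MultiOutputRegressor(XGBRegressor(n_estimators=300, max_depth=6))
    model.fit(X_train, y_train)

    joblib.dump(model, "surrogate_model.joblib")  # saved model for later scoring

if __name__ == "__main__":
    main()
```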

Inyrkz commented 6 months ago

Alright, got it

Inyrkz commented 6 months ago

I've created the training scripts.

GemmaTuron commented 6 months ago

@Inyrkz looking at the code, I do not understand why you are using the sklearn MultiOutputRegressor. That will train a separate XGBoost regressor per target (i.e. 400 of them), whereas what we want is a single model able to predict the 400 outputs, i.e. the native XGBoost regressor. You can see an example explained here: https://forecastegy.com/posts/xgboost-multi-output-regression-python/#training-xgboost-with-native-multi-output-regression-support I am referring to this line of code: `best_multioutput_regressor = MultiOutputRegressor(best_xgb_regressor)`
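
A minimal sketch of the native multi-output approach (assuming XGBoost >= 2.0, where the `multi_strategy` option is available; recent versions also accept a 2D `y_train` directly):

```python
from xgboost import XGBRegressor

# A single XGBRegressor fitted on the 2D target matrix handles all 400 outputs;
# with multi_strategy="multi_output_tree" each tree predicts every target at once
model = XGBRegressor(tree_method="hist", multi_strategy="multi_output_tree")
model.fit(X_train, y_train)       # y_train shape: (n_samples, 400)
y_pred = model.predict(X_test)    # y_pred shape: (n_samples, 400)
```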

Inyrkz commented 6 months ago

Okay, thank you for pointing it out. I'll adjust the code.

Inyrkz commented 6 months ago

@GemmaTuron I've updated the code.