ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: DRKG_COVID19 #752

Closed russelljeffrey closed 5 months ago

russelljeffrey commented 1 year ago

Model Name

DRKG_COVID19

Model Description

Drug-Repurposing for COVID-19

Slug

COVID-19-Drug-Repurposing

Tag

COVID-19, Drug Repurposing Knowledge Graph

Publication

https://arxiv.org/abs/2007.10261v1

Source Code

https://github.com/gnn4dr/DRKG

License

Apache

GemmaTuron commented 7 months ago

Hi @Inyrkz

I have refactored your scripts to reuse the code through some classes. Please look carefully, because I've done it fairly quickly and I might have made some mistakes. The three files I am using to run the XGBoost training are:

Let me know what you think of this code!

Gemma

Inyrkz commented 7 months ago

Wow, that's cool. I'll check it.

Inyrkz commented 7 months ago

The code looks good

GemmaTuron commented 7 months ago

Hi @Inyrkz , I've updated the results of the run in /checkpoints/models. I did not run any predictions, but everything should work!

Inyrkz commented 7 months ago

@GemmaTuron, I can't find the XGBoost models in the /checkpoints directory.

GemmaTuron commented 7 months ago

sorry, solved

Inyrkz commented 7 months ago

Thanks

Inyrkz commented 7 months ago
| Model Embedding | Average Train R-squared score (DRKG) | Average Test R-squared score (DRKG) | Edge Score Test R-squared score | Edge Score Test Mean Squared Error |
|---|---|---|---|---|
| Morgan fingerprint | 40% | 10% | -7.3807% | 6.9617 |
| Morgan fingerprint count | 39% | 12% | -7.5576% | 7.1087 |
| Ersilia embedding | 67% | 9% | -8.0465% | 7.5148 |

I've evaluated the fine-tuned XGBoost models. The performance on the test set hasn't improved. The table above shows the results of evaluating the DRKG embedding predictions and the edge score predictions.

GemmaTuron commented 7 months ago

Ok, thanks. We will give it one last try, increasing the Optuna optimization to 100 rounds. I'll let you know when the models are updated in the repo!

miquelduranfrigola commented 7 months ago

What is the current status?

GemmaTuron commented 7 months ago

Hi @Inyrkz and @miquelduranfrigola, the new models trained with 100 rounds are uploaded to the repository. Ini, they are quite heavy, so you might want to try running them directly in Codespaces if your connection is unstable.

Inyrkz commented 7 months ago

Thank you for the suggestion. I'll evaluate them using Codespaces.

Inyrkz commented 7 months ago

The kernel keeps crashing when I want to evaluate the morgan fingerprint model. This is the error message: The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. View Jupyter [log](command:jupyter.viewOutput) for further details.

This is the log output:

10:40:32.615 [info] Kernel acknowledged execution of cell 7 @ 1705315232615
10:40:32.654 [info] End cell 7 execution after 0.039s, completed @ 1705315232654, started @ 1705315232615
10:40:32.657 [info] Kernel acknowledged execution of cell 8 @ 1705315232657
10:40:38.551 [error] Disposing session as kernel process died ExitCode: undefined, Reason: 
10:40:38.551 [info] Dispose Kernel process 9207.
10:40:39.432 [info] End cell 8 execution after -1705315232.657s, completed @ undefined, started @ 1705315232657

I'll try to download the models and evaluate them locally.

Inyrkz commented 7 months ago

My internet was unstable so I used Google Colab to evaluate the models.

The table below shows the results of the 100-trial XGBoost models:

| Model Embedding | Average Train R-squared score (DRKG) | Average Test R-squared score (DRKG) | Notebook |
|---|---|---|---|
| Morgan fingerprint | 48% | 12% | link |
| Morgan fingerprint count | 45% | 12% | link |
| Ersilia embedding | 75% | 9% | link |

There was only a slight improvement in the models' performance.

miquelduranfrigola commented 7 months ago

OK, thanks @Inyrkz. This is so challenging.

@GemmaTuron, @Inyrkz - what is your opinion? Should we freeze/abandon this model? This seems to be very challenging, unfortunately, and I think that we've hit a dead end with the multioutput regression. The good thing is that we've learned quite a bit about multioutput XGBoost and Optuna, which we will use for sure.

If we want to "rescue" this model, I have one proposal based on similarity searches, but I am not sure how much extra effort we want to invest. Please let me know and we can discuss.

Inyrkz commented 7 months ago

How will the similarity search work?

You also suggested binarizing the embeddings, to convert them to 1s and 0s (the only issue is that we would lose some valuable information).
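As a rough illustration of the binarization idea, here is a minimal sketch assuming a simple threshold at zero; the threshold choice and the toy values are assumptions, not something decided in this thread:

```python
import numpy as np

# Hypothetical (truncated) DRKG embedding vectors for two molecules;
# the real embeddings have 400 dimensions.
embeddings = np.array([
    [0.12, -0.53, 0.07, -0.01],
    [-0.40, 0.22, -0.08, 0.91],
])

# Binarize: 1 where the value is positive, 0 otherwise.
# Thresholding at zero is an assumed choice; the sign is kept but the
# magnitudes are lost, which is the information loss mentioned above.
binary_embeddings = (embeddings > 0).astype(int)
print(binary_embeddings)
```

Other thresholds (e.g. the per-dimension median) would be equally valid and change what information survives the binarization.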

miquelduranfrigola commented 7 months ago

It will be more efficient to discuss online. @GemmaTuron should we schedule a dedicated short meeting to reach a decision?

GemmaTuron commented 7 months ago

@miquelduranfrigola sure, maybe the week of the 29th Jan we can talk about this and close it

GemmaTuron commented 6 months ago

Is this the peer-reviewed paper? https://www.nature.com/articles/s41598-023-30095-z

Inyrkz commented 6 months ago

> Is this the peer-reviewed paper? https://www.nature.com/articles/s41598-023-30095-z

I'm not sure. The paper looks interesting. This is the paper I started with: https://arxiv.org/abs/2007.10261

Inyrkz commented 5 months ago

@GemmaTuron @miquelduranfrigola

For the DRKG model, these are the notebooks. Model using 3 nearest neighbors: https://github.com/Inyrkz/eos3nl8/blob/main/model/framework/code/ersilia_knn_3n.ipynb

Model using 1 nearest neighbor: https://github.com/Inyrkz/eos3nl8/blob/main/model/framework/code/ersilia_knn_1n.ipynb

miquelduranfrigola commented 5 months ago

Hello @Inyrkz

This is great work, although (on a quick look) I don't think you are evaluating performance correctly. Could you qualitatively describe how you evaluate performance?

I think that, generally, it would be much easier if you used the k-neighbors regressor, as opposed to averaging the neighbors yourself: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

Thanks!

miquelduranfrigola commented 5 months ago

@Inyrkz - on a related note: I want this work to be as useful as possible. Could you please share with me a dataframe containing the known embeddings for the >8000 drugs or so? A CSV file will be ok: first column, smiles; next 400 columns, embedding.

I want to use this as part of the pharmacogx-embeddings work that we are doing. Highly relevant stuff for Ersilia. Since in that project we are merely dealing with approved drug molecules, the known embeddings will suffice (i.e. no prediction is necessary).

Could you do this, @Inyrkz ? Thanks so much!

Inyrkz commented 5 months ago

Thanks for the feedback @miquelduranfrigola. I'll make use of K-neighbors regressor.

Inyrkz commented 5 months ago

Yes, I have the files you need.

These are the links

  1. https://github.com/Inyrkz/eos3nl8/blob/main/model/framework/code/data/train.csv
  2. https://github.com/Inyrkz/eos3nl8/blob/main/model/framework/code/data/test.csv
Inyrkz commented 5 months ago

For the performance evaluation:

Mean Squared Error: I am comparing the y_test (original output embeddings) to the average_outputs (the prediction of the model, which is the average of the three closest output embeddings). I pass those two values (2D arrays) to the mean_squared_error function.

I did the same thing for the R-squared score and cosine similarity.

For the Euclidean distance, I flattened the arrays first before passing them to the distance.euclidean() function.

The visualizations are what we used before, where we do a row-wise comparison of the 400 output embeddings for a drug molecule. We check how similar the original embeddings are compared to the predicted embeddings.

Then we did a column-wise comparison, where we compare the first (second & third) embedding (original & predicted) of all the drugs in the test set.
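The evaluation procedure described above can be sketched as follows; the variable names (`y_test`, `average_outputs`), shapes, and the synthetic stand-in data are assumptions based on the description, not the actual notebook code:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from scipy.spatial import distance

rng = np.random.default_rng(0)

# Stand-ins for the real data: n test molecules, 400-dim output embeddings.
n, dim = 10, 400
y_test = rng.normal(size=(n, dim))           # original output embeddings
average_outputs = rng.normal(size=(n, dim))  # average of the 3 closest output embeddings

# MSE and R-squared are computed directly on the 2D arrays, as described.
mse = mean_squared_error(y_test, average_outputs)
r2 = r2_score(y_test, average_outputs)

# scipy's euclidean distance takes 1D inputs, hence the flattening.
euc = distance.euclidean(y_test.flatten(), average_outputs.flatten())

print(mse, r2, euc)
```

Note that `r2_score` averages over the 400 output dimensions by default (`multioutput="uniform_average"`).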

miquelduranfrigola commented 5 months ago

> Yes, I have the files you need.
>
> These are the links
>
> 1. https://github.com/Inyrkz/eos3nl8/blob/main/model/framework/code/data/train.csv
> 2. https://github.com/Inyrkz/eos3nl8/blob/main/model/framework/code/data/test.csv

Thanks so much @Inyrkz - this is exactly what I need. I will definitely use them in the pharmacogx-embeddings work.

miquelduranfrigola commented 5 months ago

Hi @Inyrkz

Thanks for the explanation above - very useful.

So, let me summarize what I think we should do and let's be sure this is what you are doing:

  1. Using the kNN regressor (k=1 or k=3), predict, for each molecule in the test set, an array of 400 values. If the test set has "n" molecules, then you should have a y_hat matrix of nx400. Correspondingly, the y_true is nx400 too.
  2. Then, you can evaluate R2, MAE or whatever regression metric using the Sklearn functions. For example: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html already naturally handles multioutput y_true and y_hat. It is multioutput because we have 400 dimensions.

At this stage, let's not focus too much on the euclidean distance front.

About the column-wise comparison, simply pick 5 random columns of the 400 and plot their y_hat[:, j] vs y_true[:, j]. These plots should have "n" points each.
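A minimal sketch of this workflow with scikit-learn, using synthetic data in place of the real fingerprints and embeddings (the shapes, sizes, and variable names are assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_absolute_error

rng = np.random.default_rng(42)

# Synthetic stand-ins: input features (X) and 400-dim output embeddings (y).
X_train, y_train = rng.normal(size=(100, 64)), rng.normal(size=(100, 400))
X_test, y_true = rng.normal(size=(20, 64)), rng.normal(size=(20, 400))

# k=3 nearest-neighbor regressor; it handles the 400-output case natively
# by averaging the neighbors' output vectors.
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_train)
y_hat = knn.predict(X_test)  # shape (n, 400)

# Multioutput regression metrics, evaluated directly on the nx400 matrices.
print("R2:", r2_score(y_true, y_hat))
print("MAE:", mean_absolute_error(y_true, y_hat))

# Column-wise comparison: pick 5 random columns of the 400.
# In a notebook, one would scatter-plot y_true[:, j] vs y_hat[:, j];
# here we just print a per-column correlation as a stand-in.
cols = rng.choice(400, size=5, replace=False)
for j in cols:
    print(j, np.corrcoef(y_true[:, j], y_hat[:, j])[0, 1])
```

With real data, the column-wise scatter plots would have "n" points each, one per test molecule.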

Inyrkz commented 5 months ago

@miquelduranfrigola, here is the update.

| k-neighbors | R-squared score | MSE | RMSE | MAE | Notebooks |
|---|---|---|---|---|---|
| 1 | -0.602214097751973 | 0.3913663663557547 | 0.6242192872456808 | 0.4596237749567715 | link |
| 3 | -0.10868162736446814 | 0.2708968508285105 | 0.5193920744203449 | 0.4180302636660015 | link |

miquelduranfrigola commented 5 months ago

Thanks @Inyrkz - it seems to me that we are hitting a dead end. Do you agree?

Inyrkz commented 5 months ago

I agree


miquelduranfrigola commented 5 months ago

Ok, thanks @Inyrkz . The good news is that I will be able to use this for pharmacogenomics! So it was certainly not lost time, I promise. It was worth the effort.

@DhanshreeA we need to come up with a way of keeping track of "abandoned" aka archived models, especially if we got valuable insights and work from them. Let's discuss

GemmaTuron commented 5 months ago

We have done a lot of work here! I will archive the model repository (eos3nl8) and add it to the list of archived models - let's see if we can rescue it eventually.