ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0
210 stars, 140 forks

🦠 Model Request: SARS-CoV-2 In Vitro, Touret et al. #465

Closed GemmaTuron closed 1 year ago

GemmaTuron commented 1 year ago

Model Title

SARS-CoV-2 In Vitro, Touret et al.

Publication

Hello @natividadesusana!

As part of your Outreachy contribution, we have assigned you the dataset "SARS-CoV-2 In Vitro, Touret et al." from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.

Code

No response

natividadesusana commented 1 year ago

Hello @GemmaTuron, thank you so much! I'm getting an error loading my TDC data using the specific assay assigned to me.

Could you please check my progress? https://colab.research.google.com/drive/1B5_fclOu1ehK8LsEbyGPX3ftIIvGzNhX?usp=sharing

EstherIdabor commented 1 year ago

Hi @natividadesusana, you have to grant access to your Colab by clicking Share.

natividadesusana commented 1 year ago

Oh, right! It's done, @EstherIdabor could you try again? Thanks!

Malikbadmus commented 1 year ago

@natividadesusana, the label list you are trying to retrieve from TDC has the wrong value.

EstherIdabor commented 1 year ago

@natividadesusana, you have been able to fix that, right? I think you should increase your training time rather than leaving it at the default, which is usually not enough. The model is also already at a disadvantage from being trained on an imbalanced dataset, so increasing the time should improve performance.

carcablop commented 1 year ago

Hello @natividadesusana, I see that you get an error when you try to run the confusion matrix. The error occurs when importing ConfusionMatrizDisplay. I recommend importing it like this: from sklearn.metrics import ConfusionMatrixDisplay

Then try doing it this way:

confusion_matrix_valid = confusion_matrix(Y_test, new_Y_prediction_test)

cfmatrdisp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix_valid)

I also saw a small typo where you assign a title to the plot. You have plt.ttle("ConfusionMatrix_validation"), and it should be plt.title("ConfusionMatrix_validation").

Since you have a very unbalanced dataset, it is very important to look at your confusion matrix. It will help you measure the performance of your model, decide which metric is most useful for you, and see how you could improve performance.
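As a minimal sketch of the suggestion above, the snippet below builds and plots a confusion matrix with scikit-learn. The label lists are hypothetical toy values (1 = active, 0 = inactive), not the assay data, and the "Agg" backend is only an assumption so that the example also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is required
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Hypothetical toy labels, NOT the real dataset: 1 = active, 0 = inactive.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)

# Plot the matrix with readable class names.
display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Inactive", "Active"])
display.plot()
plt.title("ConfusionMatrix_validation")
```

In a notebook, calling plt.show() after the title line would render the figure. With so few actives, the false-negative and false-positive cells are usually more informative than overall accuracy.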

carcablop commented 1 year ago

Another thing I've seen, @natividadesusana: in the ROC curve part you are using the variable "Y_disc_predict_valid", whose values depend on the threshold. ROC curves should be computed from the probability predictions, not from the thresholded results.

GemmaTuron commented 1 year ago

Hi @natividadesusana ,

Good job, @carcablop, @EstherIdabor and @Malikbadmus provided some relevant feedback for you to consider. Also look for repeated cells, such as mounting Google Colab twice. I'd suggest adding some explanation here of the results you are getting, to make your Outreachy contribution easier to follow, and closing the thread afterwards.

natividadesusana commented 1 year ago

The first step was checking out the Therapeutics Data Commons website. It gives instructions and guidelines for installing the required packages, using the functions and accessing datasets through the provided APIs. On Google Colab (using the provided LazyQSAR notebook template), I installed Therapeutics Data Commons by running !pip install PyTDC and then imported it with import tdc.

[Screenshot, 2022-10-31 at 19:24:10]

natividadesusana commented 1 year ago

This is a preview of the dataset.

[Screenshot, 2022-10-31 at 19:25:41]

natividadesusana commented 1 year ago

Then the data has to be split into train, test and validation sets.

[Screenshot, 2022-10-31 at 19:32:00]

natividadesusana commented 1 year ago

Number of molecules in the train set: 1039
Number of molecules in the test set: 297
Number of molecules in the validation set: 148

Number of inactives in the train set: 977
Number of actives in the train set: 62

Number of inactives in the test set: 278
Number of actives in the test set: 19

Number of inactives in the validation set: 141
Number of actives in the validation set: 7
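To make the imbalance in these counts explicit, here is a small sketch that computes the share of actives per split. The numbers are the ones reported in this comment; the dictionary layout and the helper name active_fraction are hypothetical, chosen only for illustration.

```python
# Counts as reported in this thread for the train/test/validation split.
splits = {
    "train": {"inactive": 977, "active": 62},
    "test": {"inactive": 278, "active": 19},
    "valid": {"inactive": 141, "active": 7},
}


def active_fraction(counts):
    """Return the share of active molecules in one split (hypothetical helper)."""
    total = counts["inactive"] + counts["active"]
    return counts["active"] / total


fractions = {name: active_fraction(counts) for name, counts in splits.items()}

for name, frac in fractions.items():
    total = sum(splits[name].values())
    print(f"{name}: {total} molecules, {frac:.1%} active")
```

Roughly 5 to 6% of the molecules in each split are active, so a classifier that always predicts "inactive" would already reach about 95% accuracy on the validation set (141 of 148). This is why accuracy alone says little here.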


natividadesusana commented 1 year ago

These are the visualizations for the datasets using matplotlib.

[Screenshot, 2022-10-31 at 19:44:41]

natividadesusana commented 1 year ago

Checking molecules to draw an active molecule and an inactive molecule using the RDKit package.

[Screenshot, 2022-10-31 at 19:48:47]

natividadesusana commented 1 year ago

Checking active molecules.

[Screenshot, 2022-10-31 at 19:53:41]

natividadesusana commented 1 year ago

Selecting a random active molecule and a random inactive molecule.

[Screenshot, 2022-10-31 at 19:58:55]

natividadesusana commented 1 year ago

Using RDKit to draw active and inactive molecules.

[Screenshot, 2022-10-31 at 20:05:25]

natividadesusana commented 1 year ago

These are the ROC curves obtained.

[Screenshot, 2022-10-31 at 20:07:04]

natividadesusana commented 1 year ago

Obtain the precision and recall of the Validation set.
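For readers following along, precision and recall can be computed with scikit-learn roughly as sketched here. The labels are hypothetical toy values (1 = active, 0 = inactive), not the notebook's validation predictions.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels, NOT the real validation set: 1 = active, 0 = inactive.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# Precision = TP / (TP + FP): of the molecules predicted active, how many really are.
precision = precision_score(y_true, y_pred)

# Recall = TP / (TP + FN): of the truly active molecules, how many were found.
recall = recall_score(y_true, y_pred)

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Both functions expect discrete class predictions, so a threshold has to be applied to the predicted probabilities before calling them.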

[Screenshot, 2022-10-31 at 20:10:41]

natividadesusana commented 1 year ago

Getting the same metrics for the test set.

[Screenshot, 2022-10-31 at 20:12:04]

natividadesusana commented 1 year ago

Obtaining the precision and recall of the Validation set.

[Screenshot, 2022-10-31 at 20:15:06]

natividadesusana commented 1 year ago

Hi @GemmaTuron, could you please help me? An error keeps occurring when I try to get a contingency table with the validation set. Thanks.

[Screenshot, 2022-10-31 at 20:17:02]

ZakiaYahya commented 1 year ago

Hi @natividadesusana, it seems you didn't define cm anywhere in your code; that's why it is giving you this error. Try to fix it using:

import matplotlib.pyplot as plt
from sklearn import metrics

contigency_table = metrics.confusion_matrix(Y_valid, Y_disc_predict_valid)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=contigency_table, display_labels=["Not Toxic", "Toxic"])
cm_display.plot()
plt.title("ConfusionMatrix_validation")
plt.savefig("ConfusionMatrix_validation.png")
plt.show()

Hope it will work for you.

DhanshreeA commented 1 year ago

Hi @natividadesusana, I just want to add a couple of things to @ZakiaYahya's very useful answer. Perhaps what you wanted to do was something like this:

from sklearn.metrics import ConfusionMatrixDisplay as cm

Moreover, I think the import from sklearn.metrics._plot.confusion_matrix is unnecessary. It looks like an internal routine that you likely do not need.


natividadesusana commented 1 year ago

Hello @ZakiaYahya, That was great and it worked! Thank you!

natividadesusana commented 1 year ago

Hello @DhanshreeA, That worked!! Thank you!!

natividadesusana commented 1 year ago

Getting a contingency table with the validation set.

[Screenshot, 2022-11-01 at 13:33:45] [Screenshot, 2022-11-01 at 13:35:47]

carcablop commented 1 year ago

Hello @natividadesusana. I suggest you check your ROC curve; I see that it is wrong. The ROC curve does not depend on a threshold value, so you should not apply a threshold before computing the ROC curve or the AUROC. Just pass the predicted probabilities ("predictprob") to the function, as follows: `fpr, tpr, _ = roc_curve(Y_valid, Y_predict_valid)`. As you can see, the variable "Y_predict_valid" is the one you get as a result of making the predictions; no threshold is applied here. Then call the function to calculate the AUROC, for example: `print("AUROC", auc(fpr, tpr))`

Remember to import the necessary packages: `from sklearn.metrics import roc_curve, auc`. Finally, draw the graph.

The values you get when you apply the threshold are the ones you should use for the confusion matrix. Apply a single threshold value, for example 0.5, and experiment with other values until you can make a good analysis of your results.
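The distinction described above can be sketched as follows: the ROC curve and AUROC are computed from the predicted probabilities, while the thresholded predictions are kept for the confusion matrix. The arrays are hypothetical toy values, and the 0.5 cut-off is just the example threshold mentioned in this comment.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Hypothetical toy data, NOT the real validation set: 1 = active, 0 = inactive.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

# ROC/AUROC: pass the probabilities directly, with no threshold applied.
fpr, tpr, _ = roc_curve(y_true, y_prob)
auroc = auc(fpr, tpr)
print("AUROC", auroc)

# Confusion matrix: apply one threshold to obtain discrete predictions.
threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
```

Changing the threshold changes the confusion matrix but leaves the AUROC untouched, because roc_curve already sweeps over all thresholds internally.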

GemmaTuron commented 1 year ago

Hello @natividadesusana !

Thanks for your contributions to Ersilia during the Outreachy Application Period! We hope you have learnt and enjoyed as much as the Ersilia team did! I will close this issue as this was part of the application period solely.