ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: TDC Skin Reaction #421

Closed GemmaTuron closed 2 years ago

GemmaTuron commented 2 years ago

Model Title

Skin Reaction (TDC dataset)

Publication

Hello @alaminumar!

As part of your Outreachy contribution, we have assigned you the dataset "Skin Reaction" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.

Code

No response

alaminumar commented 2 years ago

Thanks Gemma.

alaminumar commented 2 years ago

Sorry for the lateness.

Skin Reaction dataset overview: I'm working on the Skin Reaction dataset. Exposure to chemical agents can induce an immune reaction in susceptible individuals that leads to skin sensitization. Given a drug's SMILES string, the task is to predict whether it causes a skin reaction (1) or not (0). The dataset contains 404 drugs.

Importing the dataset: I have installed the TDC package and imported the Skin Reaction dataset from its single-instance prediction toxicity datasets.
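For readers following along, a minimal sketch of this import step might look like the following. It assumes the PyTDC package is installed (`pip install PyTDC`) and uses TDC's `Tox` loader with the dataset name `Skin Reaction`. The helper name `load_skin_reaction` is hypothetical, and the code in the Colab may differ.

```python
DATASET_NAME = "Skin Reaction"  # dataset name as listed in TDC's toxicity group


def load_skin_reaction():
    """Download the Skin Reaction dataset and return it as a pandas DataFrame.

    Hypothetical helper for illustration; not taken from the Colab.
    """
    # Imported lazily so this module can be loaded without PyTDC installed.
    from tdc.single_pred import Tox

    data = Tox(name=DATASET_NAME)
    return data.get_data()
```

Calling `load_skin_reaction()` downloads the data on first use. In TDC's usual layout, the SMILES strings are in the `Drug` column and the binary labels in `Y`.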

alaminumar commented 2 years ago

Splitting the dataset: I have split the dataset into three subsets (training, validation, and test).
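As an illustration of a three-way split, here is a small self-contained sketch. TDC data loaders offer a `get_split()` method for this purpose, which is probably what the Colab uses; the function below is a stand-in written for illustration only. The 70/10/20 fractions and the seed are assumptions, not values confirmed in this thread.

```python
import random


def three_way_split(items, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle items and cut them into train / validation / test lists.

    The fractions and seed are illustrative assumptions.
    """
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = int(frac[0] * len(shuffled))
    n_valid = int(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # remainder goes to test
    }


# 404 placeholder indices standing in for the 404 drugs in the dataset
split = three_way_split(range(404))
print({name: len(rows) for name, rows in split.items()})
```

With only 404 molecules, the validation and test sets are small, so metrics computed on them can shift noticeably from one random split to another.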

alaminumar commented 2 years ago

Data visualization: I used matplotlib to visualize the number of actives (1) and inactives (0) in the dataset. As the image shows, the labels take only two values, so this is a binary classification problem. (Image: matplotlib plot)
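A rough sketch of how such a class-balance plot could be produced is shown below. The function names and the toy labels are made up for illustration; in practice the labels would come from the dataset's `Y` column. matplotlib is imported only inside the plotting function.

```python
from collections import Counter


def class_counts(labels):
    """Count inactives (0) and actives (1) in a list of binary labels."""
    counts = Counter(labels)
    return {0: counts.get(0, 0), 1: counts.get(1, 0)}


def plot_class_counts(labels):
    """Draw a bar chart of the class counts and return the figure."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    counts = class_counts(labels)
    fig, ax = plt.subplots()
    ax.bar(["inactive (0)", "active (1)"], [counts[0], counts[1]])
    ax.set_ylabel("Number of molecules")
    return fig


toy_labels = [1, 0, 1, 1, 0]  # made-up labels, not real data
print(class_counts(toy_labels))
```

Checking the counts before training is useful because a strong imbalance between actives and inactives affects how precision and recall should be read later.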

Using RDKit, we can visualize the molecular structures encoded by our SMILES strings. I imported RDKit and drew one active and one inactive molecule. (Image: rdkit)

alaminumar commented 2 years ago

@GemmaTuron, can you review what I have done? Here is my Colab.

GemmaTuron commented 2 years ago

Hi @alaminumar!

Good start, but can you provide an explanation of the model performances?

alaminumar commented 2 years ago

Okay, Gemma. First, let me explain how we obtained our models.

Model training: We train the model with the drug SMILES as input (X) and the bioactivity label as output (Y), so that it learns to predict bioactivity. We use the MorganBinaryClassifier from Lazy-QSAR for training, so we don't need to convert the SMILES into signatures ourselves; this is done automatically.

Model evaluation: To evaluate our model, we use the following.

To answer your question, Gemma: my model's performance in the first iteration was average to poor, so I decided to double the training time to 3600 seconds. My first iteration had AUROC values of 0.61128 and 0.7708 for the validation and test sets, respectively. As we can see, that's not very good. Here are the corresponding graphs and data for the second iteration.

Validation: Precision 0.7368421052631579, Recall 0.9655172413793104

Test: Precision 0.6125, Recall 1.0
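To make these numbers easier to interpret, here is a small sketch of how precision and recall follow from confusion-matrix counts. The counts are illustrative: they were chosen only because they reproduce the reported ratios and are not taken from the Colab.

```python
def precision(tp, fp):
    """Fraction of predicted actives that are truly active."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of true actives that the model recovers."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Illustrative counts (assumptions), picked to match the reported ratios.
val_tp, val_fp, val_fn = 28, 10, 1
test_tp, test_fp, test_fn = 49, 31, 0

print("validation:", precision(val_tp, val_fp), recall(val_tp, val_fn))
print("test:", precision(test_tp, test_fp), recall(test_tp, test_fn))
```

A recall of 1.0 means no active molecule was missed, while a precision of 0.6125 means that roughly 39% of the molecules predicted as active are actually inactive. A model that labels almost everything as active would produce this pattern, so the confusion matrix is worth checking alongside these two numbers.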

ConfusionMatrix_Test1

ROC_Test1

AUROC 0.7822066326530612
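For context on the AUROC value, the sketch below computes it as the probability that a randomly chosen active receives a higher score than a randomly chosen inactive, with ties counting as half. This is a plain-Python illustration with made-up scores, not the code used in the Colab.

```python
def auroc(y_true, y_score):
    """Pairwise AUROC: share of (active, inactive) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # active ranked above inactive
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auroc(y_true, y_score))  # 3 of 4 pairs ranked correctly -> 0.75
```

Under this reading, an AUROC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a value of about 0.78 means the model ranks an active above an inactive in roughly 78% of pairs.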

alaminumar commented 2 years ago

Sorry for the lateness, @GemmaTuron. I had to deal with an emergency.

Here is the updated Colab.

GemmaTuron commented 2 years ago

Hi @alaminumar

I hope everything is solved. Good job on the modelling! I'll mark this as completed, and you can move on to finalising your Outreachy application!