ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: HIA #425

Closed GemmaTuron closed 2 years ago

GemmaTuron commented 2 years ago

Model Title

HIA (TDC dataset)

Publication

Hello @pauline-banye !

As part of your Outreachy contribution, we have assigned you the dataset "HIA" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.

Code

No response

paulinebanye commented 2 years ago


On it @GemmaTuron!

paulinebanye commented 2 years ago

Hi @GemmaTuron, my apologies for the delayed response. I have been under the weather for a couple of days but I'm getting much better. My work on this has been sporadic but I would like to provide an update.

OVERVIEW

Theoretical models for the prediction of absorption, distribution, metabolism and excretion (ADME) properties play extremely important roles in support of the drug development process.

I was tasked with building a binary classification model for HIA. Human intestinal absorption is the ability of the human gastrointestinal system to absorb orally administered drugs into the bloodstream. HIA is a vital factor influencing the transportation of drugs to the intended targets in the body.

From my understanding, the primary goal of predicting HIA is to measure human oral absorption, which is a determining factor in early drug discovery. Due to the complexity of drug absorption, it's challenging to analyse and build useful statistical models for diverse sets of pharmaceuticals.

TASKS

Install Therapeutics Data Commons and load the dataset. The Therapeutics Data Commons (TDC) is a resource comprising ML tasks, ML-ready datasets, and curated benchmarks to support the development of artificial intelligence for drug discovery. HIA is classified as a single-instance prediction absorption, distribution, metabolism and excretion (ADME) dataset in TDC. To begin working with the data, I split the dataset into training, validation and test sets using the get_split function.

Then I saved the split datasets in my Google Drive as .csv files using the pandas package.
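The split step can be sketched in pure Python. This is only an illustration of a 70/10/20 random split; TDC's get_split does this internally, and its exact set sizes may differ slightly depending on rounding and split method. The helper name split_indices is ours, not TDC's.

```python
import random

def split_indices(n, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle n indices and cut them into train/valid/test portions."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * frac[0])
    n_valid = int(n * frac[1])
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

# 578 molecules, as in the HIA dataset
train, valid, test = split_indices(578)
print(len(train), len(valid), len(test))  # 404 57 117
```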

paulinebanye commented 2 years ago

The next step involved analysing the data. I observed that the HIA dataset from TDC comprises a total of 578 molecules, 500 of which are active and 78 inactive.

For further investigations, I used the matplotlib package to plot the data outcomes. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. I utilised this library to create a bar graph and pie chart for visual interpretation of the resulting data.

From the results of this dataset, I observed that:

  1. The dataset is divided into 2 distinct classes, so this is a classification problem rather than a regression problem.
  2. A vast majority of the molecules in the HIA dataset are in an active state.
  3. Probability of 0: the probability that a molecule is inactive in the given assay.
  4. Probability of 1: the probability that a molecule is active in the given assay.
  5. From the charts, it is visible that in this train set the number of active molecules far outweighs the inactive molecules.
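The class balance can be checked directly from the counts reported above (500 active, 78 inactive):

```python
# Counts reported for the full HIA dataset
total, active, inactive = 578, 500, 78
assert active + inactive == total

active_fraction = active / total
inactive_fraction = inactive / total
print(round(active_fraction, 3), round(inactive_fraction, 3))  # 0.865 0.135
```

This roughly 6.4:1 imbalance is worth keeping in mind later, since it affects how metrics such as accuracy should be read.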
paulinebanye commented 2 years ago

I also created graphical representations of active and inactive molecules using the RDKit Python package, which lets us handle chemical structures and calculate their molecular properties. I used RDKit to generate visual representations of the first 3 active and inactive molecules.

active molecules

inactive molecules
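A sketch of the RDKit drawing step. The SMILES strings below are hypothetical examples, not the actual first three molecules from the dataset, and the import is guarded so the sketch degrades gracefully if RDKit is not installed.

```python
# Hypothetical SMILES examples (not the dataset's actual molecules)
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]

try:
    from rdkit import Chem
    from rdkit.Chem import Draw
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    img = Draw.MolsToGridImage(mols, molsPerRow=3)  # grid image of the molecules
except ImportError:
    img = None  # RDKit not installed; skip drawing
```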

paulinebanye commented 2 years ago

The next step involved training the model. The data comprises the Drug ID, the Drug (a SMILES string) and Y, which indicates the active or inactive state. I prepared lists of the SMILES and Y values for the train, validation and test datasets. The model was trained using the SMILES and the value of Y (activity) as inputs.


I installed LazyQsar from GitHub and imported the MorganBinaryClassifier class. LazyQsar is a library that enables us to build models quickly. It is extremely convenient as it automatically converts our SMILES into vectors that the model can interpret. Training teaches the model to associate each vector with its outcome, so the model can correctly classify molecules as active or inactive even when new data is introduced.

This process of training, where we input the SMILES and labels and the LazyQsar package automatically converts the SMILES to Morgan fingerprints, is called FITTING.

The steps I took were:

The trained model was saved as a joblib file to avoid retraining if the model is needed subsequently. Joblib files provide better performance and reproducibility when working with massive datasets or long-running jobs.
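The fit-and-save workflow might look roughly like this. The lazyqsar import path is an assumption based on the description above, the SMILES strings and labels are toy values, and pickle stands in for joblib so the sketch needs no extra packages; a majority-class stub keeps it runnable even without LazyQsar installed.

```python
import pickle  # stand-in for joblib, which the notebook uses

# Toy training data (hypothetical SMILES and labels)
train_smiles = ["CCO", "CCN", "c1ccccc1O", "CC(=O)OC"]
train_y = [1, 0, 1, 1]

class MajorityClassifier:
    """Fallback stub: always predicts the majority class seen in fit."""
    def fit(self, X, y):
        self.majority = max(set(y), key=list(y).count)
        return self
    def predict(self, X):
        return [self.majority] * len(X)

try:
    # Assumed import path; LazyQsar featurises SMILES into Morgan
    # fingerprints internally before fitting (the FITTING step above).
    from lazyqsar.binary.binary import MorganBinaryClassifier
    model = MorganBinaryClassifier()
    model.fit(train_smiles, train_y)
except ImportError:
    model = MajorityClassifier().fit(train_smiles, train_y)

# Persist the fitted model to avoid retraining later, then reload it
with open("hia_model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("hia_model.pkl", "rb") as f:
    reloaded = pickle.load(f)
preds = reloaded.predict(["CCC"])
```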

GemmaTuron commented 2 years ago

Hi @pauline-banye !

very good job, please also add your comments on model performance before we close this contribution!

paulinebanye commented 2 years ago

Once the model was trained, I evaluated its performance by predicting results on the validation and test sets. The list of SMILES is passed to the model and its predictions are compared to the actual values.

The outcome of this classification process is a predicted probability, which is thresholded into class 0 or 1; each prediction is then a true or false positive, or a true or false negative.

Note that these values depend on the threshold set for classification. The default threshold is 0.5, but if the threshold is lowered to 0.4, for example, molecules with a probability of 0.4 that were previously classified as negative would be classified as positive.
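The thresholding described above takes only a few lines (classify is a hypothetical helper name):

```python
def classify(probs, threshold=0.5):
    """Turn predicted probabilities into 0/1 class labels."""
    return [1 if p >= threshold else 0 for p in probs]

probs = [0.2, 0.4, 0.45, 0.7]
print(classify(probs, threshold=0.5))  # [0, 0, 0, 1]
print(classify(probs, threshold=0.4))  # [0, 1, 1, 1]
```

Lowering the threshold from 0.5 to 0.4 flips the 0.4 and 0.45 molecules from negative to positive, exactly as described.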

The metrics used for evaluating model performance on the validation and test sets are:

  1. AUROC value
  2. ROC Curve
  3. Contingency Table
  4. Precision score
  5. Recall score
  6. Accuracy
paulinebanye commented 2 years ago
  1. AUROC value

The Area Under the Receiver Operating Characteristic curve (AUROC) measures the ability of a classifier to distinguish between classes. The higher the AUROC value, the better the model is at separating the classes. An AUROC value of 0.7 - 0.8 is acceptable, 0.8 - 0.9 is excellent, and more than 0.9 is outstanding.

For this model, the AUROC values were outstanding:

    • validation set - AUROC 0.9775641025641025
    • test set - AUROC 0.9706999457406403
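AUROC has a direct probabilistic reading: the chance that a randomly chosen active molecule is scored above a randomly chosen inactive one (ties counted half). A minimal pure-Python version of that reading (the helper name auroc is ours; in practice sklearn.metrics.roc_auc_score computes this efficiently):

```python
def auroc(y_true, y_score):
    """Probability a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(auroc([0, 1], [0.2, 0.9]))                   # 1.0 (perfect separation)
```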

paulinebanye commented 2 years ago
  2. ROC Curve

A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classification model. It plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis, showing how well the model separates the classes across thresholds. Classifiers whose curves sit closer to the top-left corner perform better; the closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

    Validation set ROC Curve

    Test set ROC Curve

    This model exhibits extremely high performance. Its AUC scores for the validation and test sets are 0.98 and 0.97 respectively, and the ROC plot is almost vertical along the y-axis.
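Each point on the ROC curve is a (FPR, TPR) pair obtained by sweeping the classification threshold over the predicted scores; a small sketch (roc_points is a hypothetical helper):

```python
def roc_points(y_true, y_score):
    """(FPR, TPR) pairs swept over every distinct score threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for th in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= th)
        fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= th)
        points.append((fp / neg, tp / pos))
    return points

# A perfect classifier hugs the top-left corner: (0,0) -> (0,1) -> (1,1)
print(roc_points([0, 1], [0.2, 0.9]))  # [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
```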

paulinebanye commented 2 years ago
  3. Contingency Table

Contingency tables record the number of molecules assigned to each class after a test has been performed. They give a visual representation of how many molecules are correctly or incorrectly assigned to each class.

    Validation set - Contingency table

    From the contingency table above, we can determine that out of the total 58 molecules in the validation set, 52 were correctly classified as active (positive), 3 were correctly classified as inactive (negative), 3 were incorrectly classified as positive and none were false negatives.

    Test set - Contingency table

    From the contingency table, we can determine that out of the total 116 molecules in the test set, 97 were correctly classified as active (positive), 13 were correctly classified as inactive (negative), 6 were incorrectly classified as positive and none were false negatives.

    The numbers retrieved from the contingency tables form the basis of further analysis using classification metrics such as accuracy, precision and recall.
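The four contingency-table cells can be tallied directly from true and predicted labels; the helper below is our sketch (the data is a toy example, not the HIA sets):

```python
def contingency(y_true, y_pred):
    """Return (TP, TN, FP, FN) counts for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Toy example: five molecules, two misclassified
print(contingency([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```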

paulinebanye commented 2 years ago

Hi @pauline-banye !

very good job, please also add your comments on model performance before we close this contribution!

Thanks so much @GemmaTuron 😊! I'm currently working on the few remaining evaluations. I just need to add the precision score, recall score and accuracy, then I should be done!

paulinebanye commented 2 years ago
  4. Precision score

Precision is one of the indicators of a machine learning model's performance. It reflects the quality of the model's positive predictions: the proportion of positive class predictions that actually belong to the positive class.

    It is calculated as the number of true positives divided by the total number of positive predictions (i.e., true positives plus false positives). The best value is 1 and the worst value is 0.

    Validation set - Precision score

    Test set - Precision score

    The precision scores indicate that approximately 95% of the positive identifications on the validation set and 94% on the test set are actually correct.
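Plugging in the counts from the contingency tables above (validation: TP = 52, FP = 3; test: TP = 97, FP = 6) reproduces the reported precision values:

```python
def precision(tp, fp):
    """Fraction of positive predictions that are truly positive."""
    return tp / (tp + fp)

print(round(precision(52, 3), 4))  # 0.9455 (validation set)
print(round(precision(97, 6), 4))  # 0.9417 (test set)
```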

paulinebanye commented 2 years ago
  5. Recall score

Recall answers the question: how many of the positives are we able to identify? It is calculated as the ratio of positive molecules correctly identified as positive to the total number of positive molecules in the set.

    Recall quantifies the number of positive class predictions made out of all positive examples in the dataset. The higher the recall value, the more positive samples detected. The best recall value for a classifier is 1 and the worst is 0.

    Validation set - Recall score

    Test set - Recall score

    The recall score for both sets is 1, indicating that the classifier identifies every positive molecule. Recall equals 1 only when True Positives = True Positives + False Negatives, i.e. when there are no false negatives.
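With zero false negatives in both sets (validation: TP = 52, test: TP = 97, from the contingency tables above), recall is exactly 1:

```python
def recall(tp, fn):
    """Fraction of truly positive molecules that were identified."""
    return tp / (tp + fn)

print(recall(52, 0))  # 1.0 (validation set)
print(recall(97, 0))  # 1.0 (test set)
```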

paulinebanye commented 2 years ago
  6. Accuracy

Accuracy is one of the metrics used for evaluating classification models. It measures the number of correctly predicted molecules out of all the molecules, providing an estimate of how often the model classifies a data point correctly.

    For binary classification, accuracy can be calculated in terms of positives and negatives as follows: (TruePositive + TrueNegative) / (TruePositive + TrueNegative + FalsePositive + FalseNegative)

    Validation set - Accuracy

    Test set - Accuracy

    Our model has an accuracy score of approximately 0.95, meaning that 95% of the molecules were predicted correctly.

    In machine learning, an accuracy of 70% is often considered good, and anything between 70% and 90% is realistic and consistent with industry standards. Because this dataset is imbalanced, though, accuracy is best read alongside precision and recall.
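From the contingency counts above (test: TP = 97, TN = 13, FP = 6, FN = 0; validation: TP = 52, TN = 3, FP = 3, FN = 0):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all molecules classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

print(round(accuracy(97, 13, 6, 0), 4))  # 0.9483 (test set)
print(round(accuracy(52, 3, 3, 0), 4))   # 0.9483 (validation set)
```

Note the majority-class baseline here: predicting "active" for every test molecule would already score 97/116 ≈ 0.836, which is why the imbalance caveat above matters.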

paulinebanye commented 2 years ago

In summary, the model performance on both the validation and test sets was excellent. link to colab

paulinebanye commented 2 years ago

Hi @GemmaTuron 👋 , I have completed my evaluation on the model performance. Please can you review it?

Is there anything else I need to include in my analysis?

EstherIdabor commented 2 years ago

@pauline-banye The performance of your model is stellar and your analysis is beautiful, well-done.

paulinebanye commented 2 years ago

@pauline-banye The performance of your model is stellar and your analysis is beautiful, well-done.

@EstherIdabor Thanks so much sis 😊. Could you send me a link to yours pls?

GemmaTuron commented 2 years ago

Hello @pauline-banye !

Great job thanks for the effort! I'll mark this as closed, please focus on your final application to Outreachy.

paulinebanye commented 2 years ago

Hello @pauline-banye !

Great job thanks for the effort! I'll mark this as closed, please focus on your final application to Outreachy.

Thank you so much @GemmaTuron 😃! I really had fun playing around with the metrics.

I initially thought my model was faulty because the ROC curve didn't actually have a curve 🤭.