🦠 Model Request: DILI Predictor

Zainab-ik commented 9 months ago

Model Name

Early prediction of Drug-Induced Liver Injury

Model Description

The DILI-Predictor predicts 11 features related to DILI toxicity, including in vivo, in vitro, and physicochemical parameters. It was developed by the Broad Institute using the DILIst dataset (1020 compounds) from the FDA, and achieved a balanced accuracy of 70% on a test set of 255 compounds held out from the same dataset. The authors show that the model can correctly identify compounds that are not toxic in humans despite being toxic in mice.

Slug

DILI-predictor

Tag

Toxicity, Metabolism

Publication

https://www.biorxiv.org/content/10.1101/2024.01.10.575128v1.full

Source Code

https://github.com/srijitseal/DILI

License

MIT

Zainab-ik commented 9 months ago

@GemmaTuron @DhanshreeA Kindly review and approve.

GemmaTuron commented 9 months ago

Hi @Zainab-ik

I've modified the description. Let's try not to copy-paste parts of the abstract and instead make it a bit more explanatory. The description contained the most important items, but I think you were missing the model accuracy. The tags are Python strings, so they are case-sensitive, and they always start with a capital letter.

GemmaTuron commented 9 months ago

/approve

github-actions[bot] commented 9 months ago

New Model Repository Created! 🎉

@Zainab-ik ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos5gge

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

🍴 Get started by creating a fork of your new model repository - docs
👯 Clone your forked repository - docs
✏️ Make edits to your new forked model repository - docs - Edits might include:
- Updating the README.md file to accurately describe your model
- Add source code for your model
- Adding documentation for your model
🚀 Open a Pull Request from your forked repository to the original repository. This will allow you to bring your local changes into the new ersilia model repository that was just created! - docs

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

Zainab-ik commented 9 months ago

After reading through the publication, here is a summary:

Datasets used

DILI toxicity dataset: This combines the DILIst and DILIrank datasets, merged into a standard binary format (DILI positive and DILI negative). The compound SMILES were standardized using the MolVS Standardizer, which resulted in the gold-standard DILI dataset. It comprises 1,275 compounds (820 toxic and 455 non-toxic).
Proxy-DILI dataset: Eleven proxy-DILI labels, including the pharmacokinetic parameters, were combined to give 18,679 compounds. After SMILES standardization, the proxy-DILI dataset comprises 15,080 compounds.

Features used

93-bit Morgan fingerprints
100 MACCS keys
346 Mordred descriptors
15 physicochemical parameters

Together, these result in a 193-bit structural fingerprint vector and 361 molecular descriptors.
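
The dimensions listed above add up as follows. This is only a minimal sketch with zero-filled placeholders standing in for real features; the block names are labels chosen here for illustration, not identifiers from the authors' code.

```python
# Sizes of each feature block as listed in the summary above.
FINGERPRINT_BLOCKS = {"morgan": 93, "maccs": 100}
DESCRIPTOR_BLOCKS = {"mordred": 346, "physicochemical": 15}


def concatenate(blocks):
    """Concatenate zero-filled placeholder blocks into one flat vector."""
    vector = []
    for size in blocks.values():
        vector.extend([0] * size)
    return vector


structural_fingerprint = concatenate(FINGERPRINT_BLOCKS)
molecular_descriptors = concatenate(DESCRIPTOR_BLOCKS)

print(len(structural_fingerprint))  # 193
print(len(molecular_descriptors))   # 361
```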

Models

Random forest model: An individual RF model was built for each of the 11 proxy-DILI endpoints, with 5-fold stratified cross-validation and random halving search for hyperparameter optimization. Each model was assessed using AUC-ROC, and its performance on the other proxy-DILI labels was evaluated by comparing F1 scores and likelihood ratios.
Each individual proxy-DILI model was also trained and evaluated on the gold-standard DILI dataset.
Random forest regressor model: Two RF regressor models were built to predict the pharmacokinetic parameters: unbound plasma concentration (unbound Cmax) and total plasma concentration (total Cmax).
Model for DILI prediction: Five RF classifier models were built, each on a different feature set:
1. 193-bit structural fingerprint
2. 361 molecular descriptors
3. combination of 1 & 2
4. predicted eleven proxy-DILI labels and two predicted pharmacokinetic parameters
5. combination of all three feature spaces (1, 2 & 4)
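
To illustrate what the 5-fold stratified cross-validation mentioned above does, here is a library-free sketch. It is not the authors' code (scikit-learn's `StratifiedKFold` would be the usual tool), the function name is made up for this example, and the halving search is not shown. The toy labels reuse the class counts quoted in the dataset summary.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample index to a fold so that every fold keeps roughly
    the same class proportions as the full dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Toy labels: 820 toxic (1) and 455 non-toxic (0), as in the summary.
labels = [1] * 820 + [0] * 455
folds = stratified_folds(labels)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)  # each fold: 255 samples, 164 positives
```

Because both class counts are divisible by five, every fold ends up with the same toxic to non-toxic ratio as the whole dataset.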

The best-performing model was the combination of all three feature spaces. All predictions were evaluated using the following metrics:

F1 score
Positive predictive value
Likelihood ratio
AUC-ROC
Sensitivity
Specificity
Balanced accuracy (BA)
Matthews correlation coefficient (MCC)
Average precision score (AP)
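
Most of the metrics above can be derived from the four confusion-matrix counts. The sketch below shows the standard formulas; the counts are hypothetical and not taken from the paper, and it assumes "likelihood ratio" means the positive likelihood ratio. AUC-ROC and AP need predicted scores rather than counts, so they are left out.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present and
    both are predicted at least once).
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "positive_likelihood_ratio": sensitivity / (1 - specificity),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=30, tn=70, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```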

@GemmaTuron @DhanshreeA

GemmaTuron commented 9 months ago

Great Summary @Zainab-ik ! Next steps would be:

Create a conda environment (python 3.10 maybe?) and install the required dependencies, making sure to note down the versions
Try to run the model following the instructions from the authors - the checkpoints are available for download
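
For noting down the versions, one option is a small script run inside the activated environment. This is a sketch only: the package list is an assumption and should be adjusted to the authors' requirements, and distribution names can differ from import names.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list; replace with the model's actual dependencies.
PACKAGES = ["rdkit", "mordred", "scikit-learn", "pandas", "numpy", "shap"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```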

Zainab-ik commented 8 months ago

Hi @GemmaTuron

Attached is the link to the notebook in the implementation file here.

I've gone through it and tried running it. However, I keep getting the error,

ModuleNotFoundError Traceback (most recent call last)
in ()
      5 import numpy as np
      6 import pandas as pd
----> 7 from rdkit import Chem
      8 from rdkit.Chem import inchi
      9 from rdkit.Chem.MolStandardize import rdMolStandardize

ModuleNotFoundError: No module named 'rdkit'

NOTE: If your import is failing due to a missing package, you can manually install dependencies using either !pip or !apt.

What I've tried:

Installed Miniconda
Created a Python environment
Installed RDKit with pip, with apt, and through kora; the error persists. Kindly point me in the right direction.

GemmaTuron commented 8 months ago

Hi @Zainab-ik

Where are you running the notebook, on your local machine? Make sure that the right conda environment is active when running the notebook. I recommend using Visual Studio Code for that, which integrates well with conda environments. The steps you list look good. In the terminal, try the following:

Activate the conda environment
Type python
Type import rdkit

Does the package get imported? This will tell you whether the installation itself is failing.

GemmaTuron commented 8 months ago

Also make sure you have installed the package in the right conda env, not in your base
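
The checks above can also be run from inside the notebook itself, which shows the interpreter the kernel is really using. A minimal sketch follows; the function name is made up for this example, and `CONDA_DEFAULT_ENV` is only set when a conda environment has been activated.

```python
import importlib.util
import os
import sys


def describe_environment(module_name="rdkit"):
    """Report which interpreter is running and whether a module is importable."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "module_found": spec is not None,
        "module_location": spec.origin if spec else None,
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If `interpreter` points outside the environment where the package was installed, the notebook kernel is attached to a different Python than the terminal.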

DhanshreeA commented 8 months ago

@GemmaTuron @Zainab-ik anything you need from me?

GemmaTuron commented 8 months ago

I did not yet hear back from @Zainab-ik, can you update us?

Zainab-ik commented 8 months ago

Hi, I've changed OS on my side. I'll start all over and update you. Thank you @DhanshreeA @GemmaTuron

Zainab-ik commented 8 months ago

Update

RDKit now imports successfully, and I restarted the run from scratch with the help of @DhanshreeA.

While running this for loop:

    for s in smiles_list:
        smiles = unquote(s)

        smiles_r = standardized_smiles(smiles)
        test = {'smiles_r':  [smiles_r]
                    }
        test = pd.DataFrame(test)

        desc=pd.read_csv("all_features_desc.csv", encoding='windows-1252')

        molecule = Chem.MolFromSmiles(smiles_r)     
        #st.image(Draw.MolToImage(molecule), width=200)

        test_mfp_Mordred = calc_all_fp_desc(test)
        test_mfp_Mordred_liv = predict_liv_all(test_mfp_Mordred)
        test_mfp_Mordred_liv_values = test_mfp_Mordred_liv.T.reset_index().rename(columns={"index":"name", 0: "value"})

        interpret, y_proba, y_pred = predict_DILI(test_mfp_Mordred_liv)   
        interpret = pd.merge(interpret, desc, right_on="name", left_on="name", how="outer")
        interpret = pd.merge(interpret, test_mfp_Mordred_liv_values, right_on="name", left_on="name", how="inner") 

        print(y_proba[0])
        print(y_pred[0]) 

        if(y_pred[0]==1):
            print("The compound is predicted DILI-Positive")
        if(y_pred[0]==0):
            print("The compound is predicted DILI-Negative")

        print("unbound Cmax: ", np.round(10**-test_mfp_Mordred_liv["median pMolar unbound plasma concentration"][0] *10**6, 2), "uM")
        print("total Cmax: ", np.round(10**-test_mfp_Mordred_liv["median pMolar total plasma concentration"][0] *10**6, 2), "uM")
        print("Most contributing MACCS substructure to DILI toxicity")

        top = interpret[interpret["SHAP"]>0].sort_values(by=["SHAP"], ascending=False)
        proxy_DILI_SHAP_top = pd.merge(info, top[top["name"].isin(liv_data)])
        proxy_DILI_SHAP_top["pred"] = proxy_DILI_SHAP_top["value"]>0.50
        proxy_DILI_SHAP_top["SHAP contribution to Toxicity"] = "Positive"
        proxy_DILI_SHAP_top["smiles"] = smiles_r

        top_positives = top[top["value"]==1]
        top_MACCS= top_positives[top_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["description"].values[0]
        top_MACCS_value= top_positives[top_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["value"].values[0]
        top_MACCS_shap= top_positives[top_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["SHAP"].values[0] 
        top_MACCSsubstructure = Chem.MolFromSmarts(top_MACCS)

        Draw.MolToImage(molecule, highlightAtoms=molecule.GetSubstructMatch(top_MACCSsubstructure), width=400)        
        print("Presence of this substructure contributes", np.round(top_MACCS_shap, 4), "to prediction")

        print("Most contributing MACCS substructure to DILI safety")
        bottom = interpret[interpret["SHAP"]<0].sort_values(by=["SHAP"], ascending=True)
        proxy_DILI_SHAP_bottom = pd.merge(info, bottom[bottom["name"].isin(liv_data)])
        proxy_DILI_SHAP_bottom["pred"] = proxy_DILI_SHAP_bottom["value"]>0.50
        proxy_DILI_SHAP_bottom["SHAP contribution to Toxicity"] = "Negative"
        proxy_DILI_SHAP_bottom["smiles"] = smiles_r

        bottom_positives = bottom[bottom["value"]==1]
        bottom_MACCS= bottom_positives[bottom_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["description"].values[0]
        bottom_MACCS_value= bottom_positives[bottom_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["value"].values[0]
        bottom_MACCS_shap= bottom_positives[bottom_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["SHAP"].values[0]     
        bottom_MACCSsubstructure = Chem.MolFromSmarts(bottom_MACCS)

        Draw.MolToImage(molecule, highlightAtoms=molecule.GetSubstructMatch(bottom_MACCSsubstructure), width=400) 
        print("Presence of this substructure contributes", np.round(bottom_MACCS_shap, 4), "to prediction")

I got the error

ValueError                                Traceback (most recent call last)
<ipython-input-48-bb1e4f39d195> in <cell line: 1>()
     14 
     15     test_mfp_Mordred = calc_all_fp_desc(test)
---> 16     test_mfp_Mordred_liv = predict_liv_all(test_mfp_Mordred)
     17     test_mfp_Mordred_liv_values = test_mfp_Mordred_liv.T.reset_index().rename(columns={"index":"name", 0: "value"})
     18 

6 frames
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    159                 "#estimators-that-handle-nan-values"
    160             )
--> 161         raise ValueError(msg_err)
    162 
    163 

ValueError: Input X contains NaN.
RandomForestClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

I checked the input data and got

smiles_r    0
Mfp0        0
Mfp1        0
Mfp2        0
Mfp3        0
           ..
WPol        0
Zagreb1     0
Zagreb2     0
mZagreb1    0
mZagreb2    0
Length: 3844, dtype: int64

I also tried SimpleImputer and dropna(), but the same error persists. This is the link to the notebook - here @DhanshreeA
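
One thing worth ruling out: the summary printed above elides most of its 3,844 entries (`..`), so a non-zero count could be hidden in the middle, and the NaN may also be introduced later inside `predict_liv_all`. A library-free sketch that lists only the problematic columns is shown below. The function and column names are hypothetical, and this is a diagnostic aid, not a confirmed cause of the error.

```python
import math


def columns_with_bad_values(table):
    """Return {column: count} for columns holding NaN, infinite, or
    non-numeric values. `table` maps column names to lists of values."""
    bad = {}
    for column, values in table.items():
        count = 0
        for value in values:
            try:
                number = float(value)
            except (TypeError, ValueError):
                count += 1  # value cannot be converted to a number
                continue
            if not math.isfinite(number):
                count += 1  # NaN or +/- infinity
        if count:
            bad[column] = count
    return bad


# Hypothetical single-row feature table; column names are illustrative only.
features = {
    "Mfp0": [0],
    "DescA": [float("nan")],
    "DescB": [float("inf")],
    "DescC": ["error"],
    "Zagreb1": [42.0],
}
print(columns_with_bad_values(features))
```

A pandas DataFrame can be passed in as `df.to_dict("list")`. Text columns such as `smiles_r` would be flagged as non-numeric, so they should be dropped first.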

GemmaTuron commented 4 months ago

Hi @Zainab-ik and @DhanshreeA This is still a work in progress, any update? Should I assign this model to one of the new interns, or do you want to continue working on it, @Zainab-ik?

Zainab-ik commented 4 months ago

Hi @GemmaTuron I'd like to continue working on it.

DhanshreeA commented 4 months ago

Awesome @Zainab-ik, let me know if you need any help

GemmaTuron commented 3 months ago

Hello @Zainab-ik

What is the status of this? Please let us know if you have the capacity to tackle it; otherwise we will assign it to someone else.

Zainab-ik commented 3 months ago

Hi @GemmaTuron This can be reassigned. Apologies I couldn't get on with it.

GemmaTuron commented 2 months ago

I am going to try out model incorporation with the new ersilia template

GemmaTuron commented 2 months ago

/approve

github-actions[bot] commented 2 months ago

Workflow Failure ❌

@ (or other maintainers) the /approve workflow has failed. View the logs here for more information:

🔗 Workflow logs

You may need to delete the following repo that was created via this workflow run since the run was not fully successful: ersilia-os/eos3n69

GemmaTuron commented 2 months ago

I have deleted the repository. @DhanshreeA, please check and amend what is failing.

GemmaTuron commented 2 months ago

/approve

github-actions[bot] commented 2 months ago

Workflow Failure ❌

@ (or other maintainers) the /approve workflow has failed. View the logs here for more information:

🔗 Workflow logs

You may need to delete the following repo that was created via this workflow run since the run was not fully successful: ersilia-os/eos6ubs

GemmaTuron commented 2 months ago

I've deleted this second repository. I could easily install pyyaml, but I do not want to touch the current error, as I am completely unfamiliar with the new eos-template. @DhanshreeA let me know

GemmaTuron commented 2 months ago

In the meantime, I have used the eos5gge repo to build the model. Maybe it is best to leave that one as completed and think about which models we want to reformat, probably the ones we use the most. This can also be a good task for Outreachy applicants.

DhanshreeA commented 2 months ago

/approve

github-actions[bot] commented 2 months ago

New Model Repository Created! 🎉

@Zainab-ik ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos7ioj

GemmaTuron commented 2 months ago

We will not refactor this model into the new template at the moment. I have deleted the repository.
