ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: DILI Predictor #968

Closed Zainab-ik closed 2 months ago

Zainab-ik commented 9 months ago

Model Name

Early prediction of Drug-Induced Liver Injury

Model Description

The DILI-Predictor predicts 11 features related to DILI toxicity, including in vivo, in vitro, and physicochemical parameters. It was developed by the Broad Institute using the FDA's DILIst dataset (1020 compounds) and achieved a balanced accuracy of 70% on a test set of 255 compounds held out from the same dataset. The authors show that the model can correctly identify compounds that are not toxic in humans despite being toxic in mice.

Slug

DILI-predictor

Tag

Toxicity, Metabolism

Publication

https://www.biorxiv.org/content/10.1101/2024.01.10.575128v1.full

Source Code

https://github.com/srijitseal/DILI

License

MIT

Zainab-ik commented 9 months ago

@GemmaTuron @DhanshreeA Kindly review and approve.

GemmaTuron commented 9 months ago

Hi @Zainab-ik

I've modified the description. Let's try not to copy and paste parts of the abstract; make it a bit more explanatory instead. The description contained the most important items, but I think it was missing a highlight of the model accuracy. The tags are Python strings, so they are case sensitive, and they always start with a capital letter.
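To illustrate the point about case sensitivity, here is a minimal sketch of an exact-match tag check. The `APPROVED_TAGS` set and the `invalid_tags` helper are hypothetical names made up for this example; the set is not the real list of approved tags.

```python
# Hypothetical subset of tags, used only for illustration (not the real approved list).
APPROVED_TAGS = {"Toxicity", "Metabolism"}


def invalid_tags(tags):
    """Return the tags that do not match an approved tag exactly (case-sensitive)."""
    return [tag for tag in tags if tag not in APPROVED_TAGS]


# "metabolism" is rejected because string comparison in Python is case-sensitive.
print(invalid_tags(["Toxicity", "metabolism"]))  # → ['metabolism']
```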

GemmaTuron commented 9 months ago

/approve

github-actions[bot] commented 9 months ago

New Model Repository Created! 🎉

@Zainab-ik ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos5gge

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

Zainab-ik commented 9 months ago

After reading through the publication:

Dataset Used

Features used

Models

The best-performing model was the combination of all three feature spaces. All predictions were evaluated using the;

@GemmaTuron @DhanshreeA

GemmaTuron commented 9 months ago

Great summary, @Zainab-ik! Next steps would be:

Zainab-ik commented 8 months ago

Hi @GemmaTuron

Attached is the link to the notebook in the implementation file here.

I've gone through it and tried running it. However, I keep getting this error:


ModuleNotFoundError                       Traceback (most recent call last)
in ()
      5 import numpy as np
      6 import pandas as pd
----> 7 from rdkit import Chem
      8 from rdkit.Chem import inchi
      9 from rdkit.Chem.MolStandardize import rdMolStandardize

ModuleNotFoundError: No module named 'rdkit'


NOTE: If your import is failing due to a missing package, you can manually install dependencies using either !pip or !apt.

What I've tried.

GemmaTuron commented 8 months ago

Hi @Zainab-ik

Where are you running the notebook, locally? Make sure that the right conda environment is active when running the notebook. I recommend using Visual Studio Code for that, which makes it easy to integrate code development. The steps you list look good, if in the terminal you:

Does the package get imported? This will tell you whether the failure is in installing the package.

GemmaTuron commented 8 months ago

Also make sure you have installed the package in the right conda env, not in your base env.
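As a quick way to confirm which environment the notebook is actually using, something like the following can be run in a notebook cell or in the terminal. This is a minimal sketch using only the standard library; `check_package` is a helper name made up for this example.

```python
import importlib.util
import sys


def check_package(name: str) -> bool:
    """Return True if the top-level package `name` is visible to this interpreter."""
    return importlib.util.find_spec(name) is not None


# The interpreter path shows which environment is active (e.g. base vs. a named conda env).
print("Interpreter:", sys.executable)
print("rdkit importable:", check_package("rdkit"))
```

If the interpreter path points at the base environment, or the second line prints `False`, the package was most likely installed into a different environment than the one running the notebook.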

DhanshreeA commented 8 months ago

@GemmaTuron @Zainab-ik anything you need from me?

GemmaTuron commented 8 months ago

I did not yet hear back from @Zainab-ik, can you update us?

Zainab-ik commented 8 months ago

Hi, I have changed the OS on my side. I'll start all over and post an update. Thank you @DhanshreeA @GemmaTuron

Zainab-ik commented 8 months ago

Update

RDKit now imports successfully, and I restarted the run from scratch with the help of @DhanshreeA.

While running this for loop:

    for s in smiles_list:
        smiles = unquote(s)

        smiles_r = standardized_smiles(smiles)
        test = {'smiles_r':  [smiles_r]
                    }
        test = pd.DataFrame(test)

        desc=pd.read_csv("all_features_desc.csv", encoding='windows-1252')

        molecule = Chem.MolFromSmiles(smiles_r)     
        #st.image(Draw.MolToImage(molecule), width=200)

        test_mfp_Mordred = calc_all_fp_desc(test)
        test_mfp_Mordred_liv = predict_liv_all(test_mfp_Mordred)
        test_mfp_Mordred_liv_values = test_mfp_Mordred_liv.T.reset_index().rename(columns={"index":"name", 0: "value"})

        interpret, y_proba, y_pred = predict_DILI(test_mfp_Mordred_liv)   
        interpret = pd.merge(interpret, desc, right_on="name", left_on="name", how="outer")
        interpret = pd.merge(interpret, test_mfp_Mordred_liv_values, right_on="name", left_on="name", how="inner") 

        print(y_proba[0])
        print(y_pred[0]) 

        if(y_pred[0]==1):
            print("The compound is predicted DILI-Positive")
        if(y_pred[0]==0):
            print("The compound is predicted DILI-Negative")

        print("unbound Cmax: ", np.round(10**-test_mfp_Mordred_liv["median pMolar unbound plasma concentration"][0] *10**6, 2), "uM")
        print("total Cmax: ", np.round(10**-test_mfp_Mordred_liv["median pMolar total plasma concentration"][0] *10**6, 2), "uM")
        print("Most contributing MACCS substructure to DILI toxicity")

        top = interpret[interpret["SHAP"]>0].sort_values(by=["SHAP"], ascending=False)
        proxy_DILI_SHAP_top = pd.merge(info, top[top["name"].isin(liv_data)])
        proxy_DILI_SHAP_top["pred"] = proxy_DILI_SHAP_top["value"]>0.50
        proxy_DILI_SHAP_top["SHAP contribution to Toxicity"] = "Positive"
        proxy_DILI_SHAP_top["smiles"] = smiles_r

        top_positives = top[top["value"]==1]
        top_MACCS= top_positives[top_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["description"].values[0]
        top_MACCS_value= top_positives[top_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["value"].values[0]
        top_MACCS_shap= top_positives[top_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["SHAP"].values[0] 
        top_MACCSsubstructure = Chem.MolFromSmarts(top_MACCS)

        Draw.MolToImage(molecule, highlightAtoms=molecule.GetSubstructMatch(top_MACCSsubstructure), width=400)        
        print("Presence of this substructure contributes", np.round(top_MACCS_shap, 4), "to prediction")

        print("Most contributing MACCS substructure to DILI safety")
        bottom = interpret[interpret["SHAP"]<0].sort_values(by=["SHAP"], ascending=True)
        proxy_DILI_SHAP_bottom = pd.merge(info, bottom[bottom["name"].isin(liv_data)])
        proxy_DILI_SHAP_bottom["pred"] = proxy_DILI_SHAP_bottom["value"]>0.50
        proxy_DILI_SHAP_bottom["SHAP contribution to Toxicity"] = "Negative"
        proxy_DILI_SHAP_bottom["smiles"] = smiles_r

        bottom_positives = bottom[bottom["value"]==1]
        bottom_MACCS= bottom_positives[bottom_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["description"].values[0]
        bottom_MACCS_value= bottom_positives[bottom_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["value"].values[0]
        bottom_MACCS_shap= bottom_positives[bottom_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["SHAP"].values[0]     
        bottom_MACCSsubstructure = Chem.MolFromSmarts(bottom_MACCS)

        Draw.MolToImage(molecule, highlightAtoms=molecule.GetSubstructMatch(bottom_MACCSsubstructure), width=400) 
        print("Presence of this substructure contributes", np.round(bottom_MACCS_shap, 4), "to prediction")

I got this error:

ValueError                                Traceback (most recent call last)
<ipython-input-48-bb1e4f39d195> in <cell line: 1>()
     14 
     15     test_mfp_Mordred = calc_all_fp_desc(test)
---> 16     test_mfp_Mordred_liv = predict_liv_all(test_mfp_Mordred)
     17     test_mfp_Mordred_liv_values = test_mfp_Mordred_liv.T.reset_index().rename(columns={"index":"name", 0: "value"})
     18 

6 frames
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    159                 "#estimators-that-handle-nan-values"
    160             )
--> 161         raise ValueError(msg_err)
    162 
    163 

ValueError: Input X contains NaN.
RandomForestClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

I checked the input data and got:

smiles_r    0
Mfp0        0
Mfp1        0
Mfp2        0
Mfp3        0
           ..
WPol        0
Zagreb1     0
Zagreb2     0
mZagreb1    0
mZagreb2    0
Length: 3844, dtype: int64

I also tried SimpleImputer and dropna(), but the same error persists. This is the link to the notebook - here @DhanshreeA
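The printed summary above is truncated (the `..` row hides most of the 3844 columns), so NaN or infinite values could still sit in columns that are not shown. A minimal sketch for listing only the affected columns follows. The `report_non_finite` helper and the toy DataFrame are made up for this example; the toy frame merely stands in for the output of `calc_all_fp_desc`, and non-numeric columns are skipped.

```python
import numpy as np
import pandas as pd


def report_non_finite(features: pd.DataFrame) -> pd.Series:
    """Count NaN/inf entries per numeric column and keep only the affected columns."""
    numeric = features.select_dtypes(include=[np.number])
    bad = ~np.isfinite(numeric.to_numpy(dtype=float))
    counts = pd.Series(bad.sum(axis=0), index=numeric.columns)
    return counts[counts > 0]


# Toy single-row frame (hypothetical values) standing in for the real feature table.
toy = pd.DataFrame(
    {"smiles_r": ["CCO"], "Mfp0": [0], "WPol": [np.nan], "Zagreb1": [np.inf]}
)
bad_columns = report_non_finite(toy)
print(bad_columns)

# One possible workaround (an assumption, not the authors' method):
# replace non-finite values with 0 before calling the model.
cleaned = toy.copy()
num_cols = cleaned.select_dtypes(include=[np.number]).columns
cleaned[num_cols] = cleaned[num_cols].replace([np.inf, -np.inf], np.nan).fillna(0)
```

Two details may explain why the earlier attempts did not help, though I have not verified this against the notebook: with a single-row frame, `dropna()` defaults to dropping rows, so it removes the only row, and a mean-based imputer has nothing to average in a column whose single value is NaN. Filling with 0 is only a placeholder choice and may not match how missing descriptors were handled when the model was trained.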

GemmaTuron commented 4 months ago

Hi @Zainab-ik and @DhanshreeA, this is still a work in progress. Any update? Should I assign this model to one of the new interns, or do you want to continue working on it, @Zainab-ik?

Zainab-ik commented 4 months ago

Hi @GemmaTuron I'd like to continue working on it.

DhanshreeA commented 4 months ago

Awesome @Zainab-ik, let me know if you need any help

GemmaTuron commented 3 months ago

Hello @Zainab-ik

What is the status of this? Please let us know if you have capacity to tackle this, because otherwise we will assign it to someone else.

Zainab-ik commented 3 months ago

Hi @GemmaTuron, this can be reassigned. Apologies that I couldn't get on with it.

GemmaTuron commented 2 months ago

I am going to try out model incorporation with the new ersilia template

GemmaTuron commented 2 months ago

/approve

github-actions[bot] commented 2 months ago

Workflow Failure ❌

@ (or other maintainers) the /approve workflow has failed. View the logs here for more information:

🔗 Workflow logs

You may need to delete the following repo that was created via this workflow run since the run was not fully successful: ersilia-os/eos3n69

GemmaTuron commented 2 months ago

I have deleted the repository. @DhanshreeA, please check and amend what is failing.

GemmaTuron commented 2 months ago

/approve

github-actions[bot] commented 2 months ago

Workflow Failure ❌

@ (or other maintainers) the /approve workflow has failed. View the logs here for more information:

🔗 Workflow logs

You may need to delete the following repo that was created via this workflow run since the run was not fully successful: ersilia-os/eos6ubs

GemmaTuron commented 2 months ago

I've deleted this second repository. I could easily install pyyaml, but I do not want to touch the error that appears now, as I am completely unfamiliar with the new eos-template. @DhanshreeA, let me know.

GemmaTuron commented 2 months ago

I have meanwhile used the eos5gge repo to build the model. Maybe it is best to leave that one as completed and think about which models we want to reformat, probably the ones we use the most. This can also be a good task for Outreachy applicants.

DhanshreeA commented 2 months ago

/approve

github-actions[bot] commented 2 months ago

New Model Repository Created! 🎉

@Zainab-ik ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos7ioj

GemmaTuron commented 2 months ago

We will not refactor this model into the new template at the moment. I have deleted the repository.