Closed Zainab-ik closed 2 months ago
@GemmaTuron @DhanshreeA Kindly review and approve.
Hi @Zainab-ik
I've modified the description. Let's try not to copy-paste parts of the abstract and instead make it a bit more explanatory. The description contained the most important items, but I think you were missing highlighting the model accuracy. The tags are Python strings, so they are case-sensitive, and they always start with a capital letter.
/approve
@Zainab-ik ersilia model repository has been successfully created and is available at:
ersilia-os/eos5gge
Now that your new model repository has been created, you are ready to start contributing to it!
Here are some brief starter steps for contributing to your new model repository:
Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository
README.md file to accurately describe your model
If you have any questions, please feel free to open an issue and get support from the community!
After reading through the publication:
Dataset Used
DILI Toxicity dataset: This combines the DILIst and the DILIrank datasets. The datasets were combined into a standard format: DILI positive and DILI negative. The compound SMILES were standardized using the MolVS Standardizer, which resulted in the GOLD standard DILI dataset. It comprises 1,275 compounds (820 toxic and 455 non-toxic).
Proxy DILI dataset: Eleven proxy DILI labels, including the pharmacokinetic parameters, were combined to give 18,679 compounds. After SMILES standardization, the Proxy DILI dataset comprises 15,080 compounds.
Features used
Models
The best-performing model was the combination of all three feature spaces. All predictions were evaluated using the;
@GemmaTuron @DhanshreeA
Great Summary @Zainab-ik ! Next steps would be:
Hi @GemmaTuron
Attached is the link to the notebook in the implementation file here.
I've gone through it and tried running it. However, I keep getting the error,
ModuleNotFoundError Traceback (most recent call last)
in ()
      5 import numpy as np
      6 import pandas as pd
----> 7 from rdkit import Chem
      8 from rdkit.Chem import inchi
      9 from rdkit.Chem.MolStandardize import rdMolStandardize
ModuleNotFoundError: No module named 'rdkit'
NOTE: If your import is failing due to a missing package, you can manually install dependencies using either !pip or !apt.
What I've tried.
Hi @Zainab-ik
Where are you running the notebook, locally? Make sure that the right conda environment is active when running the notebook. I recommend using Visual Studio Code for that, which makes it easy to integrate code development. The steps you list look good. If you run this in the terminal:
$ python
import rdkit
does the package get imported? This will tell you whether the package installation is what is failing.
Also make sure you have installed the package in the right conda env, not in your base.
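The check suggested above can also be scripted, which helps when the notebook kernel and the terminal may be using different environments. The following is a minimal sketch using only the standard library; the helper name `check_package` is hypothetical and not part of Ersilia or RDKit.

```python
import importlib.util
import sys


def check_package(name: str) -> dict:
    """Report the running interpreter and whether `name` is importable from it."""
    # find_spec returns None when a top-level package cannot be found.
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,  # path of the Python binary in use
        "prefix": sys.prefix,           # root of the active environment
        "package": name,
        "installed": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    for key, value in check_package("rdkit").items():
        print(f"{key}: {value}")
```

Running the same snippet in a notebook cell and in the terminal shows whether both point at the same interpreter; if the `interpreter` paths differ, the package was probably installed into a different environment than the one the notebook uses.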
@GemmaTuron @Zainab-ik anything you need from me?
I did not yet hear back from @Zainab-ik, can you update us?
Hi, I have changed OS on my side. I'll start all over and post an update. Thank you @DhanshreeA @GemmaTuron
Update
RDKit now works, and I restarted the run from scratch with the help of @DhanshreeA
While running this for loop:
for s in smiles_list:
    smiles = unquote(s)
    smiles_r = standardized_smiles(smiles)
    test = {'smiles_r': [smiles_r]}
    test = pd.DataFrame(test)
    desc = pd.read_csv("all_features_desc.csv", encoding='windows-1252')
    molecule = Chem.MolFromSmiles(smiles_r)
    #st.image(Draw.MolToImage(molecule), width=200)
    test_mfp_Mordred = calc_all_fp_desc(test)
    test_mfp_Mordred_liv = predict_liv_all(test_mfp_Mordred)
    test_mfp_Mordred_liv_values = test_mfp_Mordred_liv.T.reset_index().rename(columns={"index":"name", 0: "value"})
    interpret, y_proba, y_pred = predict_DILI(test_mfp_Mordred_liv)
    interpret = pd.merge(interpret, desc, right_on="name", left_on="name", how="outer")
    interpret = pd.merge(interpret, test_mfp_Mordred_liv_values, right_on="name", left_on="name", how="inner")
    print(y_proba[0])
    print(y_pred[0])
    if(y_pred[0]==1):
        print("The compound is predicted DILI-Positive")
    if(y_pred[0]==0):
        print("The compound is predicted DILI-Negative")
    print("unbound Cmax: ", np.round(10**-test_mfp_Mordred_liv["median pMolar unbound plasma concentration"][0] *10**6, 2), "uM")
    print("total Cmax: ", np.round(10**-test_mfp_Mordred_liv["median pMolar total plasma concentration"][0] *10**6, 2), "uM")
    print("Most contributing MACCS substructure to DILI toxicity")
    top = interpret[interpret["SHAP"]>0].sort_values(by=["SHAP"], ascending=False)
    proxy_DILI_SHAP_top = pd.merge(info, top[top["name"].isin(liv_data)])
    proxy_DILI_SHAP_top["pred"] = proxy_DILI_SHAP_top["value"]>0.50
    proxy_DILI_SHAP_top["SHAP contribution to Toxicity"] = "Positive"
    proxy_DILI_SHAP_top["smiles"] = smiles_r
    top_positives = top[top["value"]==1]
    top_MACCS = top_positives[top_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["description"].values[0]
    top_MACCS_value = top_positives[top_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["value"].values[0]
    top_MACCS_shap = top_positives[top_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["SHAP"].values[0]
    top_MACCSsubstructure = Chem.MolFromSmarts(top_MACCS)
    Draw.MolToImage(molecule, highlightAtoms=molecule.GetSubstructMatch(top_MACCSsubstructure), width=400)
    print("Presence of this substructure contributes", np.round(top_MACCS_shap, 4), "to prediction")
    print("Most contributing MACCS substructure to DILI safety")
    bottom = interpret[interpret["SHAP"]<0].sort_values(by=["SHAP"], ascending=True)
    proxy_DILI_SHAP_bottom = pd.merge(info, bottom[bottom["name"].isin(liv_data)])
    proxy_DILI_SHAP_bottom["pred"] = proxy_DILI_SHAP_bottom["value"]>0.50
    proxy_DILI_SHAP_bottom["SHAP contribution to Toxicity"] = "Negative"
    proxy_DILI_SHAP_bottom["smiles"] = smiles_r
    bottom_positives = bottom[bottom["value"]==1]
    bottom_MACCS = bottom_positives[bottom_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["description"].values[0]
    bottom_MACCS_value = bottom_positives[bottom_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["value"].values[0]
    bottom_MACCS_shap = bottom_positives[bottom_positives.name.isin(desc.name.to_list()[-166:])].iloc[:1, :]["SHAP"].values[0]
    bottom_MACCSsubstructure = Chem.MolFromSmarts(bottom_MACCS)
    Draw.MolToImage(molecule, highlightAtoms=molecule.GetSubstructMatch(bottom_MACCSsubstructure), width=400)
    print("Presence of this substructure contributes", np.round(bottom_MACCS_shap, 4), "to prediction")
I got the error
ValueError                                Traceback (most recent call last)
<ipython-input-48-bb1e4f39d195> in <cell line: 1>()
     14
     15 test_mfp_Mordred = calc_all_fp_desc(test)
---> 16 test_mfp_Mordred_liv = predict_liv_all(test_mfp_Mordred)
     17 test_mfp_Mordred_liv_values = test_mfp_Mordred_liv.T.reset_index().rename(columns={"index":"name", 0: "value"})
     18
6 frames
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    159 "#estimators-that-handle-nan-values"
    160 )
--> 161 raise ValueError(msg_err)
    162
    163
ValueError: Input X contains NaN.
RandomForestClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
I checked the input data and got
smiles_r 0
Mfp0 0
Mfp1 0
Mfp2 0
Mfp3 0
..
WPol 0
Zagreb1 0
Zagreb2 0
mZagreb1 0
mZagreb2 0
Length: 3844, dtype: int64
I also tried SimpleImputer and dropna(), but the same error persists. This is the link to the notebook - here @DhanshreeA
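One thing worth noting about the check above: the printed per-column counts are truncated (the `..` row), so a non-zero count in the middle of the 3,844 columns would not be visible. Values that are not numeric at all, such as descriptor error objects, are also not counted by a plain missing-value check and may only turn into NaN later, when the features are converted to numbers. The sketch below is one way to list only the problematic columns. It assumes pandas and NumPy are available; `report_non_finite` is a hypothetical helper, not part of the DILI code, and it cannot tell whether the NaN is introduced inside `predict_liv_all` itself.

```python
import numpy as np
import pandas as pd


def report_non_finite(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column that holds NaN, infinite, or non-numeric values."""
    # Coerce every column to numeric; values that cannot be parsed become NaN.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    values = numeric.to_numpy(dtype=float)
    report = pd.DataFrame(
        {
            # NaN already present in the input frame
            "nan_in_input": df.isna().sum().to_numpy(),
            # values that only became NaN after numeric coercion
            "non_numeric": (numeric.isna() & df.notna()).sum().to_numpy(),
            # +inf or -inf entries
            "infinite": np.isinf(values).sum(axis=0),
        },
        index=df.columns,
    )
    # Keep only the columns with at least one problem, so nothing is truncated away.
    return report[report.sum(axis=1) > 0]
```

A possible use is `report_non_finite(test_mfp_Mordred.drop(columns=["smiles_r"]))` right before the failing call. An empty result would suggest the NaN arises in a later step rather than in the descriptor table.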
Hi @Zainab-ik and @DhanshreeA, this is still work in progress. Any update? Should I assign this model to one of the new interns, or do you want to continue working on this @Zainab-ik?
Hi @GemmaTuron I'd like to continue working on it.
Awesome @Zainab-ik, let me know if you need any help
Hello @Zainab-ik
what is the status of this? Please let us know if you have capacity to tackle this because otherwise we will assign it to someone else.
Hi @GemmaTuron This can be reassigned. Apologies I couldn't get on with it.
I am going to try out model incorporation with the new ersilia template
/approve
@ (or other maintainers) the /approve workflow has failed. View the logs here for more information:
Workflow logs
You may need to delete the following repo that was created via this workflow run since the run was not fully successful: ersilia-os/eos3n69
I have deleted the repository. @DhanshreeA please check and amend what is failing.
/approve
@ (or other maintainers) the /approve workflow has failed. View the logs here for more information:
Workflow logs
You may need to delete the following repo that was created via this workflow run since the run was not fully successful: ersilia-os/eos6ubs
I've deleted this second repository. I could easily install pyyaml, but I do not want to touch the current error as I am completely unfamiliar with the new eos-template. @DhanshreeA let me know.
I have meanwhile used the eos5gge repo to build the model. Maybe it is best that we leave that one as completed and think about which models we want to reformat. The ones we use the most, probably. This can also be a good task for Outreachy applicants.
/approve
@Zainab-ik ersilia model repository has been successfully created and is available at:
ersilia-os/eos7ioj
Now that your new model repository has been created, you are ready to start contributing to it!
Here are some brief starter steps for contributing to your new model repository:
Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository
README.md file to accurately describe your model
If you have any questions, please feel free to open an issue and get support from the community!
We will not refactor this model into the new template at the moment. I have deleted the repository.
Model Name
Early prediction of Drug-Induced Liver Injury
Model Description
The DILI-Predictor predicts 11 features related to DILI toxicity, including in vivo, in vitro, and physicochemical parameters. It has been developed by the Broad Institute, using the DILIst dataset (1020 compounds) from the FDA, and achieved a balanced accuracy of 70% on a test set of 255 compounds held out from the same dataset. The authors show how the model can correctly predict compounds that are not toxic in humans despite being toxic in mice.
Slug
DILI-predictor
Tag
Toxicity, Metabolism
Publication
https://www.biorxiv.org/content/10.1101/2024.01.10.575128v1.full
Source Code
https://github.com/srijitseal/DILI
License
MIT
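
Assuming the 70% figure in the Model Description refers to balanced accuracy (the mean of sensitivity and specificity), the sketch below shows how that metric differs from plain accuracy on an imbalanced set such as this one, where toxic compounds outnumber non-toxic ones. The labels in the usage note are made up for illustration and are not taken from the publication.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = DILI-positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # recall on the DILI-positive class
    specificity = tn / (tn + fp)  # recall on the DILI-negative class
    return (sensitivity + specificity) / 2
```

For the illustrative labels `y_true = [1, 1, 1, 1, 0, 0]` and `y_pred = [1, 1, 1, 0, 0, 1]`, sensitivity is 3/4 and specificity is 1/2, so the balanced accuracy is 0.625, while plain accuracy is 4/6.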