🦠 Model Request: Prediction of clinically relevant drug‐induced liver injury from structure using machine learning

leilayesufu commented 10 months ago

Model Name

Drug-induced liver injury prediction

Model Description

Prediction of clinically relevant drug-induced liver injury (DILI), based solely on drug structure, using binary classification methods. The results presented here are useful as a screening tool both in a clinical setting, for the assessment of DILI, and in the early stages of drug development, to rule out potentially hepatotoxic candidates.

Slug

dili-pred

Tag

Metabolism, Toxicity, Cytotoxicity

Publication

https://pubmed.ncbi.nlm.nih.gov/30325042/

Source Code

https://github.com/cptbern/QSAR_DILI_2019

License

None

leilayesufu commented 10 months ago

@GemmaTuron Good morning

GemmaTuron commented 10 months ago

/approve

github-actions[bot] commented 10 months ago

New Model Repository Created! 🎉

@leilayesufu Ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos7e3s

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

🍴 Get started by creating a fork of your new model repository - docs
👯 Clone your forked repository - docs
✏️ Make edits to your new forked model repository - docs - Edits might include:
- Updating the README.md file to accurately describe your model
- Add source code for your model
- Adding documentation for your model
🚀 Open a Pull Request from your forked repository to the original repository. This will allow you to bring your local changes into the new ersilia model repository that was just created! - docs

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

DhanshreeA commented 10 months ago

Hi @leilayesufu, the original contributor to the DILI issue did some preprocessing of the dataset provided by the publishers of this model. They have kindly shared that preprocessed dataset as well as notebooks for the models they tried training. Refer to this comment: https://github.com/ersilia-os/ersilia/issues/508#issuecomment-1869785133

My recommendations would be to:

- First, simply run these notebooks and see if you are able to reproduce these results and understand some of this code, since this is new to you.
- After that, refer to this comment by @GemmaTuron: https://github.com/ersilia-os/ersilia/issues/508#issuecomment-1365826188 and this comment: https://github.com/ersilia-os/ersilia/issues/508#issuecomment-1372117893 For this task, you would first need to install TDC and get familiar with using it (this is not hard at all and their documentation is a good starting point).

Please post your updates here when you try to do these two tasks. We will take on model incorporation after that.

leilayesufu commented 10 months ago

Good morning @GemmaTuron @DhanshreeA. When I try to run the AutoML notebook, it requests a pickle file that is unavailable. Here's my notebook: https://colab.research.google.com/drive/1ktHRR1LD_K3per6we_byb55Ea6iT279C?usp=sharing

GemmaTuron commented 10 months ago

Hi @leilayesufu, did you try to search for the pkl file, which is the actual model, in the author's repo?

leilayesufu commented 10 months ago

Yes, I did. It's not in the authors' repository.

DhanshreeA commented 10 months ago

@leilayesufu remember the authors have not shared any models. Their key contribution at least within that repo is just the dataset. Likely the pickle file is coming from another source.

DhanshreeA commented 10 months ago

Hi @leilayesufu in the AutoML notebook, (or in Autogluon as well, I am not sure), you can skip working with Word2Vec/mol2vec since we cannot confirm which embeddings file was used by the previous contributor, and currently mol2vec is archived and no longer actively maintained.

You can try featurizing the smiles using ersilia embeddings. You can use them as documented in this repository here: https://github.com/ersilia-os/compound-embedding-lite. Another thing you can try is using Morgan Fingerprints using RDKit. And you can train two automl models as done in the notebook, one using ersilia embeddings and another using Morgan. Please report the results here after.

I think you can do the same within the Autogluon notebook as well. If we have decent results, we can proceed, otherwise we can look at training a different model :D

GemmaTuron commented 10 months ago

Thanks @DhanshreeA that is a good suggestion - the mol2vec library is not well maintained and it will be difficult to keep using it. @leilayesufu let us know if you can follow the next steps or you need more guidance. I suggest looking at lazy-qsar library - your task is very similar to what @Richiio needs to work on so you can also join forces and work together in this!

leilayesufu commented 10 months ago

Thank you @DhanshreeA. I have implemented the two notebooks, featurizing the SMILES using Ersilia embeddings and using Morgan fingerprints. I also removed the Word2Vec/mol2vec step. https://colab.research.google.com/drive/1ktHRR1LD_K3per6we_byb55Ea6iT279C?usp=sharing https://colab.research.google.com/drive/14aicUQzaH_BG9YZLtVnJ1sB1EvVG02j6?usp=sharing

The model's performance is consistent across the different featurizations (Ersilia embeddings and Morgan fingerprints).

Regarding this comment, I have checked the overlap between the benchmark dataset from TDCommons and the model's data. There is an overlap of 6.31% as seen here

GemmaTuron commented 10 months ago

Hi @leilayesufu

Thanks for sharing this. Can you revise the results you are getting and provide an interpretation? Do you think we have a good model here? Maybe plotting an AUROC graph can also help in the interpretation.

Also please, in addition to running the notebooks you should explain what you are doing in each, where is the data coming from, what is the proportion of actives and inactives in your data, what is the split you are using for validation etc

leilayesufu commented 10 months ago

Notebook 1: FLAML AutoML Model

- Package installation: The notebook begins by installing the necessary packages with pip, including RDKit and FLAML.
- Data loading: Data is loaded from two CSV files, 'dilismiles.csv' and 'dili_padel_2d.csv'. The former contains Simplified Molecular Input Line Entry System (SMILES) strings, and the latter includes molecular descriptors and outcomes.
- Data preprocessing: RDKit is used to generate molecules from SMILES, and Morgan fingerprints and Ersilia embeddings are computed from the molecules. The target variable is 'Outcome', representing the binary classification task.
- AutoML training: FLAML's AutoML is employed for automated model training using K-Fold cross-validation (K=5). The model is evaluated on each fold, and the AUROC is printed.

Notebook 2: AutoGluon Model

- Package installation: The notebook starts by upgrading pip and setuptools and installing the necessary packages, including AutoGluon.
- Data loading: The 'dili_padel_2d.csv' file is loaded into a TabularDataset, and the features (X) and target variable ('Outcome') are defined.
- Data splitting: The dataset is split into training and testing sets using the train_test_split function from sklearn.
- AutoML training: AutoGluon's TabularPredictor is used to fit the model on the training data with a time limit and the 'best_quality' preset. The model performance is then evaluated on the test set, and various metrics, including accuracy, precision, recall, ROC AUC and F1 score, are printed.

Data Details:

Dataset: The final set for the creation of the machine-learning models contains 384 (66.8%) DILI-positive drugs and 191 (33.2%) DILI-negative drugs (total n = 575). The overall Tanimoto similarity index value was fairly low at 0.24, indicating a heterogeneous dataset based on the descriptors employed.

Interpretation

In evaluating the AutoGluon model, the reported performance metrics suggest a reasonably effective model, particularly in terms of recall and ROC AUC. The AutoGluon model's ROC AUC of 74.5% indicates good discriminatory power. On the other hand, the FLAML AutoML model's performance, as indicated by AUROC values across folds, appears less consistent. The values hover around 0.5, suggesting limited predictive ability.

DhanshreeA commented 10 months ago

Hi @leilayesufu

Thank you for the updates. Good work so far. I have a few comments:

I think we need additional information with respect to percent overlap. From your code, I can see that you are getting a percentage of common drugs relative to ALL the data combined (from the paper + TDC). Can you mention the absolute numbers as well? That is, how many drugs are there in TDC's dataset, how many drugs are there in the paper's dataset, and what percentage of drugs relative to the paper's dataset are common with TDC?
In your KFold split, can you specify a random state https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html such that all the model+descriptor scenarios are using the same splits? This would remove any additional room for doubt when comparing model results.
Finally, since drug activity datasets are generally imbalanced, could you look into using StratifiedKFold instead of simple KFold to train these models? https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
It also might be worthwhile looking into lazy-qsar as Gemma has pointed out.
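The two suggestions above (a fixed random state and stratified folds) can be sketched from scratch as follows. This is only an illustration of the idea, not scikit-learn's `StratifiedKFold` implementation; the function name `stratified_folds` and the toy labels are assumptions, with class counts taken from the 384 DILI-positive / 191 DILI-negative figures quoted earlier in this thread.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign sample indices to folds so each fold keeps a similar class ratio."""
    rng = random.Random(seed)  # fixed seed: the same folds on every run
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return [sorted(fold) for fold in folds]


# Toy labels with the class balance reported in this thread (1 = DILI-positive).
labels = [1] * 384 + [0] * 191
folds = stratified_folds(labels, n_splits=5, seed=42)
positive_fraction = [sum(labels[i] for i in fold) / len(fold) for fold in folds]
print([round(f, 3) for f in positive_fraction])
```

Because the seed is fixed, every model and descriptor combination evaluated on these folds sees exactly the same train/test partition, which is the point of the request.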

Thanks and please report your findings here.

leilayesufu commented 10 months ago

1) Number of drugs in the paper's dataset: 587. Number of drugs in TDC's dataset: 475. Number of common drugs: 67. Percentage of common drugs relative to the paper's dataset: 11.41%. This means that approximately 11.41% of the drugs in the paper's dataset (dilismiles.csv) are also present in the TDC dataset (dili_tdc_dataset.csv).
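As a quick check of the arithmetic, the sketch below recomputes the overlap against different denominators. The helper names are hypothetical, and matching by exact SMILES string is an assumption (structures written with different but equivalent SMILES would be missed unless they are canonicalized first).

```python
def count_common(paper_smiles, tdc_smiles):
    """Count structures present in both datasets by exact string match."""
    return len(set(paper_smiles) & set(tdc_smiles))


def overlap_report(n_paper, n_tdc, n_common):
    """Express the number of shared drugs relative to different denominators."""
    return {
        "pct_of_paper": round(100 * n_common / n_paper, 2),
        "pct_of_tdc": round(100 * n_common / n_tdc, 2),
        "pct_of_combined_rows": round(100 * n_common / (n_paper + n_tdc), 2),
    }


# Toy illustration of the counting step (not the real data).
toy_common = count_common(["CCO", "CCN", "CCC"], ["CCO", "CCCl"])

# Counts reported in this thread.
report = overlap_report(n_paper=587, n_tdc=475, n_common=67)
print(report)
```

With these counts, dividing by the paper's dataset gives 11.41%, while dividing by the rows of both datasets combined gives 6.31%, which matches the figure reported earlier in the thread.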

2) I have updated the notebook to use a random state of 42 in my K-fold, and I have also changed it from simple KFold to StratifiedKFold. https://github.com/leilayesufu/log_files/blob/master/Untitled.ipynb

3) I have also plotted the AUROC curves for the models.

GemmaTuron commented 10 months ago

Hi @leilayesufu

Please help me a bit here to understand the steps you are doing:

Put all the code and data you are using in the model repository
Make a nice analysis of the data: what is the % of actives and inactives in the dataset, are you balancing the split into train and test?
I am not sure why you are not using the PADEL descriptors as shared by the authors, any reason to go with the Morgan fingerprints instead?
Remember to always do a cross-validation (5-fold train/test split, so we can ensure model performance is consistent)
Once we have the final model, evaluated on the dataset, we can evaluate its performance on the TDC data (without the compounds that are already present in the training data)

leilayesufu commented 10 months ago

Good morning. I have uploaded the code into the models repository

The dataset consists of 66.8% DILI-positive drugs and 33.2% DILI-negative drugs. The AutoML notebook implements 5-fold cross-validation using StratifiedKFold. This ensures that the split maintains the same ratio of DILI-positive to DILI-negative instances in both the training and testing sets for each fold. https://github.com/leilayesufu/eos7e3s/blob/main/data_and_notebooks/Copy%20of%20Automl.ipynb

GemmaTuron commented 10 months ago

Hi @leilayesufu

Thanks for this. Some suggestions:

Please keep the folder structure of the repo. You can add the notebooks in /code and create a /data folder under framework
You should plot the 5-fold in a single graph, with the mean AUROC and the boundaries, like for example: https://stackoverflow.com/questions/57708023/plotting-the-roc-curve-of-k-fold-cross-validation
I am still missing the interpretation of the results and what would you do to improve them. How do they compare to what the authors get? Once all this is done, please push the code to the main repo so @DhanshreeA and myself can suggest edits more easily
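The single-graph request above boils down to interpolating each fold's ROC curve onto a shared false-positive-rate grid, then taking the mean and spread per grid point. Below is a dependency-free sketch of that computation; the function names and the two toy folds are illustrative assumptions, not the notebook's code. In a notebook, the returned lists could be drawn with matplotlib, for example a line for the mean and `fill_between` for the band.

```python
from statistics import mean, pstdev


def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping the decision threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def interpolate(x, xs, ys):
    """Linearly interpolate a non-decreasing curve (xs, ys) at position x."""
    for i in range(len(xs) - 1, -1, -1):  # last point with xs[i] <= x
        if xs[i] <= x:
            if i == len(xs) - 1 or xs[i] == x:
                return ys[i]
            x0, x1, y0, y1 = xs[i], xs[i + 1], ys[i], ys[i + 1]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return ys[0]


def auc(xs, ys):
    """Area under a curve by the trapezoidal rule."""
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))


def mean_roc(fold_results, grid_size=101):
    """Average per-fold ROC curves on a common FPR grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    curves = []
    for y_true, scores in fold_results:
        fpr, tpr = roc_points(y_true, scores)
        curve = [interpolate(x, fpr, tpr) for x in grid]
        curve[0] = 0.0  # every ROC curve starts at the origin
        curves.append(curve)
    mean_tpr = [mean(col) for col in zip(*curves)]
    std_tpr = [pstdev(col) for col in zip(*curves)]
    return grid, mean_tpr, std_tpr


# Two hypothetical folds: (true labels, predicted probability of class 1).
fold_results = [
    ([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]),
    ([1, 0, 1, 1, 0, 0], [0.95, 0.85, 0.6, 0.5, 0.3, 0.1]),
]
grid, mean_tpr, std_tpr = mean_roc(fold_results)
mean_auc = auc(grid, mean_tpr)
upper = [min(m + s, 1.0) for m, s in zip(mean_tpr, std_tpr)]
lower = [max(m - s, 0.0) for m, s in zip(mean_tpr, std_tpr)]
print(round(mean_auc, 3))
```

Note that the scores passed in should be predicted probabilities (or decision scores) for the positive class, not hard 0/1 predictions.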

leilayesufu commented 10 months ago

Good morning @GemmaTuron I have updated the repository here: https://github.com/leilayesufu/eos7e3s

leilayesufu commented 10 months ago

https://github.com/leilayesufu/eos7e3s/blob/main/model/framework/code/Copy%20of%20Automl.ipynb

leilayesufu commented 10 months ago

The above is for the AutoML model, while the AutoGluon notebook is here: https://colab.research.google.com/drive/14aicUQzaH_BG9YZLtVnJ1sB1EvVG02j6#scrollTo=x-_mFT1OKGvp The model is being trained, but I am unable to plot the graph or evaluate it because of an index error that keeps occurring no matter how I try to fix it.

GemmaTuron commented 10 months ago

Hi @leilayesufu

This is a good start! Please make sure to give the notebooks sensible names; we cannot have something named Copy ..., and it is also good practice to never leave spaces in file names. To be able to run the code, make sure that all the paths point to the right GitHub folder, not to files uploaded through Google. This way we can be sure that we are running the exact same code. For the index error, make sure to go step by step and print the different files, lists and dataframes that you are using, because it seems one of them is being altered at some point, as you get Nones.

Then, I still do not follow whether you tried AutoML with PaDEL descriptors or not, and, if you did, why you chose other descriptors. I also did not see that you increased the AutoML training time to something more sensible, like 10 minutes. Please make a summary when you have all these results. Good job so far, but we need to be sure about what we choose as the best final model, so let's pin down the results of the different tests! Thanks for the work.

leilayesufu commented 10 months ago

The original notebooks used mol2vec and the mol2vec library is not well maintained so @DhanshreeA suggested we use Morgan fingerprints. https://github.com/ersilia-os/ersilia/issues/931#issuecomment-1878535776

GemmaTuron commented 10 months ago

Hi @leilayesufu Sure, but take into account these notebooks do not come from the original authors, they were built by the first Ersilia contributor who was working on the model, so they should be a guideline only. I agree Mol2vec should not be used, but since the authors use Padel we'd like to use them as well unless there is a strong reason not to. So, both with autogluon and automl pipelines we should compare the performance with PADEL and other descriptors. Please make sure to use sufficient training time for autoML. I hope this is more clear now, but let me know if you need further guidance :)

leilayesufu commented 10 months ago

@GemmaTuron Good morning. After training the AutoGluon model multiple times, I am encountering an error when attempting to plot the AUC-ROC curve. Although I can successfully print the AUC value from the performance object, attempting to plot the curve results in a TypeError related to the index.

The error message indicates a TypeError: '(slice(None, None, None), 1)' is an invalid key. I find it perplexing because the AUC value obtained from performance is 0.5, suggesting that the model is making predictions.

https://github.com/leilayesufu/eos7e3s/blob/main/model/framework/code/01_model_training.ipynb

I tried to debug it the whole of yesterday, but I was faced with the same error.

GemmaTuron commented 10 months ago

Hi @leilayesufu

Please open a PR so I can work on the main repo. Before doing that, a few tips from yesterday's meeting:

Do not leave the pip install commands on the notebook, update the dockerfile with the requirements and the exact versions of the packages you are using - I have not seen an update on that and I want to make sure we install the same versions
Use informative file names (without blank spaces in the names). data_file for example is not very informative

Let me know once this is done so we can have a look.

leilayesufu commented 10 months ago

@GemmaTuron I'll work on that right now. I have a question about generating embeddings using this: https://github.com/ersilia-os/compound-embedding-lite Should I install it in the Dockerfile too?

GemmaTuron commented 10 months ago

Hi @leilayesufu

As this is not the final model, add all the necessary requirements in dockerfile, we will trim it down once we know which requirements we need for the final model to run. The compound embeddings are installable via pip install eosce

GemmaTuron commented 10 months ago

Hi @leilayesufu

Some pointers to continue the work:

Please work on the data processing notebook. Remember we discussed several options to check that the order of smiles is kept between files (there is a placeholder in one cell to fill in), you should choose one and implement it. Also, finish the data analysis, how many pos/neg there are, compare the dataset to the TDC dataset... Most importantly, any data processing must be done here, before going onto model training. I've spotted things like this one on the training notebook, which should be on the data processing:
```
#deleting incomplete cases rows
delrow = ["GalliumNitrate Hydrate","trichloroethylene", "Bromoethanamine", "sodiumbicarbonate", "carbontetrachloride", "Chloroform", "cadmiumchloride", "thioacetamide", "Probucol","dichloroethylene", "Hydrazine", "nitrosamine"]
```
what are these incomplete cases?
On the model training: work to fix the np error in the autogluon and revise the first part of the training. I find it surprising that we are getting a 0.5 AUROC, not the slightest signal. Once you are done, write up a good summary here of the results: the model performances you are getting, which model you would choose, how you would validate it...

leilayesufu commented 10 months ago

Model Performances:

AutoML with Morgan fingerprints:

The AutoML model trained with Morgan fingerprints using StratifiedKFold cross-validation produced the following average classification report across all folds:

- Precision: Class 0: 25%, Class 1: 67%
- Recall: Class 0: 4%, Class 1: 95%
- F1-score: Class 0: 6%, Class 1: 78%
- Accuracy: 65%
- Macro average F1-score: 42%
- Weighted average F1-score: 54%

The model performed well in predicting Class 1 (high recall) but struggled with Class 0. The mean AUC-ROC score of 0.56 suggests a modest ability of the model to distinguish between the classes.

AutoML with Ersilia embeddings:

The AutoML model trained with Ersilia embeddings, using a time budget of 600 seconds and logistic regression with extended iterations, produced the following average classification report across all folds:

- Precision: Class 0: 65%, Class 1: 69%
- Recall: Class 0: 12%, Class 1: 97%
- F1-score: Class 0: 21%, Class 1: 81%
- Accuracy: 69%
- Macro average F1-score: 51%
- Weighted average F1-score: 61%

The mean AUC-ROC score of 0.69 suggests an enhanced ability of the model to discriminate between the classes compared to the model with Morgan fingerprints.

AutoGluon with Morgan fingerprints:

The AutoGluon model trained with Morgan fingerprints using the TabularPredictor displayed the following performance metrics:

- Accuracy: 70.7%
- Balanced Accuracy: 58.94%
- Matthews Correlation Coefficient (MCC): 0.2678
- ROC AUC: 70.00%
- F1-score: 81.11%
- Precision: 70.87%
- Recall: 94.81%

Something to note: the ROC-AUC reported by the library's performance output was 0.6999666999666999, while when I tried to calculate it manually I got 0.5316455696202532.

AutoGluon with Ersilia embeddings:

The AutoGluon model trained with Ersilia embeddings using the TabularPredictor displayed the following performance metrics:

- Accuracy: 68.1%
- Balanced Accuracy: 50.0%
- Matthews Correlation Coefficient (MCC): 0.0
- ROC AUC: 50.0%
- F1-score: 81.03%
- Precision: 68.10%
- Recall: 100.0%

Model selection

Considering the overall performance, including accuracy, precision, recall, and F1 score, AutoML with Ersilia embeddings demonstrates strengths in class discrimination and achieves a high F1-score. AutoGluon with Morgan fingerprints also stands out as a promising model due to its accuracy, handling of imbalanced data, and discrimination ability; the only problem is the difference between the two ROC-AUC values.

Model validation

The AutoML model with Ersilia embeddings employs k-fold cross-validation during training. This approach entails dividing the dataset into k folds and iteratively training and validating the model on each fold.

leilayesufu commented 10 months ago

I've opened a pull request containing the model training notebook and the figures of the roc_auc curves and the confusion matrix

GemmaTuron commented 10 months ago

Hi @leilayesufu

Thanks for the report! I am still missing the comparison between both files (dilismiles and dilipadel) that makes sure the order between molecules is maintained in both - please do that in the designated cell #compare that the order of molecules remains the same between both files

I do not see that you are using the updated dili padel file, if you create and clean it up that is what you should be using.

Once the tasks above are completed, train the models with all the data and check how they do on the TDC dataset (with the repeated SMILES deleted from the TDC data); that will be our final validation of model performance.

leilayesufu commented 10 months ago

@GemmaTuron Good morning. I've created a PR. I retrained the models using the updated dilismileswithoutcome CSV file. Then I created a notebook for model validation, where I used the TDC dataset, excluding the compounds already in the training dataset, to validate the models.

My pending task: #compare that the order of molecules remains the same between both files. I was using a function to get the IUPAC name from SMILES and I wasn't quite getting it right. I'll use the STOUT method and update you on my progress.

leilayesufu commented 9 months ago

Good day. @GemmaTuron I wanted to inform you that the original authors shared an additional dataset containing both smiles and common names. You can access it here. I used this dataset to compare the other two datasets (the one with just smiles and the one with just CommonName) and the smiles correlate to the common name.
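The order check described above can be sketched as a row-by-row lookup against the file that holds both columns. The function name, argument names and toy molecules below are illustrative assumptions and are not taken from the actual datasets.

```python
def check_same_order(smiles_rows, name_rows, name_to_smiles):
    """Return the row positions where the two files disagree.

    smiles_rows: SMILES strings in the order of the SMILES-only file
    name_rows: common names in the order of the descriptor file
    name_to_smiles: lookup built from the file that contains both columns
    """
    if len(smiles_rows) != len(name_rows):
        raise ValueError("files have different numbers of rows")
    mismatches = []
    for position, (smiles, name) in enumerate(zip(smiles_rows, name_rows)):
        if name_to_smiles.get(name) != smiles:
            mismatches.append(position)
    return mismatches


# Toy data standing in for the three files.
name_to_smiles = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "ethanol": "CCO",
}
names = ["aspirin", "caffeine", "ethanol"]
aligned = [name_to_smiles[n] for n in names]
shuffled = [aligned[1], aligned[0], aligned[2]]

print(check_same_order(aligned, names, name_to_smiles))
print(check_same_order(shuffled, names, name_to_smiles))
```

An empty list means every row of the SMILES file corresponds to the same compound as the matching row of the descriptor file, so the outcome column can safely be shared between them.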

I trained the AutoML model with the provided padel descriptors, Morgan fingerprints, and Ersilia embeddings with 5 folds. Additionally, I trained the autoGluon model using padel descriptors, Morgan fingerprints, and Ersilia embeddings.

AutoML with ersilia embeddings:

Mean ROC- 0.69
Average Classification Report Across All Folds:

              precision    recall  f1-score   support

           0       0.65      0.12      0.21       194
           1       0.69      0.97      0.81       394

    accuracy                           0.69       588
   macro avg       0.67      0.55      0.51       588
weighted avg       0.68      0.69      0.61       588

AutoML with Morgan fingerprints

Mean ROC- 0.56
Average Classification Report Across All Folds:

              precision    recall  f1-score   support

           0       0.25      0.04      0.06       194
           1       0.67      0.95      0.78       394

    accuracy                           0.65       588
   macro avg       0.46      0.49      0.42       588
weighted avg       0.53      0.65      0.54       588

Autogluon with morgan fingerprints

{'accuracy': 0.6271186440677966, 'balanced_accuracy': 0.5750674258315853, 'mcc': 0.17496736213265582, 'f1': 0.728395061728395, 'precision': 0.6483516483516484, 'recall': 0.8309859154929577}
ROC: 0.64

Autogluon with ersilia embeddings

{'accuracy': 0.6016949152542372, 'balanced_accuracy': 0.5, 'mcc': 0.0, 'roc_auc': 0.5, 'f1': 0.7513227513227513, 'precision': 0.6016949152542372, 'recall': 1.0}

AutoML with padel descriptors

Mean Roc: 0.66
Average Classification Report Across All Folds:

              precision    recall  f1-score   support

           0       0.55      0.24      0.33       194
           1       0.71      0.90      0.79       394

    accuracy                           0.68       588
   macro avg       0.63      0.57      0.56       588
weighted avg       0.65      0.68      0.64       588

Autogluon with padel descriptors

Roc: 0.71
{'accuracy': 0.7796610169491526, 'balanced_accuracy': 0.6637004078605858, 'mcc': 0.37672523793906193, 'roc_auc': 0.7122728958101594, 'f1': 0.8586956521739131, 'precision': 0.8144329896907216, 'recall': 0.9080459770114943}
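The AutoGluon run with Ersilia embeddings above (balanced accuracy 0.5, MCC 0.0, recall 1.0) has the signature of a model that predicts the positive class for every compound. The sketch below reproduces those figures from a confusion matrix. The counts of 71 positives and 47 negatives are inferred from the reported precision (0.6017, which equals 71/118) under the assumption of a 118-compound test split; they are not stated in the thread.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": mcc,
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }


# Every compound predicted as DILI-positive (inferred counts, see above).
metrics = binary_metrics(tp=71, fp=47, tn=0, fn=0)
print(metrics)
```

Accuracy and F1 look acceptable here only because about 60% of the compounds are positive, which is why balanced accuracy and MCC are the more informative numbers on this dataset.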

GemmaTuron commented 9 months ago

Hi @leilayesufu Thanks for the update. The best combination indeed seems the autogluon with padel descriptors. In that case, we'll need to try out the pypadel package (you can look at the implementation in the Ersilia Model Hub) to check if we obtain the same descriptors that the authors use. So, next steps:

try to convert smiles to PADEL (since we won't have the descriptors precalculated for new inputs) and compare that we are obtaining the exact same descriptors and in the same order as the ones used in model training
train a final model with all the data (train and test) and save it - re use parameters from autogluon
Predict the result on the test set from TDC to see how well the model performs in a new chemical space
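For the first step above, a small header comparison can show whether freshly computed descriptors match the training file in both content and order. This is a sketch under assumptions: `compare_columns` is a hypothetical helper and the column names are placeholders, not real descriptor names.

```python
def compare_columns(reference, candidate):
    """Compare two ordered lists of descriptor column names."""
    ref_set, cand_set = set(reference), set(candidate)
    shared_in_reference_order = [c for c in reference if c in cand_set]
    shared_in_candidate_order = [c for c in candidate if c in ref_set]
    return {
        "missing": [c for c in reference if c not in cand_set],
        "extra": [c for c in candidate if c not in ref_set],
        "shared_same_order": shared_in_reference_order == shared_in_candidate_order,
        "identical": list(reference) == list(candidate),
    }


# Placeholder headers: the training file versus newly computed descriptors.
training_columns = ["desc_a", "desc_b", "desc_c"]
computed_columns = ["desc_b", "desc_a", "desc_d"]
result = compare_columns(training_columns, computed_columns)
print(result)
```

If only the order differs, the computed columns can be reindexed to the training order before prediction; missing columns mean the trained model cannot be reused as is.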

leilayesufu commented 9 months ago

@GemmaTuron Good evening. I used padelpy to convert SMILES to PaDEL descriptors, and I got the descriptors in the file below.
descriptors.csv Unfortunately, they are not exactly the same descriptors, nor in the same order, as the ones in the PaDEL file provided. dili_padel_2d.csv

I also retrained AutoML with Ersilia embeddings. I changed modelAutoML = AutoML(task="classification", time_budget=600, logistic_max_iter=40000) to modelAutoML = AutoML(task="classification", time_budget=600, logistic_max_iter=80000) and got better results, almost the same as AutoGluon with PaDEL descriptors.

Mean roc across folds: 0.70
Average Classification Report Across All Folds:

              precision    recall  f1-score   support

           0       0.62      0.10      0.18       194
           1       0.69      0.97      0.80       394

    accuracy                           0.68       588
   macro avg       0.66      0.54      0.49       588
weighted avg       0.67      0.68      0.60       588

I would suggest either using AutoML with Ersilia embeddings, or converting all the SMILES to the new PaDEL descriptors with padelpy and training AutoGluon on them to see its performance.

GemmaTuron commented 9 months ago

Hi @leilayesufu

Good, let's try padelpy with AutoGluon and see if that is the best performance we can get; otherwise we might go for the Ersilia embeddings!

miquelduranfrigola commented 9 months ago

Agreed

leilayesufu commented 9 months ago

@GemmaTuron A few reports from trying to run padelpy with AutoGluon. When I ran the padelpy script on the entire SMILES CSV file, I encountered the following error: Error: PaDEL-Descriptor failed on one or more mols. Ensure the input structures are correct. This indicates that padelpy couldn't calculate the descriptors for some SMILES, which could be a problem because the outcome column is in the same order as the SMILES. I modified the script to remove the SMILES that padelpy couldn't process and tried running it several times, and I got the following error:

RuntimeError: PaDEL-Descriptor encountered an error: PaDEL-Descriptor timed out during subprocess call

From my research, this is due to either the complexity of the molecular structures, the size of the dataset, or resource limitations. I then tried the script with the first 10 SMILES from the dilismiles CSV file, and I could successfully get the PaDEL descriptors for those 10: descriptors_with_outcome_subset.csv

So I tried testing AutoGluon with those first 10 descriptors. The model trained and saved, but I couldn't calculate the ROC curve because of this error from AutoGluon: ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

I will try getting the descriptors for the first 30 SMILES from the padelpy script, then retrain the model and measure its performance.

However, I wouldn't recommend this model, because we don't know which SMILES the padelpy script will be able to compute descriptors for, and if a user passes a large CSV file for testing, padelpy might return a runtime error.

GemmaTuron commented 9 months ago

Hi @leilayesufu

The issue with calculating Padel descriptors for so many molecules should be solved with a batch approach, where you pass batches of smiles and keep appending them to a file. It is normal that some cannot be calculated, so you need to add a check for that the same way we do when converting smiles to mols in rdkit. Please work on that and share your suggestions.

miquelduranfrigola commented 9 months ago

Agree. Doing batches of 100 molecules would be perfectly fine.

As an aside, let's keep in mind that if Padel descriptors turn out to be problematic, we can always shift to Mordred descriptors. But I agree that we should first try Padel.
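The batch approach suggested above can be sketched as a loop that featurizes fixed-size chunks, falls back to one molecule at a time when a chunk fails, and records failures without shifting row positions. The `featurize` argument is a placeholder for whatever descriptor call is used; with padelpy it could wrap `padelpy.from_smiles`, which is an assumption about the contributor's script. The toy featurizer below only stands in for it.

```python
def featurize_in_batches(smiles_list, featurize, batch_size=100):
    """Featurize SMILES in batches, isolating molecules that fail.

    featurize: callable taking a list of SMILES and returning one record
    per input, raising an exception if any molecule cannot be processed.
    Returns (results, failed); results[i] is None for failed inputs, so
    rows stay aligned with the outcome column.
    """
    results = [None] * len(smiles_list)
    failed = []
    for start in range(0, len(smiles_list), batch_size):
        batch = smiles_list[start:start + batch_size]
        try:
            records = featurize(batch)
            for offset, record in enumerate(records):
                results[start + offset] = record
        except Exception:
            # Retry one molecule at a time so a single bad structure
            # does not discard the whole batch.
            for offset, smiles in enumerate(batch):
                try:
                    results[start + offset] = featurize([smiles])[0]
                except Exception:
                    failed.append(start + offset)
    return results, failed


def toy_featurize(batch):
    """Stand-in featurizer: fails on the literal string 'bad'."""
    if any(s == "bad" for s in batch):
        raise ValueError("descriptor calculation failed")
    return [{"length": len(s)} for s in batch]


smiles = ["CCO", "bad", "CCN", "CCC", "C"]
results, failed = featurize_in_batches(smiles, toy_featurize, batch_size=2)
print(results, failed)
```

Keeping a `None` placeholder for each failed molecule, rather than dropping the row, is a design choice that preserves the correspondence between descriptors and outcomes noted as a concern earlier in the thread.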

leilayesufu commented 9 months ago

@GemmaTuron Good morning. Doing it in batches of 100 still gave me that error, so I had to do it in batches of 30 and concatenate the results. padel_descriptors.csv

I got the following results.

{'accuracy': 0.6470588235294118, 'balanced_accuracy': 0.5457875457875457, 'mcc': 0.18913360395900078, 'roc_auc': 0.7225274725274725, 'f1': 0.7735849056603773, 'precision': 0.640625, 'recall': 0.9761904761904762}

After a thorough evaluation of model results across different descriptor types, the AutoML model with Ersilia embeddings still stands out as the preferred choice. It demonstrates a balanced precision-recall trade-off and consistent accuracy across folds.

I have created a pull request. https://github.com/ersilia-os/eos7e3s/pull/6

leilayesufu commented 9 months ago

@GemmaTuron Good evening.

Testing both the Ersilia embeddings with AutoML and the PaDEL descriptors with AutoGluon on the TDC dataset, I also had to process the SMILES in batches of 10 molecules to get the PaDEL descriptors for AutoGluon. I successfully used the models to get the two ROC-AUC curves in the model_validations notebook: Ersilia embeddings with AutoML reached 0.83, while PaDEL with AutoGluon reached 0.82. https://github.com/leilayesufu/eos7e3s/blob/main/model/framework/code/02_model_validation.ipynb

I have opened a PR

GemmaTuron commented 9 months ago

Thanks Leila! Should we go for the Ersilia embeddings given that they are much faster/easier to calculate, even if the original authors used the PaDEL descriptors? What are your thoughts @DhanshreeA and @miquelduranfrigola?

leilayesufu commented 9 months ago

Yes, I think we should implement AutoML with Ersilia embeddings. This decision is based on their faster and easier calculation compared to PaDEL descriptors, with similar results.

DhanshreeA commented 9 months ago

@GemmaTuron and @leilayesufu as a final experiment, I think it would be nice to use Mordred descriptors and see how the results compare with the experiments performed so far. https://github.com/mordred-descriptor/mordred

I think you can try both AutoML and Autogluon with Mordred, and compare the results across the same metrics you have been using so far. Please also note how long it takes to calculate these descriptors for 1, 10, 100 smiles. And after this we can take a final call as to which model to incorporate within the hub.

As a side note, your model can make it to the TDC leaderboard https://tdcommons.ai/benchmark/admet_group/22dili/ :sweat_smile:
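The timing comparison requested above (1, 10 and 100 SMILES) can be sketched with a small harness around any featurizer. Here `time_featurizer` and `toy_featurize` are hypothetical names, and the toy function only stands in for a real descriptor calculator.

```python
import time


def time_featurizer(featurize, smiles_pool, sizes=(1, 10, 100)):
    """Return {n: seconds} for featurizing the first n SMILES of the pool."""
    timings = {}
    for n in sizes:
        subset = smiles_pool[:n]
        start = time.perf_counter()
        featurize(subset)
        timings[n] = time.perf_counter() - start
    return timings


def toy_featurize(batch):
    """Stand-in for a real descriptor calculation."""
    return [[len(s), s.count("C")] for s in batch]


smiles_pool = ["CCO"] * 100
timings = time_featurizer(toy_featurize, smiles_pool)
for n, seconds in timings.items():
    print(f"{n} SMILES: {seconds:.6f} s")
```

Running the same harness with each candidate featurizer gives directly comparable wall-clock numbers for the final decision.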

GemmaTuron commented 9 months ago

@DhanshreeA do you refer to Mordred or Morgan? Mordred is quite slow, is the only issue (for the TDC leaderboard you actually need to use their data and their train test splits to train the model, here we are just using the dataset to validate performance)

miquelduranfrigola commented 9 months ago

I think we can go with Ersilia Embeddings!

GemmaTuron commented 9 months ago

Let's go with that then, @leilayesufu train the final model and save it so that it runs with Ersilia. Before opening the PR make sure to clean up all old code and notebooks and leave only the relevant ones

leilayesufu commented 9 months ago

I have created a PR
