ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Prediction of clinically relevant drug-induced liver injury from structure using machine learning #931

Closed leilayesufu closed 8 months ago

leilayesufu commented 11 months ago

Model Name

Drug-induced liver injury prediction

Model Description

Prediction of clinically relevant drug-induced-liver-injury (DILI), based solely on drug structure using binary classification methods. The results presented here are useful as a screening tool both in a clinical setting, in the assessment of DILI as well as in the early stages of drug development to rule out potentially hepatotoxic candidates.

Slug

dili-pred

Tag

Metabolism, Toxicity, Cytotoxicity

Publication

https://pubmed.ncbi.nlm.nih.gov/30325042/

Source Code

https://github.com/cptbern/QSAR_DILI_2019

License

None

GemmaTuron commented 10 months ago

Hi @leilayesufu, I am working on the code and I have a few questions:

leilayesufu commented 10 months ago

Hi @GemmaTuron, I used this code to generate the descriptors.

import csv
from padelpy import from_smiles

def process_csv(input_csv, output_csv):
    with open(input_csv, 'r') as file:
        reader = csv.reader(file)
        data = list(reader)

    smiles_to_process = [row[0] for row in data]

    # Compute descriptors once per SMILES, skipping any that padelpy cannot parse
    results = []
    for smiles in smiles_to_process:
        try:
            descriptors = from_smiles(smiles)
        except Exception:
            continue
        results.append((smiles, descriptors))

    with open(output_csv, 'w', newline='') as file:
        writer = csv.writer(file)
        for smiles, descriptors in results:
            # from_smiles returns a dict of descriptor name -> value
            writer.writerow([smiles] + list(descriptors.values()))

if __name__ == "__main__":
    input_csv = 'input.csv'
    output_csv = 'output.csv'
    process_csv(input_csv, output_csv)

I did this in sections and concatenated the results when I was done.

These were my results: AutoML with PaDEL scored 0.66, while AutoGluon with PaDEL scored 0.71.

(Screenshot of results, 2024-01-29)

I was going through the notebooks and came across this line of code: model = lq.ErsiliaBinaryClassifier(time_budget_sec=600, estimator_list=["rf", "lgbm", "xgboost"])

But when I ran it initially, I used StratifiedKFold cross-validation with a time budget of 1200.
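For reference, a minimal self-contained sketch of stratified k-fold cross-validation scored by AUROC. This uses a plain scikit-learn RandomForest on synthetic data as a stand-in for the lazyqsar classifier, so the model and data here are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary-classification data in place of the real descriptor matrix
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# StratifiedKFold preserves the class balance in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="roc_auc")
print(f"AUROC: {scores.mean():.2f} +/- {scores.std():.2f}")
```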

GemmaTuron commented 10 months ago

Hi @leilayesufu, when I use padelpy some values are too large and I have had to impute them; otherwise they cannot be used for model training. I am surprised you did not face the same issue. For the Ersilia and Mordred descriptors I am testing lazyqsar, which is basically the same as AutoML, for consistency with other models in the hub. I've set a reduced time budget (600 s instead of 1200 s) to prevent overfitting, but we can try the longer budget as well if we think it will improve the results significantly. Please revise the portion of code that deals with SMILES processing to make sure you follow this logic for future models. Thanks!
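As an illustration of the imputation step mentioned above, one common approach is to flag non-finite or implausibly large descriptor values as missing and fill them with the column median. This is a hedged sketch, not the exact code used; the 1e8 cutoff is an arbitrary assumption:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy descriptor matrix with one overflow value and one infinity
X = np.array([[1.0, 2.0],
              [1e12, 3.0],
              [2.0, np.inf]])

X[~np.isfinite(X)] = np.nan       # infinities -> missing
X[np.abs(X) > 1e8] = np.nan       # implausibly large descriptors -> missing

# Replace missing entries with the per-column median
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
print(X_imputed)
```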

leilayesufu commented 10 months ago

@GemmaTuron I changed the training time from 600 s to 1200 s and still got around the same results. I have followed the logic you set, applied it to the remaining models, and created a PR.

GemmaTuron commented 9 months ago

Hi @leilayesufu

I see some things I do not understand in your code, for example, what is the catboost folder for?

leilayesufu commented 9 months ago

@GemmaTuron That folder was created when I was trying to run the AutoML; I have updated the PR.

GemmaTuron commented 9 months ago

Hi @leilayesufu

A few pointers to continue making the work easier to follow by everyone, hope you find them useful!

GemmaTuron commented 9 months ago

Other pointers I am finding while working on the code

When you are obtaining embeddings or fingerprints and want a dataframe, you need a single column per datapoint. tdc_data['embeddings'] = tdc_data['smiles'].apply(lambda x: model.transform([x])[0]) creates a single "embeddings" column holding whole vectors, whereas you'd like to have one column per dimension, something like: X_train = pd.DataFrame(X_train, columns=["eosce_{}".format(i) for i in range(len(X_train[0]))])
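The suggestion above can be sketched in a few self-contained lines (the vectors and the "eosce_" column prefix are illustrative, not real embeddings):

```python
import pandas as pd

# Pretend these are per-molecule embedding vectors
rows = [[0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]]

# One column per embedding dimension, instead of one column of lists
X_train = pd.DataFrame(rows, columns=["eosce_{}".format(i) for i in range(len(rows[0]))])
print(X_train.columns.tolist())  # ['eosce_0', 'eosce_1', 'eosce_2']
```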

GemmaTuron commented 9 months ago

Hi @leilayesufu !

I have finished the model refactoring; changes are pushed to the model repo. Please revise the graphs I have produced and check whether they make sense. I am still not convinced: the authors report an AUROC of 0.89 and we cannot get to these values, and the models consistently perform better on the test sets than on the train-test splits, which is unusual.

leilayesufu commented 9 months ago

@GemmaTuron I have gone through the notebooks; I think the AutoGluon models perform better than the AutoML models. Ersilia with AutoGluon has an AUROC of 0.68, and PaDEL with AutoGluon has an AUROC of 0.69. I also noticed that validation on the test sets performed significantly better when I was running it.

Since the external validation on a separate dataset achieved an AUROC above 0.80, this suggests that the model performs well on unseen data.
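For completeness, external validation of this kind can be sketched with synthetic data; the dataset, split, and model below are placeholders, not the actual DILI sets or the thread's classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training set and a held-out "external" set
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_ext, y_tr, y_ext = train_test_split(X, y, test_size=0.3,
                                            stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# AUROC on data the model never saw during training
auroc = roc_auc_score(y_ext, clf.predict_proba(X_ext)[:, 1])
print(round(auroc, 2))
```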

GemmaTuron commented 9 months ago

Let's try one last thing to see if we can improve the cross-validation exercise:

  1. Apply robust scaling to Padel descriptors
  2. Reduce Padel descriptors to 100 (dimensionality reduction) - select K best

Please make sure to not modify the current test-train split (do not run the cell again)

miquelduranfrigola commented 9 months ago

As a comment to the above, the simple sklearn classes RobustScaler and SelectKBest should be enough.

leilayesufu commented 9 months ago

@GemmaTuron Good afternoon. I used both f_classif and mutual_info_classif with k=100 and k=500, on both AutoML and AutoGluon, following https://github.com/ersilia-os/ersilia/issues/931#issuecomment-1932384872

The results I got were still around the range of the results we have been getting.

This is a snippet:

    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.preprocessing import RobustScaler

    # Step 1: Scale Padel descriptors robustly to outliers
    scaler = RobustScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Step 2: Reduce Padel descriptors to 100 using SelectKBest
    k_best_selector = SelectKBest(score_func=mutual_info_classif, k=100)
    X_train_selected = k_best_selector.fit_transform(X_train_scaled, y_train)
    X_test_selected = k_best_selector.transform(X_test_scaled)

For the AutoML:

  - f_classif, 100 features: 0.68 ± 0.03
  - f_classif, 500 features: 0.66 ± 0.04
  - mutual_info_classif, 100 features: 0.67 ± 0.05
  - mutual_info_classif, 500 features: 0.67 ± 0.03

For the AutoGluon:

  - f_classif, 100 features: 0.68 ± 0.05
  - f_classif, 500 features: 0.68 ± 0.03
  - mutual_info_classif, 100 features: 0.66 ± 0.05
  - mutual_info_classif, 500 features: 0.68 ± 0.03

I have created a PR with the figures.

miquelduranfrigola commented 9 months ago

Thanks @leilayesufu. This is useful. @GemmaTuron it seems we are reaching a dead-end here. What is your opinion?

GemmaTuron commented 9 months ago

Yes, I'll write to the authors requesting the original checkpoints, and if we do not get them I'd remove this model, unfortunately.

GemmaTuron commented 9 months ago

Should we archive this model @DhanshreeA @miquelduranfrigola ?

miquelduranfrigola commented 8 months ago

I think we hit a dead end here. I would archive this model. Note that archiving repos may be inconvenient at times since we'll lose most git functionality. Should we just label this as archived in AirTable?

GemmaTuron commented 8 months ago

I have added the archived label in Airtable. In this case, I'd really archive the repo as we will not work on it any more. I'll close this issue as well!