ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Prediction of clinically relevant drug-induced liver injury from structure using machine learning #931

Closed leilayesufu closed 8 months ago

leilayesufu commented 11 months ago

Model Name

Drug-induced liver injury prediction

Model Description

Prediction of clinically relevant drug-induced-liver-injury (DILI), based solely on drug structure using binary classification methods. The results presented here are useful as a screening tool both in a clinical setting, in the assessment of DILI as well as in the early stages of drug development to rule out potentially hepatotoxic candidates.

Slug

dili-pred

Tag

Metabolism, Toxicity, Cytotoxicity

Publication

https://pubmed.ncbi.nlm.nih.gov/30325042/

Source Code

https://github.com/cptbern/QSAR_DILI_2019

License

None

GemmaTuron commented 10 months ago

Hi @leilayesufu, I am working on the code and I have a few questions:

leilayesufu commented 10 months ago

Hi @GemmaTuron, I used this code to generate the descriptors.

import csv
from padelpy import from_smiles

def process_csv(input_csv, output_csv):
    with open(input_csv, 'r') as file:
        reader = csv.reader(file)
        data = list(reader)

    smiles_to_process = [row[0] for row in data]

    # Compute descriptors once per SMILES, skipping any that padelpy cannot parse
    results = []
    for smiles in smiles_to_process:
        try:
            descriptors = from_smiles(smiles)
        except Exception:
            continue
        results.append((smiles, descriptors))

    with open(output_csv, 'w', newline='') as file:
        writer = csv.writer(file)
        for smiles, descriptors in results:
            # from_smiles returns a dict of descriptor name -> value
            writer.writerow([smiles] + list(descriptors.values()))

if __name__ == "__main__":
    input_csv = 'input.csv'
    output_csv = 'output.csv'
    process_csv(input_csv, output_csv)

I did this in sections and concatenated the results when I was done.

These were my results: AutoML with PaDEL scored 0.66, while AutoGluon with PaDEL scored 0.71.

(Screenshot of results, 2024-01-29)

I was going through the notebooks and came across this line of code: model = lq.ErsiliaBinaryClassifier(time_budget_sec=600, estimator_list=["rf", "lgbm", "xgboost"])

But when I ran it initially, I used StratifiedKFold cross-validation with a time budget of 1200.
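For reference, a minimal self-contained sketch of stratified k-fold cross-validation scored by AUROC. This uses a plain scikit-learn RandomForest on synthetic data as a stand-in for the lazyqsar classifier, so the model and data here are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary-classification data in place of the real descriptor matrix
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# StratifiedKFold preserves the class balance in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="roc_auc")
print(f"AUROC: {scores.mean():.2f} +/- {scores.std():.2f}")
```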

GemmaTuron commented 10 months ago

Hi @leilayesufu, when I use padelpy some values are too large and I have had to impute them; otherwise they cannot be used for model training. I am surprised you did not face the same issue. For the Ersilia and Mordred descriptors I am testing lazyqsar, which is basically the same as AutoML, for consistency with other models in the hub. I've set a reduced time budget (600 s instead of 1200 s) to prevent overfitting, but we can try the longer budget as well if we think it will improve the results significantly. Please revise the portion of code that deals with SMILES processing to make sure you follow this logic for future models. Thanks!
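As an illustration of the imputation step mentioned above, one common approach is to flag non-finite or implausibly large descriptor values as missing and fill them with the column median. This is a hedged sketch, not the exact code used; the 1e8 cutoff is an arbitrary assumption:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy descriptor matrix with one overflow value and one infinity
X = np.array([[1.0, 2.0],
              [1e12, 3.0],
              [2.0, np.inf]])

X[~np.isfinite(X)] = np.nan       # infinities -> missing
X[np.abs(X) > 1e8] = np.nan       # implausibly large descriptors -> missing

# Replace missing entries with the per-column median
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
print(X_imputed)
```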

leilayesufu commented 10 months ago

@GemmaTuron I changed the training time from 600 s to 1200 s and still got around the same results. I have followed the logic you set, applied it to the remaining models, and created a PR.

GemmaTuron commented 9 months ago

Hi @leilayesufu

I see some things I do not understand in your code, for example, what is the catboost folder for?

leilayesufu commented 9 months ago

@GemmaTuron That folder was created when I was trying to run the AutoML; I have updated the PR.

GemmaTuron commented 9 months ago

Hi @leilayesufu

A few pointers to continue making the work easier to follow by everyone, hope you find them useful!

GemmaTuron commented 9 months ago

Other pointers I am finding while working on the code

When you are obtaining embeddings or fingerprints and want a dataframe, you need a single column per datapoint. tdc_data['embeddings'] = tdc_data['smiles'].apply(lambda x: model.transform([x])[0]) creates a single "embeddings" column holding whole vectors, whereas you'd like to have one column per dimension, something like: X_train = pd.DataFrame(X_train, columns=["eosce_{}".format(i) for i in range(len(X_train[0]))])
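The suggestion above can be sketched in a few self-contained lines (the vectors and the "eosce_" column prefix are illustrative, not real embeddings):

```python
import pandas as pd

# Pretend these are per-molecule embedding vectors
rows = [[0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]]

# One column per embedding dimension, instead of one column of lists
X_train = pd.DataFrame(rows, columns=["eosce_{}".format(i) for i in range(len(rows[0]))])
print(X_train.columns.tolist())  # ['eosce_0', 'eosce_1', 'eosce_2']
```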

GemmaTuron commented 9 months ago

Hi @leilayesufu !

I have finished the model refactoring; changes are pushed to the model repo. Please revise the graphs I have produced and check whether they make sense. I am still not convinced: the authors report an AUROC of 0.89 and we cannot get to these values, and the models consistently perform better on the test sets than on the train-test splits, which is unusual.

leilayesufu commented 9 months ago

@GemmaTuron I have gone through the notebooks; I think the AutoGluon models perform better than the AutoML models. Ersilia with AutoGluon has an AUROC of 0.68, and PaDEL with AutoGluon has an AUROC of 0.69. I also noticed that validation on the test sets performed significantly better when I was running it.

Since the external validation on a separate dataset achieved an AUROC above 0.80, this suggests that the model performs well on unseen data.
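For completeness, external validation of this kind can be sketched with synthetic data; the dataset, split, and model below are placeholders, not the actual DILI sets or the thread's classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training set and a held-out "external" set
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_ext, y_tr, y_ext = train_test_split(X, y, test_size=0.3,
                                            stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# AUROC on data the model never saw during training
auroc = roc_auc_score(y_ext, clf.predict_proba(X_ext)[:, 1])
print(round(auroc, 2))
```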

GemmaTuron commented 9 months ago

Let's try one last thing to see if we can improve the cross-validation exercise:

  1. Apply robust scaling to Padel descriptors
  2. Reduce Padel descriptors to 100 (dimensionality reduction) - select K best

Please make sure to not modify the current test-train split (do not run the cell again)

miquelduranfrigola commented 9 months ago

As a comment to the above, the simple sklearn classes RobustScaler and SelectKBest should be enough.

leilayesufu commented 9 months ago

@GemmaTuron Good afternoon. I used both f_classif and mutual_info_classif with k=100 and k=500, on both AutoML and AutoGluon, following https://github.com/ersilia-os/ersilia/issues/931#issuecomment-1932384872

The results I got were still around the range of the results we have been getting.

This is a snippet:

    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.preprocessing import RobustScaler

    # Step 1: Scale Padel descriptors robustly to outliers
    scaler = RobustScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Step 2: Reduce Padel descriptors to 100 using SelectKBest
    k_best_selector = SelectKBest(score_func=mutual_info_classif, k=100)
    X_train_selected = k_best_selector.fit_transform(X_train_scaled, y_train)
    X_test_selected = k_best_selector.transform(X_test_scaled)

For the AutoML:

  - f_classif, 100 features: 0.68 ± 0.03
  - f_classif, 500 features: 0.66 ± 0.04
  - mutual_info_classif, 100 features: 0.67 ± 0.05
  - mutual_info_classif, 500 features: 0.67 ± 0.03

For the AutoGluon:

  - f_classif, 100 features: 0.68 ± 0.05
  - f_classif, 500 features: 0.68 ± 0.03
  - mutual_info_classif, 100 features: 0.66 ± 0.05
  - mutual_info_classif, 500 features: 0.68 ± 0.03

I have created a PR with the figures.

miquelduranfrigola commented 9 months ago

Thanks @leilayesufu. This is useful. @GemmaTuron it seems we are reaching a dead-end here. What is your opinion?

GemmaTuron commented 9 months ago

Yes, I'll write to the authors requesting the original checkpoints, and if we do not get them I'd remove this model, unfortunately.

GemmaTuron commented 9 months ago

Should we archive this model @DhanshreeA @miquelduranfrigola ?

miquelduranfrigola commented 8 months ago

I think we hit a dead end here. I would archive this model. Note that archiving repos may be inconvenient at times since we'll lose most git functionality. Should we just label this as archived in AirTable?

GemmaTuron commented 8 months ago

I have added the archived label in Airtable. In this case, I'd really archive the repo as we will not work on it any more. I'll close this issue as well!