Hi @leilayesufu, I am working on the code and I have a few questions:
Hi @GemmaTuron, I used this code to generate the descriptors:
import csv
from padelpy import from_smiles

def process_csv(input_csv, output_csv):
    # Read the SMILES from the first column of the input file
    with open(input_csv, 'r') as file:
        reader = csv.reader(file)
        smiles_to_process = [row[0] for row in reader]

    # Keep only the SMILES that PaDEL can featurize, computing the
    # descriptors once per molecule instead of twice
    results = []
    for smiles in smiles_to_process:
        try:
            # from_smiles returns an ordered mapping of descriptor name -> value
            descriptors = from_smiles(smiles)
            results.append((smiles, descriptors))
        except Exception:
            pass  # skip molecules that PaDEL fails on

    with open(output_csv, 'w', newline='') as file:
        writer = csv.writer(file)
        for smiles, descriptors in results:
            writer.writerow([smiles] + list(descriptors.values()))

if __name__ == "__main__":
    input_csv = 'input.csv'
    output_csv = 'output.csv'
    process_csv(input_csv, output_csv)
I ran this in sections and concatenated the outputs when I was done; a sketch of the concatenation step is below.
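For reference, a minimal sketch of that concatenation step, assuming the per-section outputs were written to hypothetical files named output_part_*.csv:

import glob
import pandas as pd

# Collect the per-section outputs in order and stack them into one file
parts = sorted(glob.glob("output_part_*.csv"))
combined = pd.concat((pd.read_csv(p, header=None) for p in parts), ignore_index=True)
combined.to_csv("output.csv", index=False, header=False)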
These were my results: AutoML with PaDEL had an AUROC of 0.66, while AutoGluon with PaDEL had 0.71.
I was going through the notebooks and I came across this line of code:
model = lq.ErsiliaBinaryClassifier(time_budget_sec=600, estimator_list=["rf", "lgbm", "xgboost"])
But when running it initially, I used a stratified K-fold classifier with a time budget of 1200.
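For context, a minimal usage sketch of that classifier; it assumes lazyqsar follows the usual fit / predict_proba convention on lists of SMILES, which should be checked against the lazyqsar docs:

import lazyqsar as lq
from sklearn.metrics import roc_auc_score

# smiles_train, y_train, smiles_test, y_test come from the existing split
model = lq.ErsiliaBinaryClassifier(time_budget_sec=600, estimator_list=["rf", "lgbm", "xgboost"])
model.fit(smiles_train, y_train)

# Assumption: predict_proba returns scikit-learn-style per-class probabilities
proba = model.predict_proba(smiles_test)
print("AUROC:", roc_auc_score(y_test, proba[:, 1]))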
Hi @leilayesufu,
When I use padelpy, some values are too large and I have had to impute them, otherwise they cannot be used for model training; I am surprised you did not face the same issue. For the Ersilia and Mordred descriptors I am testing lazyqsar, which is basically the same as AutoML, for consistency with other models in the Hub. I've set a reduced time budget (600 s instead of 1200 s) to prevent overfitting, but we can try the longer one as well if we think it will significantly improve the results. Please also revise the portion of code that deals with SMILES processing, to make sure you follow the same logic for future models. Thanks!
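For illustration, one way to handle such too-large descriptor values; this is a hedged sketch rather than the exact code used here, and the 1e10 cutoff is an arbitrary assumption:

import numpy as np
from sklearn.impute import SimpleImputer

# X is a pandas DataFrame of PaDEL descriptors; flag infinities and
# extreme values as missing
X = X.replace([np.inf, -np.inf], np.nan)
X = X.mask(X.abs() > 1e10)  # hypothetical cutoff for "too large"

# Fill the flagged entries with the column medians
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)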
@GemmaTuron I changed the training time from 600 s to 1200 s and I still got around the same results. I have followed the logic you set, applied it to the remaining models, and created a PR.
Hi @leilayesufu
I see some things I do not understand in your code, for example the catboost folder?
@GemmaTuron That folder was created when I was trying to run the AutoML; I have updated the PR.
Hi @leilayesufu
A few pointers to keep making the work easy for everyone to follow; hope you find them useful!
I'll let you know once the final models are ready and we can then discuss what we do about the performance, which is not great. Meanwhile, let's continue the work on the other DILI models!
Other pointers I am finding while working on the code:
When you are obtaining embeddings or fingerprints and want to put them in a dataframe, you need one column per descriptor value, not a single column holding the whole vector:

tdc_data['embeddings'] = tdc_data['smiles'].apply(lambda x: model.transform([x])[0])

This will create a single "embeddings" column, whereas you'd like to have something like:

X_train = pd.DataFrame(X_train, columns=["eosce_{}".format(i) for i in range(len(X_train[0]))])
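Putting the two pointers together, a minimal sketch of the suggested pattern (variable names are illustrative; it assumes model.transform returns one fixed-length vector per SMILES):

import pandas as pd

# One vector per molecule, computed with whichever featurizer is in use
vectors = [model.transform([smi])[0] for smi in tdc_data["smiles"]]

# One column per embedding dimension, not one column holding the whole vector
X = pd.DataFrame(vectors, columns=["eosce_{}".format(i) for i in range(len(vectors[0]))])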
Hi @leilayesufu !
I have finished the model refactoring; the changes are pushed to the model repo. Please revise the graphs I have produced and check whether they make sense. I am still not convinced, because the authors report an AUROC of 0.89 and we cannot get to these values; also, the models consistently perform better on the test sets than on the train-test splits, which is unusual.
@GemmaTuron I have gone through the notebooks; I think the AutoGluon models perform better than the AutoML models. Ersilia with AutoGluon has an AUROC of 0.68, and PaDEL with AutoGluon has an AUROC of 0.69. I also noticed that validation on the test sets performed significantly better when I was running it.
Since the external validation on a separate dataset achieved an AUROC above 0.80, this suggests that the model performs well on unseen data.
Let's try one last thing to see if we can improve the cross-validation exercise:
Please make sure not to modify the current test-train split (do not run the cell again); a sketch of one way to freeze it is below.
As a comment to the above, the simple sklearn classes RobustScaler and SelectKBest should be enough.
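On keeping the split fixed, a hedged sketch of one way to persist it so re-running cells cannot change it (the file name is hypothetical):

import joblib

# Save the split once, right after it is first created
joblib.dump((X_train, X_test, y_train, y_test), "fixed_split.joblib")

# In later sessions or notebooks, reload it instead of re-splitting
X_train, X_test, y_train, y_test = joblib.load("fixed_split.joblib")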
@GemmaTuron Good afternoon. I used both f_classif and mutual_info_classif, with k=100 and k=500, on both AutoML and AutoGluon, following this: https://github.com/ersilia-os/ersilia/issues/931#issuecomment-1932384872
The results I got were still in the range of the results we have been getting.
This is a snippet:

from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Step 1: Scale the descriptors with a scaler that is robust to outliers
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Reduce the PaDEL descriptors to the 100 best using SelectKBest
k_best_selector = SelectKBest(score_func=mutual_info_classif, k=100)
X_train_selected = k_best_selector.fit_transform(X_train_scaled, y_train)
X_test_selected = k_best_selector.transform(X_test_scaled)
AutoML: f_classif, 100 features = 0.68 ± 0.03; f_classif, 500 features = 0.66 ± 0.04; mutual_info_classif, 100 features = 0.67 ± 0.05; mutual_info_classif, 500 features = 0.67 ± 0.03
AutoGluon: f_classif, 100 features = 0.68 ± 0.05; f_classif, 500 features = 0.68 ± 0.03; mutual_info_classif, 100 features = 0.66 ± 0.05; mutual_info_classif, 500 features = 0.68 ± 0.03
I have created a PR with the figures
Thanks @leilayesufu. This is useful. @GemmaTuron it seems we are reaching a dead-end here. What is your opinion?
yes, I'll write to the authors requesting the original checkpoints and if we do not get them, I'd remove this model, unfortunately
Should we archive this model @DhanshreeA @miquelduranfrigola ?
I think we hit a dead end here. I would archive this model. Note that archiving repos may be inconvenient at times, since we'll lose most git functionalities. Should we just label this as archived in AirTable?
I have added the archived label in AirTable. In this case, I'd really archive the repo, as we will not work on it any more. I'll close this issue as well!
Model Name
Drug-induced liver injury prediction
Model Description
Prediction of clinically relevant drug-induced liver injury (DILI) based solely on drug structure, using binary classification methods. The results presented here are useful as a screening tool both in a clinical setting, for the assessment of DILI, and in the early stages of drug development, to rule out potentially hepatotoxic candidates.
Slug
dili-pred
Tag
Metabolism, Toxicity, Cytotoxicity
Publication
https://pubmed.ncbi.nlm.nih.gov/30325042/
Source Code
https://github.com/cptbern/QSAR_DILI_2019
License
None