GenoML / genoml2

GenoML (genoml2) is an open source Python package. It is an automated machine learning (autoML) platform for genomics data
Apache License 2.0

Discrete testing performance metrics not matching up with ROC plot output #25

Closed m-makarious closed 3 years ago

m-makarious commented 3 years ago

Describe the current behavior: When users run GenoML's discrete testing to validate a previously trained model on an incoming dataset, the ROC AUC shown in the plot does not match the value reported in the performance metrics .csv.

Describe the expected behavior: They should match :)

Code to reproduce the issue: Running through the sequence outlined in the README (munging, training, harmonizing, re-training, tuning, and testing) reproduces the mismatch.

Other information / logs: Looking at the code, it looks like several values are being re-computed, which is likely what inflates the numbers in the performance metrics .csv. The fix shouldn't be too involved: make sure that the ROC AUC, along with the other metrics such as balanced accuracy and log loss, is computed only once.
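For illustration, here is a minimal sketch of the "compute once, reuse everywhere" idea. It is not GenoML's actual code: the function names (`roc_auc`, `compute_metrics_once`, `plot_label`, `metrics_csv`), the column names, and the toy data are all hypothetical, and the AUC is computed with a plain pairwise comparison rather than a library call so that the example has no dependencies. The point is only that the plot label and the .csv row read the same stored value, so they cannot drift apart.

```python
import csv
import io


def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (case, control) pairs in which the case
    has the higher score; ties count as half a win."""
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def compute_metrics_once(y_true, y_score, threshold=0.5):
    """Hypothetical helper: derive every metric a single time from the same
    labels and predicted probabilities, and return them in one dict."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "roc_auc": roc_auc(y_true, y_score),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def plot_label(metrics):
    """Text for the ROC plot legend, read from the stored metrics."""
    return f"ROC curve (area = {metrics['roc_auc']:.4f})"


def metrics_csv(metrics):
    """One-row CSV built from the same stored metrics (no re-computation)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(metrics))
    writer.writeheader()
    writer.writerow({name: f"{value:.4f}" for name, value in metrics.items()})
    return buffer.getvalue()


# Toy data, made up for this example: 1 = case, 0 = control.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

metrics = compute_metrics_once(y_true, y_score)
label = plot_label(metrics)
csv_text = metrics_csv(metrics)
```

Because both `plot_label` and `metrics_csv` only format values already held in `metrics`, a mismatch of the kind reported here would have to come from the formatting step rather than from a second, different calculation.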

m-makarious commented 3 years ago

Addressed in the newest push, and now reflected in the newest GenoML version. The issue was that we were re-calculating the metrics, which seemed to inflate the numbers (in the metrics file, not the plot!).

I also changed the output suffixes: since *allSamples may be a bit misleading, they are now *_validationCohort_allCasesControls_*, to clarify that the dataset hadn't been split (which I think is what we were initially going for).