GenoML / genoml2

GenoML (genoml2) is an open source Python package. It is an automated machine learning (autoML) platform for genomics data
Apache License 2.0
Overfitting issue with automated training - not nominating actual best model #21

Closed m-makarious closed 3 years ago

m-makarious commented 4 years ago

Describe the current behavior: Currently, during training, the best algorithm nominated might have an AUC or balanced accuracy less than or equal to 50% (meaning no better than chance) and/or a sensitivity or specificity equal to exactly 1 or 0, meaning that the model is classifying all samples as cases (or all as controls) and is not picking up on nuance.
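To illustrate the failure mode described above, here is a minimal sketch in plain Python. The labels, the 30/70 class split, and the helper names are hypothetical and are not taken from genoml2. It shows that a model which calls every sample a case ends up with sensitivity 1, specificity 0, and a balanced accuracy of exactly 50%.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = case, 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of cases correctly called
    specificity = tn / (tn + fp)  # fraction of controls correctly called
    return sensitivity, specificity


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy is the mean of sensitivity and specificity."""
    sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
    return (sensitivity + specificity) / 2


# Hypothetical data: 30 cases, 70 controls.
y_true = [1] * 30 + [0] * 70
# A degenerate model that predicts "case" for every sample.
y_pred_all_cases = [1] * 100

sens, spec = sensitivity_specificity(y_true, y_pred_all_cases)
bal_acc = balanced_accuracy(y_true, y_pred_all_cases)
print(sens, spec, bal_acc)
```

Such a model can still look competitive on a single metric, which is why it may be nominated when many algorithms are compared.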

Describe the expected behavior: This is to be expected, really. When a dozen or so algorithms compete, one of them can come out on top purely through chance or noise in the data, and this is likely to happen again. We should work to nominate a model that picks up on nuance and whose performance can't be attributed to chance alone.

Code to reproduce the issue: N/A

Other information / logs:

Suggested by @mikeDTI, and I agree, that we should add a built-in check that deals with this overfitting issue, which produces problems downstream. The currently suggested way forward is to check the best algorithm nominated and see whether its AUC/balanced accuracy can be attributed to chance and/or its sensitivity and specificity show that it isn't picking up on the nuances between samples. If that is the case, remove that algorithm from the performance metrics altogether and nominate the "second" best algorithm.

Will get cracking on this soon...

m-makarious commented 3 years ago

Training will not nominate an algorithm as "best algorithm" if its balanced accuracy is less than or equal to 50%, |sensitivity - specificity| is greater than 0.85, sensitivity equals 0 or 1, or specificity equals 0 or 1.

If no algorithm meets these requirements, the best algorithm is nominated as before (based on the metric the user chooses to maximize). Thanks @jfcarter2358 for the help!
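The nomination rule described in this comment could be sketched roughly as follows. This is not the genoml2 implementation, which is not shown in this thread. The function names, the dictionary keys (`auc_pct`, `balanced_accuracy_pct`, `sensitivity`, `specificity`), and the example results are all hypothetical, and the sketch assumes balanced accuracy is reported as a percentage while sensitivity and specificity are on a 0 to 1 scale.

```python
def passes_checks(row, max_gap=0.85):
    """Return True if an algorithm's metrics pass all of the sanity checks."""
    sens = row["sensitivity"]
    spec = row["specificity"]
    if row["balanced_accuracy_pct"] <= 50:
        return False  # no better than chance
    if abs(sens - spec) > max_gap:
        return False  # heavily skewed towards one class
    if sens in (0, 1) or spec in (0, 1):
        return False  # every sample called as a case or as a control
    return True


def nominate_best(results, metric="auc_pct"):
    """Pick the top algorithm by `metric`, preferring those that pass the checks."""
    candidates = [row for row in results if passes_checks(row)]
    # Fall back to the unfiltered list if nothing passes, as before the change.
    pool = candidates if candidates else results
    return max(pool, key=lambda row: row[metric])


# Hypothetical results table for three algorithms.
results = [
    {"algorithm": "AlgoA", "auc_pct": 91.0, "balanced_accuracy_pct": 50.0,
     "sensitivity": 1.0, "specificity": 0.0},
    {"algorithm": "AlgoB", "auc_pct": 78.0, "balanced_accuracy_pct": 71.0,
     "sensitivity": 0.70, "specificity": 0.72},
    {"algorithm": "AlgoC", "auc_pct": 74.0, "balanced_accuracy_pct": 66.0,
     "sensitivity": 0.60, "specificity": 0.72},
]

best = nominate_best(results)
print(best["algorithm"])
```

Here `AlgoA` has the highest value for the chosen metric but fails the checks, so the next-best passing algorithm is nominated instead. If only failing algorithms are supplied, the function falls back to ranking all of them by the chosen metric.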

This will be included in genoml2's next package release :)

mikeDTI commented 3 years ago

Respect https://github.com/jfcarter2358

On Wed, Dec 16, 2020 at 2:05 PM Mary B. Makarious notifications@github.com wrote:

Closed #21 https://github.com/GenoML/genoml2/issues/21.

--

Mike A. Nalls, PhD

Data Tecnica International http://www.datatecnica.com/