Incorrect probability plots being exported during discrete training

GenoML / genoml2

GenoML (genoml2) is an open-source Python package. It is an automated machine learning (autoML) platform for genomics data.

Apache License 2.0

Please make sure that this is a bug.

System information:

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Mojave (v10.14.6)
GenoML Installed from (source or binary): Source
GenoML Version: v1.b9
Python Version: 3.7+

Describe the current behavior: Probability plots are generated from the training dataset instead of the withheld sample dataset. As a result, the plots report misleading information and come out blown-out and unscaled.

Describe the expected behavior: Probability plots should be built from predictions on the withheld samples. Pulling the withheld sample predictions should fix the problem.
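A minimal sketch of the expected behavior, assuming predictions are stored as rows keyed by a sample ID. The function name `withheld_predictions`, the row fields, and the sample values are hypothetical and are not taken from the GenoML code base.

```python
def withheld_predictions(prediction_rows, train_ids):
    """Keep only the rows for samples that were NOT used to fit the model."""
    train_ids = set(train_ids)
    return [row for row in prediction_rows if row["id"] not in train_ids]


# Made-up example rows: reported status and predicted probability of being a case.
all_rows = [
    {"id": "s1", "status": 0, "case_prob": 0.02},
    {"id": "s2", "status": 1, "case_prob": 0.99},
    {"id": "s3", "status": 0, "case_prob": 0.35},
    {"id": "s4", "status": 1, "case_prob": 0.71},
]
training_ids = ["s1", "s2"]  # hypothetical IDs of the samples used for training

# Only these rows should feed the probability plots.
rows_to_plot = withheld_predictions(all_rows, training_ids)
print([row["id"] for row in rows_to_plot])
```

Plotting only `rows_to_plot` keeps the samples the model has already seen out of the figure, which is the change described above.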

Code to reproduce the issue: Running any GenoML discrete training will show this.

Other Information / Logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

@mikeDTI made a few fixes, and we are proposing this plot to replace the distribution plots GenoML currently generates at the training stage. Percent probability is plotted on the x-axis and counts on the y-axis. Each reported status (for discrete, 0 = controls and 1 = cases) has its own plot so that probabilities can be assessed individually.

An example of the new probability plots: [image: new_plots]

An interpretation: in the withheld samples, 70 individuals with a reported status of 0 (controls) had a 0-10% probability of being predicted as a status of 1 (cases), 5 individuals reported as 0 had a 10-20% probability of being predicted as a status of 1, and so on.
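The counts behind such a plot can be sketched as follows. This is an illustrative example only: the function name `count_by_probability_bin` and the sample values are assumptions, not the GenoML implementation, and the probabilities are taken to be fractions between 0 and 1.

```python
from collections import defaultdict


def count_by_probability_bin(statuses, case_probs, n_bins=10):
    """Count samples per reported status in equal-width probability bins.

    With n_bins=10, index 0 holds the 0-10% bin, index 1 the 10-20% bin, etc.
    """
    counts = defaultdict(lambda: [0] * n_bins)
    for status, prob in zip(statuses, case_probs):
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range: {prob}")
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        counts[status][index] += 1
    return dict(counts)


# Made-up withheld-sample predictions (not real GenoML output).
reported_status = [0, 0, 0, 0, 1, 1, 1]
case_probability = [0.03, 0.08, 0.15, 0.62, 0.91, 0.97, 0.45]

histogram = count_by_probability_bin(reported_status, case_probability)
for status in sorted(histogram):
    print(status, histogram[status])
```

Each list corresponds to one plot: the entry at index 0 for status 0 is the number of controls with a 0-10% probability of being predicted as a case.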

Explanation: The probabilities are for the predicted case status (r1), so controls (0) skew towards the left, with more samples at low probabilities, and cases (1) skew towards the right. Having your controls sit in the 0-10% bin is good: you want as few controls as possible to be predicted as cases.

This will be reflected in the next pip release of GenoML as well.

Incorrect probability plots being exported during discrete training #24