Closed: m-makarious closed this issue 3 years ago
@mikeDTI made a few fixes, and we are proposing this plot to replace the distribution plots GenoML currently generates at the training stage. Percent probability is plotted on the x-axis and sample counts on the y-axis. Each reported status (for discrete data, 0 = control and 1 = case) gets its own plot so that its probabilities can be assessed individually.
An example of the new probability plots:
An interpretation: 70 individuals with a reported status of 0 (controls) had a 0-10% probability of being predicted as a status of 1 (cases) in the withheld samples. 5 individuals initially reported as 0 had a 10-20% probability of being predicted as a status of 1 in the withheld samples, and so on.
Explanation: The probabilities are for predicted case status (r1), so controls (0) should skew toward the left of the plot and cases (1) toward the right. Having most of your controls fall in the 0-10% bin is good: you want as few controls as possible to be predicted as cases.
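To make the interpretation concrete, here is a minimal sketch of the counting behind such a plot: predicted case probabilities are grouped into 10% bins, separately for each reported status. The function name `bin_case_probabilities` and the sample values are made up for illustration and are not part of GenoML.

```python
from collections import defaultdict


def bin_case_probabilities(reported_status, case_probability, n_bins=10):
    """Count samples per probability bin, separately for each reported status.

    Hypothetical helper, not GenoML code. Bin i covers the range
    [i / n_bins, (i + 1) / n_bins); a probability of exactly 1.0 is
    placed in the last bin.
    """
    counts = defaultdict(lambda: [0] * n_bins)
    for status, prob in zip(reported_status, case_probability):
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range: {prob}")
        bin_index = min(int(prob * n_bins), n_bins - 1)
        counts[status][bin_index] += 1
    return dict(counts)


# Illustrative values only: three controls (0) and three cases (1).
status = [0, 0, 0, 1, 1, 1]
probability_of_case = [0.05, 0.08, 0.15, 0.95, 0.85, 0.45]

binned = bin_case_probabilities(status, probability_of_case)
print(binned[0])  # counts for controls, one entry per 10% bin
print(binned[1])  # counts for cases, one entry per 10% bin
```

Each list could then be drawn as a bar chart per status, with the bin ranges on the x-axis and the counts on the y-axis.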
This will be reflected in the next pip release of GenoML as well.
Describe the current behavior: Probability plots are drawn from the training dataset rather than the withheld sample dataset. This reports misleading results and produces blown-out, unscaled plots.
Describe the expected behavior: Plotting the predictions for the withheld samples instead should fix the problem.
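As a rough sketch of the intended behavior, the snippet below holds out a fraction of the samples, fits on the rest, and returns probabilities for the withheld samples only. Everything here is hypothetical: `split_indices`, `withheld_case_probabilities`, and the `fit_model` callback are illustrative names, and GenoML's actual training code is not shown in this issue.

```python
import random


def split_indices(n_samples, withheld_fraction=0.3, seed=0):
    """Shuffle sample indices and return (train_indices, withheld_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_withheld = int(round(n_samples * withheld_fraction))
    return indices[n_withheld:], indices[:n_withheld]


def withheld_case_probabilities(features, labels, fit_model,
                                withheld_fraction=0.3, seed=0):
    """Fit on the training split, then predict only for the withheld split.

    `fit_model` is an assumed callback: it takes training features and
    labels and returns a function mapping one feature row to a case
    probability.
    """
    train_idx, withheld_idx = split_indices(len(labels), withheld_fraction, seed)
    predict = fit_model([features[i] for i in train_idx],
                        [labels[i] for i in train_idx])
    # Only withheld samples are returned, so a plot never sees training data.
    return [(labels[i], predict(features[i])) for i in withheld_idx]


def toy_fit(train_features, train_labels):
    """Toy stand-in for a real classifier: the first feature is the probability."""
    return lambda row: row[0]


features = [[i / 10] for i in range(10)]
labels = [0] * 5 + [1] * 5
pairs = withheld_case_probabilities(features, labels, toy_fit)
print(len(pairs))  # number of withheld samples
```

The (reported status, probability) pairs returned here are what a probability plot would use as input, rather than predictions on the samples the model was fitted on.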
Code to reproduce the issue: Running any GenoML discrete training will show this.