all-contributors / ac-learn

ML platform for all contributors
MIT License

Graphs #20

Closed: Berkmann18 closed this issue 4 years ago

Berkmann18 commented 5 years ago

Visual representations of the data are lacking, so here's what is missing:

Berkmann18 commented 5 years ago

Interesting and fairly relevant article: https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28
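One of the techniques that article covers is resampling. Here's a minimal sketch of random oversampling of minority labels, assuming a simple `{text, label}` record shape (hypothetical, not ac-learn's actual format):

```ts
// Minimal sketch: random oversampling so every label matches the
// majority label's count. The `Example` shape is an assumption for
// illustration, not ac-learn's actual data format.
interface Example {
  text: string;
  label: string;
}

function oversample(data: Example[]): Example[] {
  // Group examples by label.
  const byLabel = new Map<string, Example[]>();
  for (const ex of data) {
    const bucket = byLabel.get(ex.label) ?? [];
    bucket.push(ex);
    byLabel.set(ex.label, bucket);
  }
  // Target size = size of the largest bucket.
  const target = Math.max(...[...byLabel.values()].map(b => b.length));
  const balanced: Example[] = [];
  for (const bucket of byLabel.values()) {
    balanced.push(...bucket);
    // Duplicate random members of under-represented labels until
    // the bucket reaches the target size.
    for (let i = bucket.length; i < target; i++) {
      balanced.push(bucket[Math.floor(Math.random() * bucket.length)]);
    }
  }
  return balanced;
}
```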

Berkmann18 commented 4 years ago

The visualisation in the /public directory isn't optimal, and there are better visualisations for getting answers from the data, starting with asking (new) questions:

  1. How spread are the labels in the dataset?
     (labelDist chart) There's quite a gap between the number of unclassifiable labels (those classified as `null`) and the rest, especially labels like `a11y` and `review`. Less pronounced, but still noticeable, is the `null`/`code` gap.
  2. On the playground model, what's the proportion of training/validation/test data? Is it a good split?
     (partitions chart) As expected, we have a 15/15/75 percentage split.
  3. On the playground model, what's Q2 like for each label?
     (partitionsByCategories chart) Obviously each category has more training observations, but some labels have no observations at all in the playground model's test or validation sets.
  4. What does the playground model use to train itself?
     (trainingLabelDist chart) Same big picture as in the first question, with some small differences.
  5. What does the playground model use to validate itself?
     (validationLabelDist chart) Sadly, only 18/31 labels (≈58.06%) are represented in the validation set.
  6. What does the playground model use to test itself?
     (testLabelDist chart) Sadly, only 16/31 labels (≈51.61%) are represented in the test set, and it doesn't seem to reflect the real world well, based on the repos I visited. (The sketch after this list shows how these coverage figures can be computed.)
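For reproducibility, here's a minimal sketch of how the figures above (label spread, split proportions, and per-partition label coverage) could be computed. The `LabelledRecord` shape, the `partition` field, and `loadDataset` are assumptions for illustration, not ac-learn's actual data model:

```ts
// Minimal sketch: label spread and per-partition label coverage.
// The record shape below is hypothetical; adapt it to how ac-learn
// actually stores its labelled data.
type Partition = 'training' | 'validation' | 'test';

interface LabelledRecord {
  label: string | null; // `null` marks unclassifiable entries
  partition: Partition;
}

// Q1: how many records carry each label?
function labelDistribution(records: LabelledRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const { label } of records) {
    const key = label ?? 'null';
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Q2: what percentage of records falls into each partition?
function partitionShares(records: LabelledRecord[]): Record<Partition, number> {
  const shares: Record<Partition, number> = { training: 0, validation: 0, test: 0 };
  for (const { partition } of records) shares[partition] += 1;
  for (const p of Object.keys(shares) as Partition[]) {
    shares[p] = (100 * shares[p]) / records.length;
  }
  return shares;
}

// Q5/Q6: which labels actually appear in a given partition? Comparing
// this against the full label set yields the 18/31 and 16/31 figures.
function labelCoverage(records: LabelledRecord[], partition: Partition): string[] {
  const seen = new Set(
    records.filter(r => r.partition === partition).map(r => r.label ?? 'null'),
  );
  return [...seen].sort();
}

// Example usage with a hypothetical dataset loader:
// const records = loadDataset();
// console.log(labelDistribution(records));   // Q1: label spread
// console.log(partitionShares(records));     // Q2: split proportions
// const all = new Set(records.map(r => r.label ?? 'null'));
// const val = labelCoverage(records, 'validation');
// console.log(`${val.length}/${all.size} labels in validation`); // Q5 coverage
```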