all-contributors / ac-learn

ML platform for all contributors
MIT License

Graphs #20

Closed: Berkmann18 closed this issue 4 years ago

Berkmann18 commented 5 years ago

Visual representations of the data are lacking, so here's what is missing:

Berkmann18 commented 5 years ago

Interesting and fairly relevant article: https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28
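One of the techniques that article covers is resampling. Here's a minimal sketch of random oversampling of minority labels, assuming a simple `{text, label}` record shape (hypothetical, not ac-learn's actual format):

```ts
// Minimal sketch: random oversampling so every label matches the
// majority label's count. The `Example` shape is an assumption for
// illustration, not ac-learn's actual data format.
interface Example {
  text: string;
  label: string;
}

function oversample(data: Example[]): Example[] {
  // Group examples by label.
  const byLabel = new Map<string, Example[]>();
  for (const ex of data) {
    const bucket = byLabel.get(ex.label) ?? [];
    bucket.push(ex);
    byLabel.set(ex.label, bucket);
  }
  // Target size = size of the largest bucket.
  const target = Math.max(...[...byLabel.values()].map(b => b.length));
  const balanced: Example[] = [];
  for (const bucket of byLabel.values()) {
    balanced.push(...bucket);
    // Duplicate random members of under-represented labels until
    // the bucket reaches the target size.
    for (let i = bucket.length; i < target; i++) {
      balanced.push(bucket[Math.floor(Math.random() * bucket.length)]);
    }
  }
  return balanced;
}
```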

Berkmann18 commented 4 years ago

The visualisation in the /public directory isn't optimal, and there are better visualisations for getting answers from the data, starting with asking (new) questions:

  1. How spread are the labels in the dataset?
     (labelDist chart) There's quite a gap between the number of unclassifiable labels (those classified as `null`) and the rest, especially labels like `a11y` and `review`. Less pronounced, but still noticeable, is the `null`/`code` gap.
  2. On the playground model, what's the proportion of training/validation/test data? Is it a good split?
     (partitions chart) As expected, we have a 15/15/75 percentage split.
  3. On the playground model, what's Q2 like for each label?
     (partitionsByCategories chart) Obviously each category has more training observations, but some labels have no observations at all in the playground model's test or validation sets.
  4. What does the playground model use to train itself?
     (trainingLabelDist chart) Same big picture as in the first question, with some small differences.
  5. What does the playground model use to validate itself?
     (validationLabelDist chart) Sadly, only 18/31 labels (≈58.06%) are represented in the validation set.
  6. What does the playground model use to test itself?
     (testLabelDist chart) Sadly, only 16/31 labels (≈51.61%) are represented in the test set, and it doesn't seem to reflect the real world well, based on the repos I visited. (The sketch after this list shows how these coverage figures can be computed.)
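For reproducibility, here's a minimal sketch of how the figures above (label spread, split proportions, and per-partition label coverage) could be computed. The `LabelledRecord` shape, the `partition` field, and `loadDataset` are assumptions for illustration, not ac-learn's actual data model:

```ts
// Minimal sketch: label spread and per-partition label coverage.
// The record shape below is hypothetical; adapt it to how ac-learn
// actually stores its labelled data.
type Partition = 'training' | 'validation' | 'test';

interface LabelledRecord {
  label: string | null; // `null` marks unclassifiable entries
  partition: Partition;
}

// Q1: how many records carry each label?
function labelDistribution(records: LabelledRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const { label } of records) {
    const key = label ?? 'null';
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Q2: what percentage of records falls into each partition?
function partitionShares(records: LabelledRecord[]): Record<Partition, number> {
  const shares: Record<Partition, number> = { training: 0, validation: 0, test: 0 };
  for (const { partition } of records) shares[partition] += 1;
  for (const p of Object.keys(shares) as Partition[]) {
    shares[p] = (100 * shares[p]) / records.length;
  }
  return shares;
}

// Q5/Q6: which labels actually appear in a given partition? Comparing
// this against the full label set yields the 18/31 and 16/31 figures.
function labelCoverage(records: LabelledRecord[], partition: Partition): string[] {
  const seen = new Set(
    records.filter(r => r.partition === partition).map(r => r.label ?? 'null'),
  );
  return [...seen].sort();
}

// Example usage with a hypothetical dataset loader:
// const records = loadDataset();
// console.log(labelDistribution(records));   // Q1: label spread
// console.log(partitionShares(records));     // Q2: split proportions
// const all = new Set(records.map(r => r.label ?? 'null'));
// const val = labelCoverage(records, 'validation');
// console.log(`${val.length}/${all.size} labels in validation`); // Q5 coverage
```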