beacon-biosignals / Lighthouse.jl

Performance evaluation tools for multiclass, multirater classification models

The great `evaluation_metrics_row` refactor! #69


hannahilea commented 2 years ago

Right now, `evaluation_metrics_row` is kludgy and tries to do way too much at once, which makes it (a) hard to know exactly what your outputs correspond to and (b) hard to customize any internal subfunction without rewriting the entire loop (or threading some new parameter everywhere). It also combines a bunch of different types of output metrics into a single hard-to-parse schema (`EvaluationRow`). We'd like to refactor this!

Plan (as devised w/ @ericphanson): split the current `evaluation_metrics_row` function into three separate (types of) functions (a rough sketch follows the list):

  1. a function that covers everything not dependent on a specific threshold OR on `predicted_hard_labels` (e.g., ROC curves, PR curves, maybe some calibration curves)
  2. function(s) that choose a threshold given the output of step 1; this would also allow choosing a specific desired threshold, e.g., the threshold for a given specificity OR a threshold based on calibration curves (the current default) OR a threshold based on ROC curve minimization (the other current default!)
  3. functions that calculate derived stats based on hardened predictions (e.g., confusion matrix, ea kappas, etc.), for both multiclass AND per-class rows; two methods:
    • (convenience option) takes a hardening option + threshold + observations, and computes `predicted_hard_labels` internally
    • takes `predicted_hard_labels` and goes from there
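
Here's a minimal sketch of how those three stages might fit together, assuming binary one-vs-rest soft labels. Every name here (`roc_curve`, `choose_threshold`, `hardened_metrics`) is hypothetical, not a final API proposal:

```julia
# Hypothetical shape of the three-stage split; illustrative only.

# Step 1: threshold-independent metrics — here just an ROC curve for a
# single class, swept over candidate thresholds.
function roc_curve(soft_labels, truths; thresholds=0.0:0.05:1.0)
    return map(thresholds) do t
        preds = soft_labels .>= t
        tpr = sum(preds .& truths) / max(sum(truths), 1)
        fpr = sum(preds .& .!truths) / max(sum(.!truths), 1)
        (; threshold=t, fpr, tpr)
    end
end

# Step 2: choose a threshold from step 1's output. This policy maximizes
# sensitivity * (1 - fpr); the other policies mentioned above
# (calibration-based, fixed specificity, ...) would be alternate methods.
function choose_threshold(roc)
    _, i = findmax(p -> p.tpr * (1 - p.fpr), roc)
    return roc[i].threshold
end

# Step 3, convenience method: hardens internally from a threshold...
hardened_metrics(soft_labels, truths, threshold::Real) =
    hardened_metrics(soft_labels .>= threshold, truths)

# ...and the base method takes hard labels directly.
function hardened_metrics(hard_labels, truths)
    tp = sum(hard_labels .& truths);   fp = sum(hard_labels .& .!truths)
    fn = sum(.!hard_labels .& truths); tn = sum(.!hard_labels .& .!truths)
    return (; confusion=[tp fp; fn tn], accuracy=(tp + tn) / length(truths))
end
```

Calling all three steps in order would then look like `hardened_metrics(soft, truth, choose_threshold(roc_curve(soft, truth)))`, with each intermediate result inspectable on its own.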

What is the output if you call all three steps??

Other important points:

ericphanson commented 2 years ago

I think one nice benefit here is that it'll be easier to inspect results with several thresholds. For example, I could make a report that starts with threshold-independent metrics like ROC curves, PR curves, etc. Then I could show metrics under several choices of threshold, e.g. "sensitivity >= s1" for a few choices of s1, "minimize calibration error", or "max sensitivity * (1 - fpr)" (i.e. from the ROC curve).
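
A sketch of that workflow, reusing the hypothetical helpers from the sketch above (`thresh_at_sensitivity` is likewise made up for illustration): compute the threshold-independent curves once, then report hardened metrics under several threshold policies.

```julia
# Policy: the largest threshold that still achieves sensitivity >= s
# (threshold 0 always qualifies, so the generator below is never empty).
thresh_at_sensitivity(roc, s) = maximum(p.threshold for p in roc if p.tpr >= s)

soft, truth = rand(100), rand(Bool, 100)  # stand-in predictions/labels
roc = roc_curve(soft, truth)              # step 1, computed once

policies = ["sensitivity >= 0.80" => thresh_at_sensitivity(roc, 0.80),
            "sensitivity >= 0.95" => thresh_at_sensitivity(roc, 0.95),
            "max tpr * (1 - fpr)" => choose_threshold(roc)]

for (name, t) in policies
    m = hardened_metrics(soft, truth, t)  # step 3, once per policy
    println(rpad(name, 22), " t = ", t, "  accuracy = ",
            round(m.accuracy; digits=3))
end
```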