beacon-biosignals / Lighthouse.jl

Performance evaluation tools for multiclass, multirater classification models

The great `evaluation_metrics_row` refactor! #69


hannahilea commented 2 years ago

Right now, `evaluation_metrics_row` is kludgy and tries to do way too much at once, which makes it (a) hard to know exactly what your outputs correspond to and (b) hard to customize any internal subfunction without rewriting the entire loop (or threading some new parameter everywhere). It also combines a bunch of different types of output metrics into a single hard-to-parse schema (`EvaluationRow`). We'd like to refactor this!

Plan (as devised w/ @ericphanson): split the current `evaluation_metrics_row` function into three separate (types of) functions (a rough sketch follows the list):

  1. a function that covers everything not dependent on a specific threshold OR on `predicted_hard_labels` (e.g., ROC curves, PR curves, maybe some calibration curves)
  2. function(s) that choose a threshold given the output of step 1; this would also allow choosing a specific desired threshold, e.g., the threshold for a given specificity OR a threshold based on calibration curves (the current default) OR a threshold based on ROC curve minimization (the other current default!)
  3. functions that calculate derived stats based on hardened predictions (e.g., confusion matrix, ea kappas, etc.), for both multiclass AND per-class rows; two methods:
    • (convenience option) takes a hardening option + threshold + observations, and computes `predicted_hard_labels` internally
    • takes `predicted_hard_labels` and goes from there
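
Here's a minimal sketch of how those three stages might fit together, assuming binary one-vs-rest soft labels. Every name here (`roc_curve`, `choose_threshold`, `hardened_metrics`) is hypothetical, not a final API proposal:

```julia
# Hypothetical shape of the three-stage split; illustrative only.

# Step 1: threshold-independent metrics — here just an ROC curve for a
# single class, swept over candidate thresholds.
function roc_curve(soft_labels, truths; thresholds=0.0:0.05:1.0)
    return map(thresholds) do t
        preds = soft_labels .>= t
        tpr = sum(preds .& truths) / max(sum(truths), 1)
        fpr = sum(preds .& .!truths) / max(sum(.!truths), 1)
        (; threshold=t, fpr, tpr)
    end
end

# Step 2: choose a threshold from step 1's output. This policy maximizes
# sensitivity * (1 - fpr); the other policies mentioned above
# (calibration-based, fixed specificity, ...) would be alternate methods.
function choose_threshold(roc)
    _, i = findmax(p -> p.tpr * (1 - p.fpr), roc)
    return roc[i].threshold
end

# Step 3, convenience method: hardens internally from a threshold...
hardened_metrics(soft_labels, truths, threshold::Real) =
    hardened_metrics(soft_labels .>= threshold, truths)

# ...and the base method takes hard labels directly.
function hardened_metrics(hard_labels, truths)
    tp = sum(hard_labels .& truths);   fp = sum(hard_labels .& .!truths)
    fn = sum(.!hard_labels .& truths); tn = sum(.!hard_labels .& .!truths)
    return (; confusion=[tp fp; fn tn], accuracy=(tp + tn) / length(truths))
end
```

Calling all three steps in order would then look like `hardened_metrics(soft, truth, choose_threshold(roc_curve(soft, truth)))`, with each intermediate result inspectable on its own.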

What is the output if you call all three steps??

Other important points:

ericphanson commented 2 years ago

I think one nice benefit here is that it'll be easier to inspect results with several thresholds. For example, I could make a report that starts with threshold-independent metrics like ROC curves, PR curves, etc. Then I could show metrics under several choices of threshold, e.g. "sensitivity >= s1" for a few choices of s1, "minimize calibration error", or "max sensitivity * (1 - fpr)" (i.e. from the ROC curve).
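
A sketch of that workflow, reusing the hypothetical helpers from the sketch above (`thresh_at_sensitivity` is likewise made up for illustration): compute the threshold-independent curves once, then report hardened metrics under several threshold policies.

```julia
# Policy: the largest threshold that still achieves sensitivity >= s
# (threshold 0 always qualifies, so the generator below is never empty).
thresh_at_sensitivity(roc, s) = maximum(p.threshold for p in roc if p.tpr >= s)

soft, truth = rand(100), rand(Bool, 100)  # stand-in predictions/labels
roc = roc_curve(soft, truth)              # step 1, computed once

policies = ["sensitivity >= 0.80" => thresh_at_sensitivity(roc, 0.80),
            "sensitivity >= 0.95" => thresh_at_sensitivity(roc, 0.95),
            "max tpr * (1 - fpr)" => choose_threshold(roc)]

for (name, t) in policies
    m = hardened_metrics(soft, truth, t)  # step 3, once per policy
    println(rpad(name, 22), " t = ", t, "  accuracy = ",
            round(m.accuracy; digits=3))
end
```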