beacon-biosignals / Lighthouse.jl

Performance evaluation tools for multiclass, multirater classification models
MIT License

RFC: Remove training harness wrappers (`learn!`, `evaluate!`) from Lighthouse.jl? #81

Open · hannahilea opened this issue 2 years ago

hannahilea commented 2 years ago

I recently made a time-boxed* attempt at deprecating EvaluationRow (#71), and it was Not A Fun Time™. I'm not sure what the correct thing to do here is, and I think it is very possible that we should remove some (all?) of the training harness aspects of Lighthouse.jl altogether---or at least, we should figure out the minimal set of them that users actually depend on.

We've spent a certain amount of time maintaining the existing harness behavior, but AFAIK our internal usage of the harness itself is basically non-existent, as teams have combined the Lighthouse components in their own ways to support their specific use-cases. (Specifics to be left to a Beacon Slack thread!) Externally, I would be surprised if anyone is depending on them (although if someone is, please say the word!). Additionally, https://github.com/beacon-biosignals/LighthouseFlux.jl/blob/main/src/LighthouseFlux.jl does not use learn! or evaluate!.

*technically it was also [hurtling-through-]space-boxed, huzzah for transcontinental flight....


The particular hole I fell down was trying to cleanly swap out the use of evaluation_metrics_row (and its corresponding log_evaluation_row!) inside evaluate!.

https://github.com/beacon-biosignals/Lighthouse.jl/blob/0540cdd5fcfc7a95cdd0b8c080febc42da736dbe/src/learn.jl#L233

Why? Well, the existing usage conflates tradeoff metrics and hardened metrics, and we shouldn't be computing (and logging) both there without the caller specifying things like how the per-class binarization should happen (to create the hardened metrics), whether they also want multirater label metrics computed (i.e., if votes exist), and whether they want the additional metrics that (until now) have only been computed for binary classification problems.

At the point that we expose separate interfaces for the various combinations of inputs a user might want (multiclass vs. binary, single-rater vs. multirater, binarization choice, pre-computed predicted_hard_labels for multiclass input), we will have created a lot more code to maintain for a lot less benefit. At the end of the day, asking the caller to implement their own evaluate!(model::AbstractClassifier, ...) would be easier.

...which seems like a viable option. Except what inputs would a template evaluate! function take? What combination of the existing

    predicted_hard_labels::AbstractVector,
    predicted_soft_labels::AbstractMatrix,
    elected_hard_labels::AbstractVector,
    classes, logger;
    logger_prefix, logger_suffix,
    votes::Union{Nothing,AbstractMatrix}=nothing,
    thresholds=0.0:0.01:1.0, optimal_threshold_class::Union{Nothing,Integer}=nothing

? Because the choice depends on the type of metrics being computed in the first place.

If various projects were depending on this evaluate! function, it seems like choosing some minimal required arguments (

    predicted_soft_labels::AbstractMatrix,
    elected_hard_labels::AbstractVector,
    classes, logger;
    logger_prefix, logger_suffix

?) and passing the others as kwargs from the learn! function might do it. Buuuut I'm not sure how many folks use this function in the first place!
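
For concreteness, here is a rough sketch of what a caller-owned version with that minimal argument set could look like, assembled from the primitives mentioned above. Note that the evaluation_metrics_row / log_evaluation_row! calls are illustrative (their exact signatures and keywords may not match the current API), and the argmax binarization is just one arbitrary choice that the caller would now own:

    # Hypothetical project-owned replacement for Lighthouse.evaluate!; the
    # evaluation_metrics_row / log_evaluation_row! calls are illustrative and
    # may not exactly match the package's current signatures.
    using Lighthouse

    function my_evaluate!(predicted_soft_labels::AbstractMatrix,
                          elected_hard_labels::AbstractVector,
                          classes, logger;
                          logger_prefix="my_eval", logger_suffix="",
                          votes=nothing, thresholds=0.0:0.01:1.0,
                          optimal_threshold_class=nothing)
        # The caller decides how soft labels get binarized into hard labels
        # (here: plain per-row argmax) instead of the harness deciding for them.
        predicted_hard_labels = map(argmax, eachrow(predicted_soft_labels))
        metrics = Lighthouse.evaluation_metrics_row(predicted_hard_labels,
                                                    predicted_soft_labels,
                                                    elected_hard_labels,
                                                    classes, thresholds;
                                                    votes=votes,
                                                    optimal_threshold_class=optimal_threshold_class)
        Lighthouse.log_evaluation_row!(logger,
                                       string(logger_prefix, "/metrics", logger_suffix),
                                       metrics)
        return metrics
    end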


Which brings me to: what is the correct thing to do here? And does anyone use the learn! and evaluate! functions, or could/should we remove them except as examples (in the docs) for how to implement a custom loop for one's own model?

To quote @ericphanson, when talking about this issue,

I think a harness like that is just too rigid. We just need a lot of useful primitives like metrics, logging, plots, etc., and then we can combine them in different ways as makes sense for the project.

I agree---and so I think losing the evaluate! and learn! calls as they currently exist would allow us to focus on those primitives and sink less time into maintaining code that isn't used. And if more flexible (or smaller scoped) harnesses make their way back into Lighthouse.jl, I think that would be cool also, but we should let that happen from the ground up once we create training loops that actively work for us and could be shared.

ericphanson commented 2 years ago

I agree! I think one annoying thing about this harness is that if you want access to some variable inside train!, you need to "smuggle" it in via the model object. E.g. right now learn! is (simplified):

    for current_epoch in 1:epoch_limit
        train!(model, get_train_batches(), logger)
        # ...other stuff...
        post_epoch_callback(current_epoch)
    end

so if you want to know the current epoch inside train! (e.g. for checkpointing, so you can save out the epoch and batch number along with a model partway through an epoch), you need to do something unpleasant like putting a mutable Ref inside your model to count the epoch and updating it in the post_epoch_callback (...which I actually do...).
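
For illustration, that workaround looks roughly like the following sketch (the wrapper type and field names here are made up; the Ref-in-the-model pattern is the only point):

    # Sketch of the "smuggling" workaround described above; the wrapper type and
    # field names are hypothetical.
    struct EpochCounterClassifier
        net::Any                           # the underlying Flux model
        current_epoch::Base.RefValue{Int}  # mutable state smuggled along for checkpointing
    end

    model = EpochCounterClassifier(nothing, Ref(1))

    # train!(model, ...) can read model.current_epoch[] when it writes a mid-epoch
    # checkpoint, and the post-epoch callback keeps the counter in sync:
    post_epoch_callback(epoch) = (model.current_epoch[] = epoch + 1)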

One fix to that particular problem would be to also pass in the epoch to train!. But it doesn't solve the general problem of wanting to pass other stuff in. E.g. if I have a parameter that governs how often I do a more expensive logging operation, I need to put that inside the model too. And if I want to evaluate on multiple datasets and keep the results separate, I need to pass in the parameters to get_test_batches and then overload evaluate!. Which is possible but just kinda annoying.

I think the main benefit of a harness is that it's easy to get started; it tells you what things you need (although TBH the current docs are kinda confusing on this point) and then assembles them into a basic training harness for you. IMO that benefit could be obtained by having an example (one that runs in CI so it doesn't bitrot) that shows you a basic harness. E.g. we could start just by moving this code from the package to an example file (with the key point being that we can freely change the example code without breaking anything downstream). Then when you start a new modelling project, you can copy-paste the example into your codebase and start tweaking it.
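
To make that concrete, the copy-pasted example could be as small as the sketch below, where the epoch and the logging cadence are plain locals in code the project owns (every function name here is a placeholder for something the project would define itself, not Lighthouse API):

    # Schematic user-owned training loop; all of these functions are placeholders
    # the project would define itself rather than Lighthouse API.
    function run_training!(model, logger; epoch_limit=10, expensive_log_every=5)
        for current_epoch in 1:epoch_limit
            train_one_epoch!(model, get_train_batches(), logger)
            # The cadence for expensive logging is just a keyword argument,
            # not state hidden inside the model:
            if current_epoch % expensive_log_every == 0
                log_expensive_diagnostics!(logger, model, current_epoch)
            end
            # Evaluating on multiple datasets just means calling this again with
            # different batches and a different logger_prefix:
            evaluate_and_log!(model, get_test_batches(), logger;
                              logger_prefix="test/epoch_$(current_epoch)")
        end
        return model
    end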

FWIW Flux itself has had this discussion since they provide a training loop, e.g. https://discourse.julialang.org/t/is-it-a-good-time-for-a-pytorch-developer-to-move-to-julia-if-so-flux-knet/38453/47. They currently do provide one but I've often seen people say that's mostly for examples and you should copy the loop and modify it yourself once you need to. IMO in our case the maintenance costs outweigh the benefits.

Additionally, https://github.com/beacon-biosignals/LighthouseFlux.jl/blob/main/src/LighthouseFlux.jl does not use learn! or evaluate!.

That's true, though the intent is that if you use LighthouseFlux, it just implements train! for you, and then you can use Lighthouse.learn! and "everything will work", i.e. the learn! and evaluate! methods in Lighthouse will work with that Flux model.

kleinschmidt commented 2 years ago

I generally agree that the training harness is too rigid to be really useful (in a recent modeling effort we considered building around it but it was gonna be hard to integrate into the distributed harness we were setting up).

IMO that benefit could be obtained by having an example (that runs in CI so it doesn't bitrot) that shows you a basic harness. E.g. we could start just by moving this code from the package to an example file (with the key point that we can freely change the example code without breaking anything downstream).

That would be well served by a doctest IMO, which is/can be run in CI and also is surfaced in the docs so it's clearly an example!

ericphanson commented 2 years ago

That would be well served by a doctest IMO, which is/can be run in CI and also is surfaced in the docs so it's clearly an example!

Hm, not really sure about this-- it could be like a several-hundred-line doctest if it's really defining learn!, train!, evaluate!, predict!, etc.

I think there could be overlap with https://github.com/beacon-biosignals/LegolasFlux.jl/blob/main/examples/digits.jl, which is run in LegolasFlux's tests to test model serialization on a more complicated example. I wonder if there could be an "example model" that uses & checks that everything works together (Lighthouse, logging, serialization, etc.), which could serve as a starting point for new models, be open-source, and help with reproducing issues, etc.