`min_train_window`, `quantile_by_key` args are not used

cmu-delphi / epipredict

Tools for building predictive models in epidemiology.

https://cmu-delphi.github.io/epipredict/

Other

10 stars 10 forks source link

`min_train_window`, `quantile_by_key` args are not used #169

Open brookslogan opened 1 year ago

brookslogan commented 1 year ago

Grepping around, @dshemetov and I were unable to see min_train_window or quantile_by_key actually used by arx_forecaster, arx_classifier, flatline_forecaster, or any other forecaster I missed. We did find:

There might be a PR for the min_train_window part; see comment here.
quantile_by_key docs reference layer_residual_quantiles()' by_key parameter, but I can't see it actually being forwarded along.

brookslogan commented 1 year ago

Some thoughts on naming things here and how they should function.

Naming:

@dshemetov, can you please add your suggestions?
max_n_recent, min_n_recent
max_n_times_per_epikey, min_n_times_per_epikey

Functionality:

Seems best to line up the ideas of training window and min train window, so if they request a min & max of 30, it won't drop stuff / error / etc. if one location is a little behind on reporting. (Something slightly suboptimal / a tradeoff: if say VI is behind on reporting, then its training time set is different than all the others. But I think we want to take this tradeoff.)
Can't think of other actually-competing options.

brookslogan commented 1 year ago

This also ties into discussions about alternative interpretations of n in epix_slide, or making epipredict naturally act on epi_archives. Right now, epiprocess main has a different interpretation with epix_slide's n parameter, as does the dev branch's replacement, the before parameter.

brookslogan commented 1 year ago

The min training window would also be useful as a step, or as part of step_training_window.

brookslogan commented 1 year ago

A potential bigger tradeoff we're taking with a max_n_recent, min_n_recent approach is potential incompatibility with how things are treated at test time. At train time, we're happy to go search for training data farther in the past. But at test time (since we're fitting one model across all locations/etc.), we are being more demanding: we don't output forecasts for locations that have the expected predictor data missing. Not sure there's a better alternative though.

brookslogan commented 1 year ago

And another trade-off: it'll be harder to get epix_slide to feed an appropriate time window of data in using its current approach. But we were considering changing how it does time windowing anyway. [On current epiprocess dev branch: we could just use before = 365000L to get everything, but that's not efficient. So then we have to think about what the max number of additional times might be grabbed by this n_recent approach.]

dshemetov commented 1 year ago

I like min_n_recent and max_n_recent
min_n_recent being mostly informational / for debugging sounds good