allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org

Add support for significance testing, random splits, and other pro-reproducibility features #3100

Open · julianmichael opened this issue 4 years ago

julianmichael commented 4 years ago

Is your feature request related to a problem? Please describe. There's been a lot of talk in the community about reproducibility issues, including the effects of taking the max over hyperparameter settings and of overfitting to the test set.

Describe the solution you'd like There are some recommendations that have come out of the literature and the water cooler: standard hyperparameter search procedures, reporting the mean/stdev of multiple training runs, significance testing with a Bonferroni procedure on random splits, etc., and I think adoption of these techniques is usually driven by ease of use. So AllenNLP seems like a perfect place to implement tools to do this: significance testing relates to the choice of metric, data splits fit naturally into the trainer abstraction, and hyperparameter optimization is a feature of the training procedure. All of these are extant abstractions in AllenNLP, and any implementation of these features on top of AllenNLP would likely duplicate work. More concretely:

Hyperparameter optimization may still be out of scope for now / more appropriate for an outer layer like Beaker.

Describe alternatives you've considered One alternative is to leave this stuff as an "outer layer" for a separate library. However, the tools I'm describing seem too specific for Beaker and are conceptually closely tied to AllenNLP. Easy interop (e.g. with one-line significance testing) would require relying on the AllenNLP workflow anyway.

Additional context Partly inspired by the recent ACL 2019 best paper nominee "We need to talk about standard splits" by Kyle Gorman and Steven Bedrick.
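As a rough illustration of the significance-testing idea (and the "one-line significance testing" interop mentioned below): a paired t-test per comparison over per-split scores, with a Bonferroni correction across comparisons. The scores, systems, and choice of test are placeholders, not a proposed API.

```python
# Illustrative sketch only: compare each candidate system to a baseline with a
# paired t-test over per-split scores, Bonferroni-correcting for the number of
# comparisons. All scores below are made-up placeholders.
from scipy.stats import ttest_rel

# One score per random train/test split for each system.
baseline = [0.801, 0.795, 0.810, 0.788, 0.802]
candidates = {
    "model_a": [0.812, 0.806, 0.815, 0.799, 0.811],
    "model_b": [0.803, 0.797, 0.809, 0.790, 0.804],
}

alpha = 0.05
corrected_alpha = alpha / len(candidates)  # Bonferroni: divide alpha by the number of comparisons
for name, scores in candidates.items():
    statistic, p_value = ttest_rel(scores, baseline)
    verdict = "significant" if p_value < corrected_alpha else "not significant"
    print(f"{name}: p = {p_value:.4f} ({verdict} at corrected alpha = {corrected_alpha:.3f})")
```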

matt-gardner commented 4 years ago

Yes, having most of these inside allennlp would be great. I'm not sure that they fit the Trainer abstraction as you say, though, because of the assumptions the trainer makes. It feels like each of these would be something that sits on top of the main functionality, which is just designed for doing a single training run on a fixed dataset.

For each of your suggestions:

  1. Cross validation / random splits: something splits the data and constructs several config files (probably making good use of jsonnet for this), which then get run in turn (see the sketch after this list).
  2. Significance testing: you already say this would be a new module, and I agree. There must be libraries that already do this, so it's just a question of interfacing with them. We don't output predictions / metrics in a decent way most of the time, because we're just looking for aggregate metrics, so something would have to change here.
  3. Possible additions to Metric - it doesn't look feasible to me to have the metric class handle what you suggest, because it just operates on a single training run (and mostly gets cleared each epoch). It would have to live outside, as suggested in the point above. But there may be some way to signal (maybe just with good documentation?) what the right significance test is for each metric.
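A sketch of what point 1 could look like, assuming the existing `allennlp train` CLI and its `--overrides` flag; the dataset path, the line-per-example format, and `base_config.jsonnet` are placeholders.

```python
# Sketch of point 1 above: split one dataset into folds, write per-fold files,
# and launch `allennlp train` on a shared base config with per-fold overrides.
import json
import os
import subprocess
from sklearn.model_selection import KFold

os.makedirs("splits", exist_ok=True)
with open("data/all.jsonl") as f:   # placeholder dataset, one example per line
    lines = f.readlines()

for fold, (train_idx, val_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(lines)):
    # Write this fold's train and validation files.
    for split_name, idx in (("train", train_idx), ("validation", val_idx)):
        with open(f"splits/fold{fold}_{split_name}.jsonl", "w") as out:
            out.writelines(lines[i] for i in idx)

    # Point the shared base config at this fold's files and run training.
    overrides = json.dumps({
        "train_data_path": f"splits/fold{fold}_train.jsonl",
        "validation_data_path": f"splits/fold{fold}_validation.jsonl",
    })
    subprocess.run(
        ["allennlp", "train", "base_config.jsonnet",
         "-s", f"runs/fold{fold}", "-o", overrides],
        check=True,
    )
```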

So, yeah, most of what you're suggesting should live above / outside of the existing abstractions in allennlp. They seem central enough that I'm not opposed to having them in the main library, though, if someone wants to contribute stuff like this. We also do want to encourage people to have their own repos that interoperate with allennlp, to keep this repo's dependencies and maintenance burden down, so it would also be fine with me to have these tools in a separate repo.

bryant1410 commented 4 years ago

I've been thinking about cross-validation. I think it doesn't fit well in the Trainer, because you're already given a DataIterator instead of a DatasetReader, so things may already be batched, are hard to group (e.g., for leave-one-group-out CV), and are hard to stratify (e.g., for stratified k-fold CV).

I think it fits near where the train command is defined right now. train currently assumes a 1-, 2-, or 3-way holdout method: a train_data_path, possibly a validation_data_path, and possibly a test_data_path. To support CV, this would have to change. For CV, it should accept a single data_path and a DatasetReader (plus a "cross validator", such as those from scikit-learn, e.g. KFold). It could also support other schemes such as nested CV (see https://arxiv.org/abs/1811.12808), and maybe combinations such as CV for model selection plus a holdout test set.
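A minimal sketch of that shape, assuming any registered DatasetReader plus a scikit-learn cross validator; the `my_reader` type, the data path, and the `label` field name are placeholders, and grouped or stratified CV just needs the corresponding groups/labels pulled out of the instances.

```python
# Sketch only: one data_path + a DatasetReader + a scikit-learn cross validator.
# "my_reader", the data path, and the "label" field name are placeholders.
from allennlp.common import Params
from allennlp.data import DatasetReader
from sklearn.model_selection import StratifiedKFold

reader = DatasetReader.from_params(Params({"type": "my_reader"}))
instances = list(reader.read("data/all.jsonl"))
labels = [str(instance["label"].label) for instance in instances]  # assumes a LabelField named "label"

cross_validator = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)
for fold, (train_idx, val_idx) in enumerate(cross_validator.split(instances, labels)):
    train_instances = [instances[i] for i in train_idx]
    val_instances = [instances[i] for i in val_idx]
    # ... hand each fold's instances to a freshly built trainer ...
```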

I imagine it as the trainer accepting some data_split or evaluation object, which could be of type holdout or cv, or something like that. That would move some pieces out of TrainModel.from_partial_objects. Put differently, the train command would have an evaluation regime, which could be holdout, CV, or something else; that's how I see it, at least.
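Something like the hypothetical config shape below; none of these keys exist in AllenNLP today, it's only meant to illustrate the "evaluation regime" idea.

```python
# Hypothetical config shape only; none of these keys exist in AllenNLP today.
evaluation_regimes = {
    "holdout": {  # what `train` effectively does now
        "type": "holdout",
        "train_data_path": "data/train.jsonl",
        "validation_data_path": "data/validation.jsonl",
    },
    "cv": {  # a single data_path plus a cross validator, as described above
        "type": "cv",
        "data_path": "data/all.jsonl",
        "cross_validator": {"type": "k_fold", "n_splits": 5, "shuffle": True},
    },
}
```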

@matt-gardner thoughts?

matt-gardner commented 4 years ago

If I were trying to do this, I'd implement a class very much like TrainModel, maybe called CrossValidateModel, that took whatever inputs it needed, then probably instantiated a trainer inside of a for loop, to do the cross validation.
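A schematic sketch of that shape, with hypothetical `build_model`/`build_trainer` callables standing in for AllenNLP's actual FromParams/Lazy machinery:

```python
# Schematic sketch only: hypothetical helper callables, not AllenNLP's actual
# FromParams/Lazy machinery.
from statistics import mean, stdev

class CrossValidateModel:
    def __init__(self, folds, build_model, build_trainer, metric_name="accuracy"):
        self.folds = folds                  # list of (train_instances, validation_instances) pairs
        self.build_model = build_model      # () -> freshly initialized model
        self.build_trainer = build_trainer  # (model, train, validation) -> trainer
        self.metric_name = metric_name

    def run(self):
        scores = []
        for train_instances, validation_instances in self.folds:
            model = self.build_model()  # fresh parameters for every fold
            trainer = self.build_trainer(model, train_instances, validation_instances)
            metrics = trainer.train()   # trainers return a metrics dict
            # Key name assumed; depends on what the trainer reports.
            scores.append(metrics[f"best_validation_{self.metric_name}"])
        return {"mean": mean(scores), "stdev": stdev(scores), "per_fold": scores}
```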

bryant1410 commented 4 years ago

Yeah, that's actually what I did after writing that message; I thought it wasn't a good approach, but it actually is. There are some caveats, which I can discuss once I have something.