julianmichael opened this issue 5 years ago
Yes, having most of these inside allennlp would be great. I'm not sure that they fit the `Trainer` abstraction as you say, though, because of the assumptions the trainer makes. It feels like each of these would be something that sits on top of the main functionality, which is just designed for doing a single training run on a fixed dataset.
For each of your suggestions:
- `Metric`: it doesn't look feasible to me to have the metric class handle what you suggest, because it just operates on a single training run (and mostly gets cleared each epoch). It would have to live outside, as suggested in the point above. But there may be some way to signal (maybe just with good documentation?) what the right significance test is for each metric; a tiny sketch of that idea follows below.

So, yeah, most of what you're suggesting should live above / outside of the existing abstractions in allennlp. They seem central enough that I'm not opposed to having them in the main library, though, if someone wants to contribute stuff like this. We also do want to encourage people to have their own repos that interoperate with allennlp, to keep this repo's dependencies and maintenance burden down, so it would also be fine with me to have these tools in a separate repo.
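As a purely hypothetical sketch of that "signaling" idea (the `recommended_significance_test` attribute below is not part of AllenNLP's `Metric` API, just an illustration of attaching the information to the metric itself):

```python
from allennlp.training.metrics import CategoricalAccuracy

# Hypothetical convention: a Metric subclass documents which significance test
# is appropriate for comparing two models on it.  Nothing in AllenNLP reads
# this attribute; it only shows where the information could live.
class AccuracyWithDocumentedTest(CategoricalAccuracy):
    recommended_significance_test = "paired Student's t-test over per-example scores"

print(AccuracyWithDocumentedTest.recommended_significance_test)
```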
I've been thinking about cross-validation. I think it doesn't fit well as a `Trainer` because you're already given a `DataIterator` instead of a `DatasetReader`, so things may already be batched, are hard to group (e.g., leave-one-group-out CV), and are hard to stratify (e.g., stratified k-fold CV).
I think it fits near where the `train` command definition is right now. `train` currently assumes a 1-, 2-, or 3-way holdout method: a `train_data_path`, possibly a `validation_data_path`, and possibly a `test_data_path`. To accept CV, this would have to change. For CV, it should accept only one `data_path` and a `DatasetReader` (plus a "cross validator", such as those from scikit-learn, e.g. `KFold`). It could also accept other things such as nested CV (see https://arxiv.org/abs/1811.12808), and maybe combinations such as CV for model selection plus a holdout test set.
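A minimal sketch of that "one `data_path` plus a reader plus a cross validator" shape, using scikit-learn's `KFold` over the instances a reader produces (the reader choice and file path here are placeholders):

```python
from allennlp.data.dataset_readers import SequenceTaggingDatasetReader
from sklearn.model_selection import KFold

reader = SequenceTaggingDatasetReader()
# Single data_path, no pre-made train/validation/test split.
instances = list(reader.read("data/all.tsv"))

cross_validator = KFold(n_splits=5, shuffle=True, random_state=13)
for fold, (train_idx, val_idx) in enumerate(cross_validator.split(instances)):
    train_instances = [instances[i] for i in train_idx]
    val_instances = [instances[i] for i in val_idx]
    # ...index with a Vocabulary, build iterators/loaders, and run a fresh
    # Trainer on this fold...
```

Working at the reader/instance level is also what makes swapping in `StratifiedKFold` or `LeaveOneGroupOut` straightforward, which is hard to do once a `DataIterator` has already batched things.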
I imagine it as the trainer accepting some `data_split` or `evaluation` thing, which could be of type `holdout` or `cv`, or something like that. That would take some parts out of `TrainModel.from_partial_objects`. In other words, the `train` command would have an "evaluation regime" which could be holdout, CV, or something else; that's how I see it, at least.
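A hedged sketch of what that "evaluation regime" could look like at the config level, written here as a plain Python dict; the `data_path`, `evaluation`, and `splitter` keys are hypothetical and do not exist in AllenNLP's config schema:

```python
# Hypothetical config fragment: a single data_path plus an "evaluation" block
# that selects the regime (holdout vs. cross-validation), instead of the
# current train/validation/test path triple.  Purely illustrative.
config = {
    "dataset_reader": {"type": "sequence_tagging"},
    "data_path": "data/all.tsv",
    "evaluation": {
        "type": "cv",  # or "holdout"
        "splitter": {"type": "k_fold", "n_splits": 5, "shuffle": True},
        # a nested block here could express nested CV, or CV for model
        # selection combined with a holdout test set
    },
    "model": {"type": "simple_tagger"},
    "trainer": {"optimizer": "adam", "num_epochs": 10},
}
```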
@matt-gardner thoughts?
If I were trying to do this, I'd implement a class very much like `TrainModel`, maybe called `CrossValidateModel`, that took whatever inputs it needed, then probably instantiated a trainer inside of a for loop to do the cross-validation.
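A rough sketch of that shape, with `build_model` and `build_trainer` standing in for whatever `from_partial_objects`-style construction the real `CrossValidateModel` would do (this does not match any actual AllenNLP class):

```python
import statistics

def cross_validate(instances, splitter, build_model, build_trainer,
                   metric_name="accuracy"):
    """Train one fresh model per fold and aggregate the validation metric."""
    per_fold = []
    for fold, (train_idx, val_idx) in enumerate(splitter.split(instances)):
        train_set = [instances[i] for i in train_idx]
        val_set = [instances[i] for i in val_idx]
        model = build_model()                      # fresh weights each fold
        trainer = build_trainer(model, train_set, val_set, fold=fold)
        metrics = trainer.train()                  # Trainer.train() returns a metrics dict
        per_fold.append(metrics[f"best_validation_{metric_name}"])
    return {
        "mean": statistics.mean(per_fold),
        "stdev": statistics.stdev(per_fold) if len(per_fold) > 1 else 0.0,
        "per_fold": per_fold,
    }
```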
Yeah, that's actually what I did after writing that message; I thought it wouldn't work well, but it actually does. There are some caveats, which I can discuss once I have something to show.
**Is your feature request related to a problem? Please describe.**
There's been a lot of talk in the community about reproducibility issues, including the effects of maxing over hyperparameter settings and overfitting to the test set.
**Describe the solution you'd like**
There are some recommendations that have come out of the literature and the water cooler: standard hyperparameter search procedures, reporting the mean/stdev of multiple training runs, significance testing with a Bonferroni procedure on random splits, etc. I think adoption of these techniques is usually driven by ease of use, so AllenNLP seems like a perfect place to implement tools for them: significance testing relates to the choice of metric, data splits fit naturally into the trainer abstraction, and hyperparameter optimization is a feature of the training procedure. All of these are extant abstractions in AllenNLP, and any implementation of these features on top of AllenNLP would likely duplicate work. More concretely:
- A `Trainer` (or config options) that constructs random train/dev/test splits and stores metadata to reconstruct the split in the serialization directory.
- A `Trainer` that supports multiple training runs (same model hyperparams, optionally different data splits) and aggregates metrics across runs; perhaps with a new serialization directory structure that contains each individual run as a subdirectory, with aggregated metrics in the superdirectory (matrix directory?).
- Running `allennlp sigtest students-t save/model1_multi save/model2_multi` and immediately getting results, or getting warnings/errors if the underlying datasets/training configs don't match.
- Extending the `Metric` abstraction to (optionally) support canonical aggregation methods and significance tests; for example, if the final metric is the mean of the metric value on examples, it is appropriate to perform a paired Student's t-test (as long as n is large enough), so out-of-the-box support for this can be documented and tested as a one-liner (a sketch follows this list).
- Hyperparameter optimization may still be out of scope for now / more appropriate for an outer layer like Beaker.
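For the `Metric` point above, a minimal sketch of the "one-liner" comparison: a paired Student's t-test over per-example metric values from two models evaluated on the same examples (the score lists are placeholders for whatever a `sigtest`-style command would load from the two serialization directories):

```python
from scipy.stats import ttest_rel

# Placeholder per-example metric values for two models on the *same* examples,
# in the same order; the pairing is what makes a paired t-test appropriate.
model1_scores = [0.90, 0.72, 1.00, 0.61, 0.84, 0.77]
model2_scores = [0.85, 0.70, 0.93, 0.60, 0.80, 0.71]

t_statistic, p_value = ttest_rel(model1_scores, model2_scores)
print(f"paired t-test: t = {t_statistic:.3f}, p = {p_value:.3f}")
```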
**Describe alternatives you've considered**
One alternative is to leave this stuff as an "outer layer" for a separate library. However, the tools I'm describing seem too specific for Beaker and are conceptually closely tied to AllenNLP. Easy interop (e.g., one-line significance testing) would require relying on the AllenNLP workflow anyway.
**Additional context**
Partly inspired by the recent ACL 2019 best paper nominee *We need to talk about standard splits* by Kyle Gorman and Steven Bedrick.