awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

What's the impact of choosing the best model based on `train set` vs `validation set`? #618

Open AaronSpieler opened 4 years ago

AaronSpieler commented 4 years ago

As discussed offline, we should conduct a more thorough study of the performance impact we can expect (on the holdout set) when using the train set error as the basis for choosing the best model, as opposed to choosing it based on a separate validation set.

StatMixedML commented 4 years ago

Thanks @AaronSpieler for opening the discussion!

Given that we have proper regularization of the model to balance the bias-variance tradeoff, and given that the train, validation, and test sets all come from the same distribution, errors should be comparable across the different sets, at least in theory. However, since we know that the iid assumption is seldom met in real-world data sets and that instead all forms of distributional shift (response shift, covariate shift, ...) are usually present in the data, I wouldn't consider the training error to be a good approximation of the validation and test set error.

One way of tackling distributional shift in the form of the response distribution changing over time is to perform distributional forecasting, as GluonTS does, where all distributional parameters are modelled as functions of covariates/time. However, that doesn't solve the problem that the train, validation, and test errors can still be very different.

AaronSpieler commented 4 years ago

Also, I imagine that the iid assumption in particular doesn't hold for time series data, where future data clearly depends on the past, and currently the validation set is always the most recent horizon_length points.

Additionally, I wonder in particular how we know that we have proper regularization? Even if the data samples were iid, the actual training data is limited, and thus continued training will lead to overfitting, right?

But there must be some literature that has reviewed this exact problem for time-series prediction; do you have any recommendations?

StatMixedML commented 4 years ago

Concerning regularization: what is already implemented in GluonTS is learning_rate_decay, drop_out and weight_decay. Not sure how the latter is implemented, but it usually corresponds to L2 regularization via a penalty on the (Frobenius) norm of the weights.
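For reference, a minimal sketch of where these knobs live, assuming a GluonTS version from around the time of this thread (with `Trainer` importable from `gluonts.trainer`); the values are placeholders, not recommendations:

```python
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer

estimator = DeepAREstimator(
    freq="H",
    prediction_length=24,
    dropout_rate=0.1,  # dropout regularization on the RNN cells
    trainer=Trainer(
        epochs=50,
        learning_rate=1e-3,
        learning_rate_decay_factor=0.5,  # learning rate decay on plateau
        weight_decay=1e-8,               # L2 penalty on the weights
    ),
)
```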

In addition to this, one could also think about using different train and evaluation sets (as is done in time-series cross-validation) and averaging across them to prevent over-fitting. Back-casting is also an option, where instead of forecasting we back-cast t0, …, t-1 using a model trained on t, …, T, or use a combination of both. All of these are only suggestions, and I need to consult the literature on approaches that have already been proposed.

The most important thing to consider, however, is how closely the validation and/or test set resemble the distribution of the train set (identically distributed). This is crucial. All cross-validation techniques would break down, and consequently give misleading error measures across train/validation/test sets, if the distribution of the response and/or the covariates differs across the sets. I am not talking about normalization (μ/σ): imagine that we want to predict salary using age as a covariate, where x_age in the train/validation set lies within a range of 15-50 years, whereas the test set only has observations with age > 60 years. Another example would be categorical features in the test set that neither the train nor the validation set has seen. Even if we find a good embedding of the categorical features or treat them as numeric, this poses a challenge for any forecasting model.

To some extent, probabilistic forecasting takes care of a distribution shift in the response. However, covariate shift is left unsolved, even if we train a global model to learn the pattern across all time series. Similarly, structural breaks and regime shifts between different sets pose another difficulty. Batch normalization might reduce the impact of these effects to some extent, but not entirely.
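To make the covariate-shift example concrete, here is a tiny illustrative check with made-up data (the array names and ranges are hypothetical, not taken from any GluonTS dataset):

```python
import numpy as np

# Hypothetical x_age covariate in each split (synthetic data).
rng = np.random.default_rng(0)
age_train = rng.uniform(15, 50, size=1000)
age_test = rng.uniform(60, 80, size=200)

# The supports barely overlap, so a model fit on the train range
# must extrapolate on the test set: a clear case of covariate shift.
print("train range:", age_train.min(), age_train.max())
print("test range: ", age_test.min(), age_test.max())
```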

AaronSpieler commented 4 years ago

Yeah, the question originally arose because I was wondering why we don't have a validation data set to be able to do proper early stopping.

Regarding the iid assumption: as far as I can see, it would be best to split datasets like m4_hourly along the outer axis (i.e. across series rather than along the time axis) into train and validation or holdout sets (and potentially only use the last context_length + prediction_length data points of each series). This way the latter dataset would presumably resemble the distribution of the data we care about even more closely than samples from the train set. This should hopefully take care of the most probable cause of covariate shift: not using relevant (current) data.
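A minimal sketch of such a split across series, assuming the dataset is an iterable of GluonTS-style dictionaries with "target" and "start" fields; the helper name and the 20% validation fraction are my own, purely illustrative choices:

```python
import numpy as np

def split_across_series(dataset, val_fraction=0.2, seed=0):
    """Split a collection of series into train/validation along the outer
    (series) axis rather than along the time axis."""
    entries = list(dataset)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(entries))
    n_val = int(len(entries) * val_fraction)
    validation = [entries[i] for i in idx[:n_val]]
    train = [entries[i] for i in idx[n_val:]]
    # Note: if you additionally truncate each validation series to its last
    # context_length + prediction_length points, the "start" field must be
    # advanced by the number of dropped time steps.
    return train, validation
```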

Regarding structural breaks and regime shifts: I don't know to what extent, but shouldn't a large enough validation set also help to mitigate the impact of these? If we have breaks in the validation set as well as in the holdout set, it's actually a better measure of performance?

All of this of course wouldn't apply to datasets where we only have a single time series, which we could only split along the time axis; however, for the datasets on which we measure model performance, this is not the case.

So overall my hunch is that, in any case, these sources of potential error are much more severe when choosing the best model based on samples from the train set, which is the test set but without the last prediction_length time points.

What do you think @StatMixedML?

StatMixedML commented 4 years ago

So overall my hunch is that, in any case, these sources of potential error are much more severe when choosing the best model based on samples from the train set, which is the test set but without the last prediction_length time points.

Agree. We should never use training data to select a model or its parameters, as this is prone to over-fitting. When doing forecasting, we want a model that generalizes well from the train to the test set. The issues mentioned above shouldn't prevent us from using validation sets for model selection. Are there any plans to add early_stopping to GluonTS any time soon? Also, even though GluonTS models provide competitive forecasts with default parameters, it would be great to have Bayesian Optimization to tune the model parameters. That would enhance the usability and power of GluonTS even more. I have added this as an enhancement here: https://github.com/awslabs/gluon-ts/issues/637#issue-565157580.
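As a rough illustration of how such tuning could be wired around GluonTS today with an external library (Optuna here; this is not a GluonTS feature, and the search ranges, metric, and dataset names train_ds / val_ds are illustrative assumptions):

```python
import optuna
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer

def objective(trial):
    # Sample a few hyperparameters and train on train_ds,
    # using val_ds for the validation loss during training.
    estimator = DeepAREstimator(
        freq="H",
        prediction_length=24,
        dropout_rate=trial.suggest_float("dropout_rate", 0.0, 0.3),
        trainer=Trainer(
            epochs=20,
            learning_rate=trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
            weight_decay=trial.suggest_float("weight_decay", 1e-9, 1e-5, log=True),
        ),
    )
    predictor = estimator.train(training_data=train_ds, validation_data=val_ds)

    # Score the trial on the validation set with the standard evaluation loop.
    forecast_it, ts_it = make_evaluation_predictions(
        dataset=val_ds, predictor=predictor, num_samples=100
    )
    agg_metrics, _ = Evaluator()(ts_it, forecast_it, num_series=len(val_ds))
    return agg_metrics["mean_wQuantileLoss"]

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```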

mharvan commented 4 years ago

I thought that computing the loss on validation data and early stopping based on the validation loss are already implemented in https://github.com/awslabs/gluon-ts/pull/378. Is this correct, or have I misunderstood something?

lostella commented 4 years ago

I thought that computing the loss on validation data and early stopping based on the validation loss are already implemented in #378

It is, but you have to provide the validation data; otherwise the loss will be computed on the training set.

StatMixedML commented 4 years ago

Thanks @lostella for clarifying that early stopping is already implemented (see https://github.com/awslabs/gluon-ts/issues/555#issuecomment-572618033). What does the function call look like, and how do I pass the validation set? It would be great if you'd provide a minimal working example. Thanks!

lostella commented 4 years ago

@StatMixedML just to be clear: I meant to say that providing validation data is supported, but it will be used for learning rate reduction only, as early stopping is actually missing (as the comment you linked says).

Providing validation data to the training loop only requires that you pass a second argument, validation_data, to the Estimator.train method. How to come up with such a dataset may be nontrivial, though: if you have several time series, you can maybe just leave some out of the training data and use them for validation. However, sometimes you also want to split training/validation data along the time axis, and that's currently a bit complicated. For reference, see the discussion in #378.
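A rough sketch of one way to split along the time axis, assuming the entries are dictionaries with a "target" array; it mirrors how GluonTS separates train and test data (train = series minus the last prediction_length points), but the helper itself is illustrative and not an official utility:

```python
import numpy as np

def split_along_time(dataset, prediction_length):
    """For each series, the training copy drops the last prediction_length
    points; the validation copy keeps the full series, so the validation
    loss is computed on windows that can include the held-out tail."""
    train, validation = [], []
    for entry in dataset:
        target = np.asarray(entry["target"])
        train_entry = dict(entry)
        train_entry["target"] = target[:-prediction_length]
        train.append(train_entry)
        validation.append(dict(entry))
    return train, validation
```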

StatMixedML commented 4 years ago

@lostella: ok, I see. So you would use something like estimator.train(training_data=train_ds, validation_data=validation_ds). Thanks.

StatMixedML commented 4 years ago

How to come up with such a dataset may be nontrivial, though: if you have several time series, you can maybe just leave some out of the training data and use them for validation. However, sometimes you also want to split training/validation data along the time axis, and that's currently a bit complicated.

I agree that model selection / parameter tuning in a high-dimensional time series setting is not straightforward and needs some modification. However, what if we re-think the problem: instead of treating the data as time series, we can also think of it as longitudinal data, in the sense that we have repeated observations nested within groups / clusters across time. Groups could be anything from individual time series IDs to time series nested in hierarchies / clusters (product hierarchies, geography, etc.).

K-fold cross-validation can then be extended to leave-one-group/cluster-out cross-validation. Leave-one-group/cluster-out cross-validation is useful if the future prediction task is to predict sales of new articles, or if we are interested in assessing the hierarchical part of the model. The function argument could let the user specify whether to leave out individual time series or entire groups / hierarchies. This is related to what is stated in the DeepAR paper: “By learning from similar items, our method is able to provide forecasts for items with little or no history at all”. By using leave-one-group/cluster-out cross-validation we mimic cold-start problems, so that validation error measures more closely mimic true out-of-sample situations.

One can also think of combining traditional with leave-one-group/cluster-out cross-validation, so that we evaluate the model both on time series it has already seen and on “new”, left-out groups and clusters. For the latter, nested cross-validation might be preferable.
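As an illustration of the idea (using scikit-learn, not a GluonTS feature), leave-one-group-out splits can be generated over the series indices; the choice of group label below, the first static categorical feature, is purely illustrative:

```python
from sklearn.model_selection import LeaveOneGroupOut

series = list(dataset)  # GluonTS-style dicts, one entry per series
# Hypothetical group label per series, e.g. a product hierarchy
# or geographic cluster id stored as the first static category.
groups = [entry["feat_static_cat"][0] for entry in series]

logo = LeaveOneGroupOut()
for train_idx, val_idx in logo.split(X=series, groups=groups):
    train_fold = [series[i] for i in train_idx]
    val_fold = [series[i] for i in val_idx]
    # train on train_fold, evaluate on the entirely held-out group in val_fold
```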

Some initial resources might include:

Let me know your thoughts on that @AaronSpieler, @lostella.

AaronSpieler commented 4 years ago

I have no experience with "leave-one-group/cluster-out cross-validation", but it sure sounds interesting. Again, though, it's probably not so straightforward to define what should constitute a group or cluster; I will have to read up on that too.

StatMixedML commented 4 years ago

What if we do "leave-one-group/cluster-out cross-validation" using FieldName.FEAT_STATIC_CAT? Categorical features would provide a good way to perform group cross-validation. If the user specifies the field, we can leave some of the combinations out. If no such field is provided, we could use FieldName.ITEM_ID to sample some individual series.
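A possible sketch of that, assuming each entry carries FieldName.FEAT_STATIC_CAT and that the first static category is the grouping we want to hold out; the helper is hypothetical, not an existing GluonTS function:

```python
from gluonts.dataset.field_names import FieldName

def hold_out_categories(dataset, held_out, cat_index=0):
    """Split entries into train/validation by one of their static
    categorical features; `held_out` is a set of category values."""
    train, validation = [], []
    for entry in dataset:
        category = entry[FieldName.FEAT_STATIC_CAT][cat_index]
        (validation if category in held_out else train).append(entry)
    return train, validation
```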

seriousssam commented 4 years ago

I am not sure how these 2 features interact:

My question is: given the way things are now, is it possible for "best" to be defined based on validation loss? Or is it always defined as the network with the best training loss? I need it to be based on validation loss for my project and I'd be happy to dig deeper into it and submit a PR if it's needed.

Apologies if I'm using the wrong terminology; I'm new to this repo, but I'm finding it really useful!

Edit: I'm kind of answering my own question, but maybe it'll be useful to someone in the future. I did some more digging, and my understanding is now that the 'best' net is indeed based on validation loss (?). I'm saying this based on the code in "gluon-ts/src/gluonts/trainer/_base.py" at line 304 at the moment:

```python
epoch_loss = loop(epoch_no, train_iter)
if is_validation_available:
    epoch_loss = loop(
        epoch_no, validation_iter, is_training=False
    )

should_continue = lr_scheduler.step(loss_value(epoch_loss))
if not should_continue:
    logger.info("Stopping training")
    break

# save model and epoch info
bp = base_path()
epoch_info = {
    "params_path": f"{bp}-0000.params",
    "epoch_no": epoch_no,
    "score": loss_value(epoch_loss),
}
```

Based on this, it should be the case that if a validation set is defined, the "score", i.e. the loss that is saved, is the validation loss. When I look at "gluon-ts/src/gluonts/trainer/model_averaging.py" I see that that's the metric that's used to find the best model(s). Do I have it right?