USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems

Changing offsets and sequence lengths lead to reasonable predictions but very odd training logs #152

Closed SimonTopp closed 2 years ago

SimonTopp commented 2 years ago

Not exactly an issue, but something to be aware of, and I'm curious to get other people's opinions on it. While hypertuning sequence length and offset for the RGCN, I found that although various sequence length/offset combinations lead to reasonable predictions, the training logs for any combination other than a sequence length of 365 and an offset of 1 are very wonky, almost as if the model begins overfitting on the training data right from the start. For all the model runs below, the only thing changing between runs is the sequence length and offset of our train, test, and val partitions, and I didn't see any obvious errors in the data prep or training pipelines that would account for the weird training logs shown in the final figure here.

First, each cell in the heatmaps below represents a different model run with a unique sequence length and offset combination. While some combinations appear to outperform others, there are no super suspect numbers. image

Similarly, when we just plot out the validation predictions, they seem pretty reasonable. image

But! When we look at our training logs, we see that our validation loss throughout training is very erratic for all combinations except sequence length 365/offset 1 and sequence length 180/offset 1 (note that here I was using an early stopping patience of 20). image

One thing to consider is that the offsets and shorter sequence lengths lead to a lot more samples for each epoch, so there are subsequently more batches/more updates in each epoch. Maybe this is just allowing the model to converge much more quickly? Or maybe there's something about the RGCN that depends on sequences starting on the same day of the year, but if that was the case, I wouldn't expect the high test/val metrics or reasonable predictions. I'm curious to get your thoughts @janetrbarclay, @jdiaz4302, @jsadler2.
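As a rough sketch of that sample-count effect, assuming a simple sliding window where the offset is expressed as a fraction of the sequence length (this is an illustration, not the repo's actual prep code):

```python
# Rough sketch (not the repo's prep code) of how sequence length and offset
# change the number of training sequences cut from one reach's record.

def count_sequences(n_days, seq_len, offset_frac):
    """Slide a seq_len-day window with a stride of seq_len * offset_frac days."""
    stride = max(1, int(seq_len * offset_frac))
    return len(range(0, n_days - seq_len + 1, stride))

n_days = 365 * 10  # ~10 years of daily data for a single reach
print(count_sequences(n_days, seq_len=365, offset_frac=1.0))   # 10 sequences
print(count_sequences(n_days, seq_len=60, offset_frac=0.25))   # 240 sequences -> many more updates per epoch
```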

SimonTopp commented 2 years ago

Quick update: when we restrict the training to only 4 years (meaning fewer batches/updates each epoch), we start to see training logs more like we'd expect with regard to validation loss. The model still appears to begin overfitting on the training data fairly quickly, which is probably worth keeping in mind for future applications of the RGCN.

image

jsadler2 commented 2 years ago

image

It is interesting that it's totally missing the summers. Would it be difficult to plot the same thing for the longer sequence lengths?

jdiaz4302 commented 2 years ago

Some thoughts for looking into this -

(Last two points kinda fight with each other because restore_best_weights may say that the best validation set RMSE was at epoch 0, but the forecasts could look better despite the training log painting a poor picture?)
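For reference, the early stopping options mentioned above map to standard Keras callback arguments roughly like this (a sketch, not the repo's actual training code):

```python
# Sketch of the Keras early-stopping options mentioned above
# (not the repo's actual training code).
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,                  # less forgiving than the patience of 20 used here
    restore_best_weights=True,   # roll back to the weights from the best validation epoch
)

# model.fit(x_trn, y_trn, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```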

SimonTopp commented 2 years ago

Thanks for chiming in @jdiaz4302! It sounds like overall you're suggesting re-running the grid search on the full training data but with more measures in place to avoid overfitting? In response to:

It seems that the default dropout values for the repo are to set regular and recurrent dropout to 0.

and

you may want to be less forgiving with the early stopping patience or use restore_best_weights=True

I'm using 0.3 for regular dropout and 0 for recurrent, but I'll redo a run with both set to 0.5 to see what happens. Also, these results are using the best_val_weights (not final weights) and with early stopping at 20.
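For context, in a stock Keras recurrent layer the two dropout knobs being discussed are separate arguments; the RGCN uses a custom recurrent cell, so this is only an analogy for which knob is which, not the repo's model code:

```python
# Analogy in a stock Keras layer (the RGCN uses a custom recurrent cell, so
# this only shows which knob is which, not the repo's model code).
import tensorflow as tf

lstm = tf.keras.layers.LSTM(
    units=20,
    dropout=0.3,            # "regular" dropout, applied to the layer inputs
    recurrent_dropout=0.0,  # dropout on the recurrent (hidden-state) connections
    return_sequences=True,
)
```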

image

image

These are predictions using the best validation weights from the 60_1 run, so the model had 6 or 7 epochs of training, which the log shows performs better than the initialization weights. I haven't looked at what the predictions from the final weights are, but I'm guessing they'd be much worse. Would definitely be interesting to look into.

jdiaz4302 commented 2 years ago

Thanks for those plots and the clarifications on your run specs.

I find the training logs with the full y range a lot more comforting. I didn't have any range in mind for the first plots, so they looked pretty wild and chaotic as a first impression. But given the full y range, the fact that these are essentially first passes in new conditions, and the "very odd training logs" framing of this issue, I feel like even the worst among them are pretty reasonable, showing normal-to-moderate levels of imperfection (with the moderate levels mostly limited to the offset = 0.25 column and, to a lesser extent, the offset = 0.5 column with the full training period).

It sounds like overall you're suggesting re-running the grid search on the full training data but with more measures in place to avoid overfitting?

Yeah, it sounded like you suspected overfitting, so setting regular and recurrent dropout to 0.5 should be a very strong measure to combat that (not necessarily the best for performance, but it would be difficult to overfit under those conditions). This may drop the moderate levels of imperfection to normal for those worse training logs.

aappling-usgs commented 2 years ago

Why do you suppose the val RMSEs are almost always lower than the train RMSEs, even for 0 epochs?

SimonTopp commented 2 years ago

Why do you suppose the val RMSEs are almost always lower than the train RMSEs, even for 0 epochs?

@aappling-usgs, I believe the epoch loss is defined as the average of the batch losses over the epoch. Therefore, in epoch 0 (epoch 1 in reality), the training RMSEs will include some really high values from the first few batches with randomly initialized weights, which pull the average up. The val loss, on the other hand, is only calculated after the first full epoch of training, so it won't have those super high initial values artificially inflating it.
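A toy illustration of that averaging effect (the numbers here are made up):

```python
# Toy numbers: the reported "epoch 0" training loss is an average over all
# batches in that epoch, so the very high losses from the first few
# randomly-initialized batches inflate it; the validation loss is computed
# once, only after the epoch finishes.
import numpy as np

batch_rmse = np.array([12.0, 8.0, 5.0, 3.5, 3.0, 2.8, 2.7, 2.6])  # batches within epoch 0
epoch0_train_rmse = batch_rmse.mean()  # ~5.0, pulled up by the first batches
val_rmse_after_epoch0 = 2.9            # evaluated with the weights after the last batch
print(epoch0_train_rmse, val_rmse_after_epoch0)
```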

jsadler2 commented 2 years ago

@SimonTopp - not sure if you saw my comment above (@jdiaz4302 and I posted at the same time). I'm wondering if for other sequence lengths the model predictions are similarly poor in the summers.

SimonTopp commented 2 years ago

@jsadler2 Definitely looks like with the shorter sequences we do worse during the summer months. While they never really get there, they do get closer to matching the obs as you move from 60 to 365. Also, in plotting these up I realized that our keep_frac for the predictions is acting a little wonky for some of the combinations. This should only be impacting the final predictions and not those in the training log (in other words, I don't think the predictions missing in the last figure are missing during train/val, just in the output from predict() after training is finished). I'll look into it.

image

image

image

And the bad keep fraction example image

SimonTopp commented 2 years ago

It's also worth noting that the poor summer predictions vary from reach to reach; below are two additional examples. image image

jsadler2 commented 2 years ago

Thanks for making and posting those, @SimonTopp. It's very interesting that the longer sequence lengths do better in the summer. ... I can't think of why that might be 🤔

janetrbarclay commented 2 years ago

It seems reasonable to me that having a full year of data could help with the summer temp patterns since the model is then seeing the full annual temporal pattern rather than just a snippet.

It also looks like the summer observations for some of those reaches are a little unexpected, either plateauing or decreasing where I'd expect increases.

SimonTopp commented 2 years ago

So I believe I've gotten to the bottom of this. The baselines above were all done with dropout set to 0.3 and recurrent dropout set to 0. I did some simple tuning of the dropout space using 180/0.5 (sequence/offset) runs as a test. Below, rows are dropout and columns are recurrent dropout. We can see that while recurrent dropout improves the train/val relationship, regular dropout significantly reduces validation performance, leading to the training logs shared earlier.

image
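The grid above can be reproduced conceptually with a simple loop over the two dropout settings. Here `train_and_evaluate` is a hypothetical stand-in for the actual RGCN training call, not anything from the repo:

```python
# Hypothetical sketch of the dropout grid above; train_and_evaluate is a
# stand-in for training the RGCN with those settings and returning the
# validation RMSE (here it just returns a dummy number so the loop runs).
import itertools
import random

def train_and_evaluate(dropout, recurrent_dropout):
    return random.uniform(1.5, 2.5)  # placeholder for the real training call

dropout_vals = [0.0, 0.3, 0.5]
recurrent_dropout_vals = [0.0, 0.3, 0.5]

results = {
    (do, rdo): train_and_evaluate(do, rdo)
    for do, rdo in itertools.product(dropout_vals, recurrent_dropout_vals)
}
best = min(results, key=results.get)
print("best (dropout, recurrent_dropout):", best)
```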

When we then redo the hypertuning of the sequence length and offset with recurrent dropout set to 0.3 and regular dropout set to zero, we see significant improvement in all of our metrics, with our best validation RMSE going from 2.13 to 1.68. We also see our best run move from 365/1 (sequence/offset) to 60/0.25.

image

Finally, with the new dropout setting, we see overall much better agreement with the summer months.

image

There are still a few questions. Specifically:

I think this issue is largely solved, but I'll leave it open until the end of the day in case anyone has some follow up thoughts.