USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems
Creative Commons Zero v1.0 Universal

Somewhat significant update #141

Closed SimonTopp closed 2 years ago

SimonTopp commented 2 years ago

This PR addresses most of the issues mentioned in #135. Specifically:

There's kind of a lot here, so feel free to provide feedback and suggestions. @jsadler2, @jzwart, @janetrbarclay, if any of you have time to review that'd be awesome! Also, my Google Drive just updated and it changed my paths/confused git a little. The files below that say 'empty file' have not been changed.

jdiaz4302 commented 2 years ago

Cool update! I had some thoughts and questions that are about the ML approach.

It changes the default evaluation from the val partition (which is now used in training) to the test partition.

Clarification on the existing workflow (prior to this PR): the test set isn't used for anything, right? (It's being saved for a final evaluation of some sort to make sure that you haven't overfit hyperparameters (e.g., model choice) to the validation set.)

I think it would be preferable to still evaluate on the validation set because the validation set is intended to help choose hyperparameters, and the number of epochs (as determined by early stopping) is a hyperparameter. Also, if you don't continue to evaluate on the validation set, you may be using years of data just to tune one knob (the number of fine-tune epochs). Likewise, the test set becomes a second validation set, so if you want a final assessment of performance you'll need even more data or smaller partitions.
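Just to make the early-stopping point concrete, here's a minimal toy sketch (made-up data and a plain linear model, nothing to do with the actual river-dl code): the training partition fits the weights, and the validation partition is what picks the stopping epoch, so the val partition has already been "used" for a hyperparameter by the time you evaluate on it.

```python
import numpy as np

# Toy data: 5 features, a linear signal, and a train/val split (all made up).
rng = np.random.default_rng(0)
x_trn, x_val = rng.normal(size=(200, 5)), rng.normal(size=(50, 5))
w_true = rng.normal(size=5)
y_trn = x_trn @ w_true + rng.normal(scale=0.5, size=200)
y_val = x_val @ w_true + rng.normal(scale=0.5, size=50)

w = np.zeros(5)
best_rmse, best_w, best_epoch = np.inf, w, 0
patience, since_best = 20, 0
for epoch in range(1, 1001):
    # Training data fits the weights (the "parameters").
    grad = x_trn.T @ (x_trn @ w - y_trn) / len(y_trn)
    w = w - 0.01 * grad
    # Validation data picks the stopping epoch (a hyperparameter).
    val_rmse = np.sqrt(np.mean((x_val @ w - y_val) ** 2))
    if val_rmse < best_rmse:
        best_rmse, best_w, best_epoch, since_best = val_rmse, w, epoch, 0
    else:
        since_best += 1
        if since_best >= patience:
            break

print(f"early stopping chose epoch {best_epoch} using validation RMSE {best_rmse:.3f}")
```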

It updates the data prep pipeline to include all data partitions in the pre-training as the default.

Are the pre-training targets for the validation and testing partitions informed by real targets during this time? i.e., have the PB models seen any of this data and been optimized with it, or are they more theoretical or derived from different/earlier data? I have sort of a gut reaction against using the same quantity (e.g., PB temperature) from unseen dates for pretraining when we will later evaluate that quantity (e.g., true temperature) on those dates as "out-of-sample", but I don't think that's a problem if the PB output can be thought of as simply simulations for that time using equations that were not fit to that time.

SimonTopp commented 2 years ago

Thanks Jeremy! To clarify:

I think it would be preferable to still evaluate on the validation set because the validation set is intended to help choose hyperparameters, and the number of epochs (as determined by early stopping) is a hyperparameter.

Here, the validation is used to tune the number of epochs during training. Then, when we're calculating the final metrics after all training and hypertuning, I changed the default partition from the val partition to the test. Our final test of model performance used to be on the val partition, but since we're now training to that in some ways, the new final test is on the test partition. Does that make sense?

but I don't think that's a problem if the PB can be thought of as simply simulations for that time using equations that were not fit to that time.

This change, showing the pretraining step the entire time period of the PB outputs, is meant to help the model generalize to 'out-of-sample' conditions. I think it's safe to think of the PB outputs/pretraining here as somewhat independent from our val/test partitions, even though it's giving the model a sneak peek of those years and/or reaches that are being held out.
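Roughly, the prep logic amounts to something like this sketch (column names, dates, and variables here are made up for illustration, not the actual river-dl prep code): the PB output covers the full record, while observed targets outside the training years get masked before finetuning.

```python
import numpy as np
import pandas as pd

# Hypothetical daily record spanning every partition.
dates = pd.date_range("2000-01-01", "2020-12-31", freq="D")
df = pd.DataFrame({
    "date": dates,
    "pb_temp": np.random.rand(len(dates)),    # process-based model output
    "obs_temp": np.random.rand(len(dates)),   # observations
})

trn_end = pd.Timestamp("2014-12-31")          # made-up end of the training years

# Pretraining targets: PB output for the full record, including val/test years.
y_pretrain = df["pb_temp"].to_numpy()

# Finetuning targets: observations, with val/test years masked out
# (NaNs get ignored in the loss).
y_finetune = df["obs_temp"].where(df["date"] <= trn_end).to_numpy()
```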

jdiaz4302 commented 2 years ago

Here, the validation is used to tune the number of epochs during training

This is what the validation set would/can be used for normally (i.e., setting a hyperparameter, while the training set is used for setting parameters/weights/coefficients), so I don't think using it for early stopping means that you can't use it for between-model evaluation (here, models A and B stopping at training epoch 16 versus 17, respectively). For example, in a generic scenario, you train the same model with 100 epochs, 200 epochs, and 300 epochs, then the validation set would be used to say "ok, 200 epochs is the way to go based on validation set RMSE." Using the validation set for early stopping is just more of a dynamic grid search-like approach for determining the number of epochs (to combat overfitting).

Our final test of model performance used to be on the val partition

We might be using the word 'final' differently. I generally see the validation partition as a way for you (e.g.) to compare the RGCN with 200 pretraining epochs versus the RGCN with 0 pretraining epochs (i.e., compare hyperparameter settings). So, the RMSEs generated on the validation set will be used to compare those different hyperparameters (here, training conditions), then once you have picked the best model (as determined by the validation set), the test set is used to assess performance on data that has not been used in your model construction decision making.

So now (with this PR), if you were to experiment with an RGCN with 200 pretraining epochs versus an RGCN with 0 pretraining epochs (or let's say you added a new variable), the test set would be used to compare those two, and all of your data would have been used in your decision making, leaving no data for a final (more objective) assessment of performance. Is this type of experimentation being done, or is it a goal (e.g., with other hyperparameters: different model architectures, loss functions, variable sets)?

I think it's safe to think of the PB outputs/pretraining here as somewhat independent from our val/test partitions even though it's giving the model a sneak peak of those years and/or reaches that are being held out.

Cool, my only concern was whether the PB outputs used for pretraining were themselves fit to the real targets, which would represent a sort of data leak. Like if I had a data set of Y1 from 2000-2020 and made predictions for it (Y2), then tried to predict Y1 with a different approach using 2015-2020 as out-of-sample, while using Y2 from 2000-2020 as pretraining.

Hope that makes sense and sorry for the book! 😵

SimonTopp commented 2 years ago

Haha, no worries about the book! I think we're on the same page now. I've seen the test holdout go both ways, but I do see the logic in what you're suggesting (leaving it until the model has been completely finalized). In the Snakemake pipeline it's pretty easy to calculate the error metrics for either or both. What do you think: calculate both, or not even calculate the test metrics because we're humans and can't help but look at those test numbers during hypertuning/development?
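For example, something along these lines in the Snakefile (rule names, file paths, and the helper script are hypothetical here, just to show the toggle, not our actual rules):

```
# Request only val metrics during development; add "tst" at paper-writing time.
PARTITIONS = ["val"]          # switch to ["val", "tst"] for final reporting

rule all:
    input:
        expand("output/metrics_{partition}.csv", partition=PARTITIONS)

rule calc_metrics:
    input:
        preds="output/preds_{partition}.feather",
        obs="data/obs_temps.csv"
    output:
        "output/metrics_{partition}.csv"
    script:
        "scripts/calc_metrics.py"
```

Flipping PARTITIONS to include "tst" only at the very end would give us the locked-away final numbers without touching them during development.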

jdiaz4302 commented 2 years ago

I prefer calculating and looking at error metrics on the validation set and locking away the test set until it's paper-writing time to help it be as out-of-sample as possible (i.e., not looking at its performance and that affecting your decision making). Surprises can happen with that approach, but I think that's part of the scientific discovery and narrative.

Also, agreed that I've seen it both ways; I prefer this way because (informally) I've heard teachers and mentors express skepticism about not having a "true" test set / (over)fitting to the set you are using for evaluation.

jordansread commented 2 years ago

locking away the test set until it's paper-writing time to help it be as out-of-sample as possible

Wise.

janetrbarclay commented 2 years ago

If I'm reading this correctly, the default is for the early stopping to only be on the fine tuning with the base options (not the pretraining and not with the gw loss option). Is there a reason for this?

I've made a bunch of minor comments throughout the code, but haven't quite finished. I'll look more tomorrow morning.

jsadler2 commented 2 years ago

I'm starting to review this. @SimonTopp - do you (or I guess anyone else) know why these are showing up like this in the "Files changed" tab? I've never seen that before. [screenshot]

It happens with 17 files!

SimonTopp commented 2 years ago

@jsadler2 I mention it in the PR description, but basically my Google Drive updated in between two commits and I think it changed the base filepaths, so when I pushed, even though the files hadn't changed, git thought they were different. I didn't notice it until I actually submitted the PR. When you look at the diffs, though, they show that the files are identical to the existing files in river-dl. If you find it worrisome I can roll back those specific files, just let me know.

jsadler2 commented 2 years ago

If you find it worrisome I can rollback those specific files, just let me know.

Oh! Sorry I didn't read that part in the description at the end. I'm not worried about it. Just glad to know that there is a reason and it's not some mysterious git thing that we would be dealing with on the reg.

SimonTopp commented 2 years ago

If I'm reading this correctly, the default is for the early stopping to only be on the fine tuning with the base options (not the pretraining and not with the gw loss option). Is there a reason for this?

@janetrbarclay, the default is to use early stopping with the finetuning (either base options or gw), but not for pretraining, because we don't have a validation partition there that can benchmark model performance each epoch.
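Schematically, the training step looks something like this (a toy Keras sketch with random data and made-up shapes/epoch counts, not the actual river-dl code): pretraining runs a fixed number of epochs against the PB output, and only the finetuning fit gets an early-stopping callback tied to the val partition.

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for sequence inputs, PB pretraining targets, and observations.
x_trn, y_pb_trn = np.random.rand(64, 365, 4), np.random.rand(64, 365, 1)
y_obs_trn = np.random.rand(64, 365, 1)
x_val, y_obs_val = np.random.rand(16, 365, 4), np.random.rand(16, 365, 1)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(20, return_sequences=True),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Pretraining on PB outputs: no observed val partition to benchmark against,
# so just run a fixed number of epochs.
model.fit(x_trn, y_pb_trn, epochs=50, verbose=0)

# Finetuning on observations: monitor val loss, stop early, keep the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)
model.fit(
    x_trn, y_obs_trn,
    validation_data=(x_val, y_obs_val),
    epochs=200,
    callbacks=[early_stop],
    verbose=0,
)
```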

jsadler2 commented 2 years ago

Looks 👌 to me, @SimonTopp