This PR splits off the validation data at the same point in the workflow, and in the same manner, as the testing data. The data are split by lake rather than by sequence, so the lakes used for validation are always different from the lakes used for training (and from the lakes used for testing). That way, validation metrics should better represent the model's performance on new lakes that aren't in the training set, and early stopping should be more effective at avoiding overfitting to the training lakes.
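To illustrate the idea of splitting by lake rather than by sequence, here is a minimal sketch. It is not the code in training_data.py: the function names, the fractions, and the seed are all hypothetical, and it assumes each sequence comes with the site ID of the lake it was drawn from.

```python
import numpy as np


def split_site_ids(site_ids, valid_fraction=0.2, test_fraction=0.2, seed=0):
    """Partition the unique lake IDs into disjoint train/valid/test groups.

    Hypothetical helper: the fractions and seed are illustrative defaults.
    """
    unique_ids = np.unique(site_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_ids)
    n_test = int(round(test_fraction * len(shuffled)))
    n_valid = int(round(valid_fraction * len(shuffled)))
    test_ids = shuffled[:n_test]
    valid_ids = shuffled[n_test:n_test + n_valid]
    train_ids = shuffled[n_test + n_valid:]
    return train_ids, valid_ids, test_ids


def split_sequences_by_lake(sequences, site_ids, **kwargs):
    """Assign every sequence to the split that its lake belongs to."""
    site_ids = np.asarray(site_ids)
    train_ids, valid_ids, test_ids = split_site_ids(site_ids, **kwargs)
    return {
        "train": sequences[np.isin(site_ids, train_ids)],
        "valid": sequences[np.isin(site_ids, valid_ids)],
        "test": sequences[np.isin(site_ids, test_ids)],
    }
```

Because the lake IDs are partitioned first and the sequences only follow their lake, all sequences from one lake land in the same split. Splitting by sequence would not guarantee that.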

How to run the code

The rule create_training_data creates training, validation, and testing sets. So, to run this part of the pipeline:

snakemake -c1 2_process/out/model_prep/train.npz

That should make train.npz, valid.npz, and test.npz.
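One quick way to confirm the outputs were written is to list the arrays stored in each file. This is a small sketch using NumPy only; summarize_npz is a hypothetical helper, and it makes no assumption about which array names the files contain.

```python
import numpy as np


def summarize_npz(path):
    """Return {array name: shape} for every array stored in an .npz file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}
```

For example, calling summarize_npz("2_process/out/model_prep/train.npz") after the snakemake command above should return the names and shapes of the training arrays.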

How to review this PR

Level of review requested

The code seems to work - no lake site IDs appear in more than one split.
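For reviewers who want to repeat that check, a sketch of it follows. It assumes you already have the lake site IDs for each split in hand (how to obtain them is not shown here), and find_shared_site_ids is a hypothetical name, not a function in this repo.

```python
from itertools import combinations


def find_shared_site_ids(split_site_ids):
    """Return {(split_a, split_b): shared IDs} for each pair of splits that overlap.

    split_site_ids maps a split name to an iterable of lake site IDs.
    An empty result means no lake appears in more than one split.
    """
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(split_site_ids.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps
```

An empty dictionary is the expected result for the splits produced by this PR.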

The main things I'd like reviewed are:

Do the pipeline additions and modifications make sense?
Have any errors been introduced into the pipeline?

Where in the code to focus

Anything that's been changed is fair game, but the majority of the changes are in training_data.py.

Issues that will be addressed in upcoming PRs (so don't worry about them yet)

Save more metadata alongside train.npz, valid.npz, and test.npz for use during model evaluation
Change directory structure to be more nested

DOI-USGS / lake-temperature-lstm-static

Split validation set during 2_process #37
