This PR splits off the validation data at the same point in the workflow, and in the same manner, as the testing data. The data are split by lake rather than by sequence. Splitting by lake means that the lakes used for validation are always different from the lakes used for training (and from the lakes used for testing). That way, validation metrics better represent the model's performance on new lakes that aren't in the training set, so early stopping should be more effective at avoiding overfitting to the training data.
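For reviewers who want a concrete picture of a by-lake split, here is a minimal sketch. It is not the code in training_data.py: the function names, split fractions, and seed are hypothetical, and it assumes each sequence carries a lake site ID in a NumPy array.

```python
import numpy as np


def split_lakes(site_ids, valid_frac=0.2, test_frac=0.2, seed=0):
    """Assign each unique lake to exactly one of train/valid/test.

    Hypothetical helper for illustration; fractions and seed are assumptions.
    """
    rng = np.random.default_rng(seed)
    # Shuffle the unique lakes, not the sequences, so a lake never straddles splits
    lakes = rng.permutation(np.unique(site_ids))
    n_test = int(round(test_frac * len(lakes)))
    n_valid = int(round(valid_frac * len(lakes)))
    test_lakes = lakes[:n_test]
    valid_lakes = lakes[n_test:n_test + n_valid]
    train_lakes = lakes[n_test + n_valid:]
    return train_lakes, valid_lakes, test_lakes


def sequences_for(site_ids, lakes):
    """Boolean mask selecting the sequences that belong to the given lakes."""
    return np.isin(site_ids, lakes)
```

Because the shuffle happens over unique lakes, every sequence from a given lake lands in the same split, which is the property this PR relies on.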
How to run the code
The rule create_training_data creates training, validation, and testing sets. So, to run this part of the pipeline:
snakemake -c1 2_process/out/model_prep/train.npz
That should make train.npz, valid.npz, and test.npz.
How to review this PR
Level of review requested
The code seems to work - no lake site IDs appear in more than one split.
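A check along these lines can confirm that no lake is shared between splits. This is only a sketch: the helper name is hypothetical, and it assumes you have already pulled the lake site IDs for each split into plain sequences (how they are stored in the .npz files is not shown here).

```python
from itertools import combinations


def overlapping_lakes(splits):
    """Return {(name_a, name_b): shared_ids} for each pair of splits sharing lake IDs.

    `splits` maps a split name to an iterable of lake site IDs (assumed layout).
    An empty result means the splits are disjoint.
    """
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps
```

An empty dictionary from overlapping_lakes means every lake appears in at most one of the splits.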
The main things I'd like reviewed are:
Do the pipeline additions and modifications make sense?
Have any errors been introduced into the pipeline?
Where in the code to focus
Anything that's been changed is fair game, but the majority of the changes are in training_data.py.
Issues that will be addressed in upcoming PRs (so don't worry about them yet)
Save more metadata alongside train.npz, valid.npz, and test.npz for use during model evaluation