This PR ensures that metadata are saved alongside the sequences in the training, validation, and testing sets.
Previously, when data were organized into train.npz, valid.npz, and test.npz as equal-length sequences, information about WHEN those sequences start or WHICH LAKE they belong to was not saved, because that information isn't relevant to training the neural network. However, that information is relevant to evaluating the neural network. Therefore, in this PR those metadata are added to the npz files. That way, test.npz can be used to examine model performance during different seasons and in specific lakes or sets of lakes.
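As a rough illustration of the kind of evaluation this enables, the sketch below groups per-sequence errors by lake and by season. Only the names `site_ids` and `start_dates` come from this PR; the `sequence_error` array, the lake names, and all values are hypothetical stand-ins for whatever the evaluation code actually computes.

```python
import numpy as np

# Metadata arrays, one entry per sequence (values here are made up)
site_ids = np.array(["lake_a", "lake_a", "lake_b", "lake_b"])
start_dates = np.array(
    ["2019-01-15", "2019-07-15", "2019-01-15", "2019-07-15"],
    dtype="datetime64[D]",
)

# Hypothetical per-sequence error from some evaluation step
sequence_error = np.array([1.0, 2.0, 3.0, 4.0])

# Mean error for each lake, selected with a boolean mask on site_ids
error_by_lake = {
    str(lake): float(sequence_error[site_ids == lake].mean())
    for lake in np.unique(site_ids)
}

# Calendar month (1-12) of each sequence's start date
months = start_dates.astype("datetime64[M]").astype(int) % 12 + 1

# Mean error for sequences that start in June, July, or August
is_summer = np.isin(months, [6, 7, 8])
summer_error = float(sequence_error[is_summer].mean())
```

Because the metadata are aligned with the sequences by index, the same masks can be applied to any other per-sequence array.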
The metadata are saved in each of the three npz files as additional arrays whose length equals the number of sequences. That is, there is one start date and one lake ID per sequence.
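A minimal sketch of that layout, assuming the sequences are stored under a key called `sequences` (a hypothetical name; the real keys and shapes in this repository may differ). Only `start_dates` and `site_ids` are names taken from this PR. The archive is written to an in-memory buffer so the example does not touch the pipeline outputs.

```python
import io

import numpy as np

n_sequences, sequence_length, n_features = 4, 5, 3

# Hypothetical sequence array: (sequence, time step, feature)
sequences = np.zeros((n_sequences, sequence_length, n_features), dtype=np.float32)

# One start date and one lake ID per sequence
start_dates = np.array(
    ["2019-01-01", "2019-06-01", "2020-01-01", "2020-06-01"],
    dtype="datetime64[D]",
)
site_ids = np.array(["lake_a", "lake_a", "lake_b", "lake_b"])

# Save everything into a single npz archive (in memory for this sketch)
buffer = io.BytesIO()
np.savez(buffer, sequences=sequences, start_dates=start_dates, site_ids=site_ids)

# Reload and confirm that the metadata line up with the sequences
buffer.seek(0)
loaded = np.load(buffer)
loaded_keys = sorted(loaded.files)
n_loaded_sequences = loaded["sequences"].shape[0]
```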
How to run the code
The rule create_training_data creates training, validation, and testing sets. So, to run this part of the pipeline:
snakemake -c1 2_process/out/model_prep/train.npz
That should make train.npz, valid.npz, and test.npz, each now containing the additional arrays start_dates and site_ids.
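One way to spot-check the output after the rule finishes is sketched below. The helper `check_metadata` is hypothetical and not part of this PR; it only assumes that the npz file contains arrays named `start_dates` and `site_ids`.

```python
import numpy as np


def check_metadata(npz_path, metadata_keys=("start_dates", "site_ids")):
    """Return the shared length of the metadata arrays in an npz file.

    Raises KeyError if a metadata array is missing and ValueError if the
    metadata arrays do not all have the same length.
    """
    with np.load(npz_path) as data:
        missing = [key for key in metadata_keys if key not in data.files]
        if missing:
            raise KeyError(f"Missing metadata arrays: {missing}")
        lengths = {key: data[key].shape[0] for key in metadata_keys}

    if len(set(lengths.values())) != 1:
        raise ValueError(f"Metadata arrays differ in length: {lengths}")
    return next(iter(lengths.values()))
```

For example, `check_metadata("2_process/out/model_prep/train.npz")` should return the number of sequences in the training set; the same check applies to valid.npz and test.npz.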
How to review this PR
Level of review requested
The code seems to work - data appear in the npzs where expected.
The main things I'd like reviewed are:
Do the pipeline additions and modifications make sense?
Have any errors been introduced?
Where in the code to focus
Anything that's been changed is fair game, but the majority of the changes are in lake_sequences.py and training_data.py.
Issues that will be addressed in upcoming PRs (so don't worry about them yet)
Changes to 3_train functions that accommodate these updates