This PR ensures that metadata are saved alongside the sequences in the training, validation, and testing sets.

Previously, when data were organized into train.npz, valid.npz, and test.npz as equal-length sequences, information about WHEN those sequences start or WHICH LAKE they belong to was not saved, because that information isn't relevant to training the neural network. However, that information is relevant to evaluating the neural network. Therefore, in this PR those metadata are added to the npz files. That way, test.npz can be used to examine model performance during different seasons and in specific lakes or sets of lakes.

The metadata are saved in the three npz files as additional arrays with the same length as the number of sequences. That is, there is one start date and one lake ID per sequence.
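As a rough illustration of how the per-sequence metadata could be used for evaluation, here is a minimal sketch with synthetic data. The array names start_dates and site_ids come from this PR; the key name "data", the array shapes, the lake IDs, and the date format are assumptions made for the example and may not match what the pipeline writes.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for a test set: 4 sequences, 5 time steps, 3 features.
# "data" is a placeholder key; "start_dates" and "site_ids" follow the
# array names mentioned in this PR.
n_sequences = 4
data = np.arange(n_sequences * 5 * 3, dtype=float).reshape(n_sequences, 5, 3)
start_dates = np.array(
    ["2015-01-10", "2015-07-01", "2016-07-15", "2016-12-01"],
    dtype="datetime64[D]",
)
site_ids = np.array(["lake_A", "lake_B", "lake_A", "lake_B"])  # made-up IDs

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.npz")
    np.savez(path, data=data, start_dates=start_dates, site_ids=site_ids)

    # Read everything back while the temporary file still exists.
    with np.load(path) as npz:
        loaded_data = npz["data"]
        loaded_dates = npz["start_dates"]
        loaded_sites = npz["site_ids"]

# Month number (1-12) of each sequence's start date.
months = loaded_dates.astype("datetime64[M]").astype(int) % 12 + 1

# Because there is one metadata entry per sequence, boolean masks built
# from the metadata can index the first axis of the sequence array.
is_summer = np.isin(months, [6, 7, 8])
is_lake_a = loaded_sites == "lake_A"
selected = loaded_data[is_summer & is_lake_a]

print(selected.shape)
```

The same masking approach works for any subset of lakes or date range, as long as the metadata arrays stay aligned with the first axis of the sequence array.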

How to run the code

The rule create_training_data creates training, validation, and testing sets. So, to run this part of the pipeline:

snakemake -c1 2_process/out/model_prep/train.npz

That should make train.npz, valid.npz, and test.npz, each now containing the additional arrays start_dates and site_ids.
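One way to confirm the result after running the rule is to check that each npz file contains both metadata arrays and that their lengths match the number of sequences. The sketch below is only a suggestion: check_metadata is a hypothetical helper that is not part of this PR, and the key holding the sequences ("sequences" in the demo) is an assumed name that should be replaced with whatever the pipeline actually uses.

```python
import os
import tempfile

import numpy as np


def check_metadata(npz_path, sequence_key):
    """Hypothetical helper: verify metadata arrays and return the sequence count."""
    metadata_keys = ("start_dates", "site_ids")  # names taken from this PR
    with np.load(npz_path) as npz:
        for key in metadata_keys:
            if key not in npz.files:
                raise KeyError(f"{key} is missing from {npz_path}")
        n_sequences = npz[sequence_key].shape[0]
        for key in metadata_keys:
            if npz[key].shape[0] != n_sequences:
                raise ValueError(
                    f"{key} has {npz[key].shape[0]} entries, "
                    f"expected {n_sequences}"
                )
    return n_sequences


# Demo on a small synthetic file; "sequences" is an assumed key name.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train.npz")
    np.savez(
        demo_path,
        sequences=np.zeros((3, 4, 2)),
        start_dates=np.array(
            ["2010-01-01", "2010-02-01", "2010-03-01"], dtype="datetime64[D]"
        ),
        site_ids=np.array(["lake_A", "lake_A", "lake_B"]),
    )
    n_checked = check_metadata(demo_path, "sequences")

print(n_checked)
```

Against the real outputs, the call would look like check_metadata("2_process/out/model_prep/train.npz", sequence_key) with the appropriate key substituted.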

How to review this PR

Level of review requested

The code seems to work - data appear in the npzs where expected.

The main things I'd like reviewed are:

Do the pipeline additions and modifications make sense?
Have any errors been introduced?

Where in the code to focus

Anything that's been changed is fair game, but the majority of the changes are in lake_sequences.py and training_data.py.

Issues that will be addressed in upcoming PRs (so don't worry about them yet)

Changes to 3_train functions that accommodate these updates
Change directory structure to be more nested

DOI-USGS / lake-temperature-lstm-static

Save training metadata #38
