This PR creates training data using data sourced from the lake-temperature-model-prep pipeline. It does the following:

- Pairs NLDAS drivers with lakes using the meteo crosswalk from `7_config_merge/out/nml_Kw_values.rds`
- Adds static clarity values to the lake metadata (not used for training yet)
- Changes the model-prep rules that form the lake metadata and meteo crosswalk csvs into checkpoints. They need to be checkpoints because their outputs are used inside the functions `dynamic_filenames_model_prep` and `get_lake_sequence_files`. If they weren't checkpoints, those files wouldn't be created before the functions are called, and the pipeline wouldn't work.
- Alters `lake_sequences.py` to accommodate both MNTOHA and model-prep data
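For reviewers less familiar with checkpoints: the pattern looks roughly like the sketch below. The rule names, file paths, and the `read_lake_ids` helper are illustrative only, not the actual names in `2_process.smk`; only `get_lake_sequence_files` and `create_training_data` come from this PR.

```python
# Sketch of the Snakemake checkpoint pattern (names/paths are illustrative).
checkpoint lake_metadata:
    output:
        "2_process/out/model_prep/lake_metadata.csv"
    script:
        "2_process/src/lake_metadata.py"

def get_lake_sequence_files(wildcards):
    # checkpoints.<name>.get() pauses DAG evaluation until the checkpoint
    # has actually run, so its output file exists when we read it here.
    # A plain rule's output would not be guaranteed to exist at this point.
    metadata_csv = checkpoints.lake_metadata.get(**wildcards).output[0]
    lake_ids = read_lake_ids(metadata_csv)  # hypothetical helper
    return [f"2_process/out/model_prep/sequences_{i}.npy" for i in lake_ids]

rule create_training_data:
    input:
        get_lake_sequence_files
    output:
        "2_process/out/model_prep/train.npz"
```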
Here's a DAG for forming the model-prep training data:
## How to run the code

I've built out the pipeline up through the rule `create_training_data`. So, to run this part of the pipeline:

```
snakemake -c1 2_process/out/model_prep/train.npz
```
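Once that target builds, a quick way to sanity-check the output is to list the arrays in the `.npz` archive with numpy. The snippet below writes a toy archive to stand in for `2_process/out/model_prep/train.npz`; the array name `sequences` and its shape are hypothetical, not the pipeline's actual keys.

```python
import numpy as np

# Toy stand-in for the pipeline's train.npz (name/shape are hypothetical).
np.savez("toy_train.npz", sequences=np.zeros((4, 400, 9), dtype=np.float32))

# List the stored arrays and their shapes/dtypes.
with np.load("toy_train.npz") as data:
    for name in data.files:
        print(name, data[name].shape, data[name].dtype)
```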
## How to review this PR

### Level of review requested

The main things I'd like reviewed are:

- Do the pipeline additions and modifications make sense?
- Have any errors been introduced into the existing MNTOHA pipeline?

### Where in the code to focus

Anything that's been changed is fair game, but the majority of the changes are in `2_process.smk`.
## Issues that will be addressed in upcoming PRs (so don't worry about them yet)

- Form validation set during 2_process
- Save more metadata alongside `train.npz`, `validate.npz`, and `test.npz` for use during model evaluation