DOI-USGS / lake-temperature-lstm-static

Predict lake temperatures at depth using static lake attributes

Interpolate predictions #47

Closed AndyMcAliley closed 2 years ago

AndyMcAliley commented 2 years ago

Interpolate predicted temperatures from fixed depths to depths of observed temperatures.

The LSTM predicts temperatures at a set of fixed depths at every time step (day). Here we interpolate those predictions to the depths of every observation in a training, validation, or test set for the sake of model evaluation.
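As a rough sketch of the interpolation step (the depth values, array names, and use of linear interpolation here are illustrative assumptions, not taken from the repo's code):

```python
import numpy as np

# Hypothetical fixed depths (m) at which the LSTM predicts temperature
fixed_depths = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0])
# Predicted temperatures (deg C) at those depths for one lake on one day
predicted = np.array([24.1, 23.8, 22.9, 20.5, 14.2, 9.7])

# Depths (m) of the observations for that same lake-day
obs_depths = np.array([0.3, 1.5, 6.0])

# Linearly interpolate the predictions to the observed depths
predicted_at_obs = np.interp(obs_depths, fixed_depths, predicted)
```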

The trickiest part is matching each observation to the right prediction sequence. Training, validation, and testing sequences are formed as follows:

  • Every sequence spans a fixed number of days (currently set to 400).
  • The first part of each sequence is "spin-up" or "burn-in" time (currently set to 100 days). Only predictions after spin-up are used for model evaluation.
  • Consecutive sequences can overlap (currently set to 200 days).
  • Only sequences with at least one observation after spin-up are included in a dataset.

Every observation for which we have NLDAS driver data has at least one corresponding sequence, and because sequences overlap, some observations have more than one corresponding sequence.
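A minimal sketch of the sequence bookkeeping those rules imply (the record length and variable names here are hypothetical):

```python
import numpy as np

# Sequence settings quoted above
sequence_length = 400                 # days per sequence
spinup = 100                          # burn-in days at the start of each sequence
overlap = 200                         # days shared by consecutive sequences
stride = sequence_length - overlap    # offset between consecutive sequence starts

# Hypothetical 10-year daily driver record
n_days = 3650
seq_starts = np.arange(0, n_days - sequence_length + 1, stride)

# For a sequence starting on day s, only predictions on days
# s + spinup through s + sequence_length - 1 count toward evaluation.
```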

When it comes to finding the right sequence for each observation, performance matters. My original implementation was a set of nested for loops over all observations and all sequences. It was on track to take 20 hours for the validation set. The current implementation takes 2-3 minutes. I'm using a combination of batch processing, broadcasting, np.nonzero, and np.where to take advantage of vectorized operations.
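Here is a minimal sketch of that vectorized matching under the sequence settings above (the arrays are toy inputs; the repo's actual implementation may differ in detail):

```python
import numpy as np

sequence_length, spinup = 400, 100
seq_starts = np.array([0, 200, 400, 600])   # toy sequence start days
obs_days = np.array([150, 420, 450, 980])   # toy observation days

# Broadcast (n_obs, 1) against (n_seq,) to build an (n_obs, n_seq) mask of
# which sequences contain each observation after spin-up
match = ((obs_days[:, None] >= seq_starts + spinup)
         & (obs_days[:, None] < seq_starts + sequence_length))

# Indices of every matching (observation, sequence) pair
obs_idx, seq_idx = np.nonzero(match)

# One sequence per observation: the first match, or -1 if none exists
chosen = np.where(match.any(axis=1), match.argmax(axis=1), -1)
```

For a large dataset, the mask can be built over batches of observations to keep memory bounded, which is where the batch processing mentioned above comes in.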

The question of how to evaluate model performance was first raised in lake-temperature-process-models. With this PR, we can evaluate models in the same way as lake-temperature-process-models.

How to run the code

The rule interpolate_predictions is the new addition. So, to run this part of the pipeline:

snakemake -c1 -s 4_evaluate.smk 4_evaluate/out/model_prep/initial/cpu_a/interpolated_predictions_valid.csv

That should produce interpolated_predictions_valid.csv, whose predicted_temperature_obs_depth column contains the interpolated predictions.

NOTE: Running the command above requires a compiled dataset and a trained model, each of which takes several hours to obtain, plus a set of observed temperatures. If you want, copy the following files from Tallgrass, preserving their directory structure relative to the repo root directory:

  • 2_process/out/model_prep/valid.npz
  • 2_process/tmp/model_prep/temperature_observations_interpolated.csv
  • 3_train/out/model_prep/initial/cpu_a_metadata.npz
  • 3_train/out/model_prep/initial/cpu_a_weights.npz
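For orientation, the new rule could look something like this, wiring those files to the target above (a sketch only; the actual rule in 4_evaluate.smk may differ, and the script path is hypothetical):

```python
# Sketch of the new rule in 4_evaluate.smk (hypothetical; see the repo for the real one)
rule interpolate_predictions:
    input:
        data="2_process/out/model_prep/valid.npz",
        observations="2_process/tmp/model_prep/temperature_observations_interpolated.csv",
        metadata="3_train/out/model_prep/initial/cpu_a_metadata.npz",
        weights="3_train/out/model_prep/initial/cpu_a_weights.npz",
    output:
        "4_evaluate/out/model_prep/initial/cpu_a/interpolated_predictions_valid.csv"
    script:
        "4_evaluate/src/interpolate_predictions.py"  # hypothetical script path
```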

How to review this PR

The main things I'd like reviewed are:

  1. Do the pipeline additions and modifications make sense?
  2. Have any errors been introduced into the pipeline?

Closes #46.