DOI-USGS / lake-temperature-lstm-static

Predict lake temperatures at depth using static lake attributes

2_process: Create sequenced training data for LSTM #11

Closed AndyMcAliley closed 2 years ago

AndyMcAliley commented 2 years ago

2_process takes the raw downloaded observations, drivers, and attributes and formats them for model training. Closes #9 and #10.

To run the full pipeline: snakemake --snakefile Snakefile -c4 -p --rerun-incomplete. The -c4 flag sets the number of cores to use; adjust it as needed.

There are several steps to get the sequences that comprise the training data. Each step is associated with a Snakemake rule. Below, I've broken each step down into the Snakemake rule name, notes on the Python part of the implementation, and notes on the Snakemake part of the implementation.

  1. Unzip zipped files to 2_process/tmp folder.
    • rule: unzip_mntoha
    • Implementation:
      • unzip_all calls unzip_file.
    • Snakemake:
      • The get_mntoha_input_files function provides the list of zip files to unzip.
      • The output is a log file listing all unzipped files. This output file acts like a dummy file: later rules that depend directly on the unzipped files can include the log as an input. (This pattern is sketched after the list.)
      • This step is not parallelized. It could be, by associating every zip file with its own output file and then calling a summary function to write a combined log file, but creating a separate output file for each zip file seems messy. Unzipping doesn't take long and only needs to be done once, so I didn't think it was worth the extra complexity and effort.
  2. Interpolate temperature observations from arbitrary depths to a fixed set of discrete depths.
    • rule: interpolate_mntoha_obs_depths
    • Implementation:
      • make_obs_interpolated uses nearest neighbor interpolation to assign each observation to the nearest discrete depth value (sketched after the list).
      • The output file is identical to the input file, but with an added column: "interpolated_depth".
    • Snakemake:
      • The observation file is zipped upon downloading, so this rule includes the unzip_mntoha log file as an input to make sure it gets unzipped first.
  3. Augment the lake metadata to include all lake attributes to be used as model input features (right now, that just means adding elevation).
    • rule: augment_mntoha_lake_metadata
    • Implementation:
      • The output file is identical to the input file, but with an added column: "elevation".
      • We may want to augment the metadata with more lake attributes at this step later. That's why the names of the rule and the function aren't specific to adding elevation.
      • This rule takes a while because it queries the USGS Elevation Point Query Service for the elevation at each lake's lat/lon centroid (a rough sketch of that lookup follows the list).
    • Snakemake:
      • Unlike interpolate_mntoha_obs_depths, unzip_mntoha's log file isn't included as an input, only fetch_all's output downloaded_files.txt. That's because lake_metadata.csv isn't zipped on ScienceBase, so this rule doesn't rely on unzipping.
      • The full path to the metadata csv is 1_fetch/out/metadata_mntoha/lake_metadata.csv, so it matches the 1_fetch/out/{file_category}_mntoha/{file} pattern of fetch_mntoha_data_file.
  4. Create fixed-length time series sequences of inputs/outputs for one MNTOHA lake. Each sequence will constitute one training example.
    • rule: mntoha_lake_sequences
    • Implementation:
      • I tried to write this code so that it will be reusable when we switch to another data source beyond MNTOHA, but I didn't spend much time anticipating how that will work.
      • Each lake's sequences get written to a lake-specific .npy file. Those .npy files can be combined later, as needed.
      • This step is the most computationally costly so far.
      • This step was tricky to implement. There are two particularly tricky aspects to it.
        • Take sparse observations specific to a few depths and a few days, and create a big full array of all depths and days, with NaN where observations are missing. all_dates_depths does the hard work here, using pandas.
        • Divide the big full array into equal-length sequences of length sequence_length. The maybe-too-clever-for-its-own-good bit uses np.lib.stride_tricks to do this without having to copy the big full array. (Both parts are sketched after the list.)
    • Snakemake:
      • Drivers, including clarity and ice flags, are zipped upon download. Therefore, this rule includes unzip_mntoha's log file as an input.
      • I used script instead of run here, but the rule could be implemented either way. We could choose to standardize and call either script or run every time, if that would help to simplify things.
  5. Execute step 4 for all lakes in parallel.
    • rule: process_mntoha
    • Implementation: The Python code is straightforward.
    • Snakemake:
      • The input is every lake sequence file that mntoha_lake_sequences should create. That way, when the rule process_mntoha is called, it will trigger each lake sequence file to be built.
        • The function mntoha_lake_sequence_files reads lake_metadata.csv to return that list of every sequence file that should be created.
        • The problem is that we need lake_metadata.csv to create that list of lakes. lake_metadata.csv is itself downloaded during the fetch_all rule, so it's not necessarily available when snakemake is first called. Therefore, Snakemake can't create its full dependency graph up front. To deal with this, I turned the rule fetch_all into a checkpoint (the Snakemake documentation has short introductions to checkpoints).
        • The function mntoha_lake_sequence_files calls upon the output of the fetch_all checkpoint (this pattern is sketched after the list). This accomplishes two things.
          1. It makes the process_mntoha rule dependent upon fetch_all.
          2. It lets Snakemake know not to keep running the mntoha_lake_sequence_files function until fetch_all is complete. Since mntoha_lake_sequence_files specifies jobs that process_mntoha depends on, Snakemake knows to hold off on adding those jobs to the dependency graph until after fetch_all has run.
      • Like unzip_mntoha, the output is a summary text file that acts like a dummy file to ensure that all lakes have been processed through mntoha_lake_sequences.
      • The function save_sequences_summary provides that summary text file.
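
To make the patterns above more concrete, here are a few minimal sketches. First, the unzip-and-log pattern from step 1; the function names and paths are illustrative rather than the exact ones in the repo.

```python
# Illustrative sketch of step 1: unzip every archive, then write one log file
# listing the extracted files. The log doubles as the rule's output, so
# downstream rules can depend on it instead of on individual unzipped files.
# Function names and paths are assumptions, not necessarily those in 2_process.
import zipfile
from pathlib import Path

def unzip_file(zip_path, destination):
    """Extract one zip archive and return the paths of the extracted files."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(destination)
        return [str(Path(destination) / name) for name in zf.namelist()]

def unzip_all(zip_paths, destination, log_path):
    """Unzip every archive and log the extracted file paths to log_path."""
    extracted = []
    for zip_path in zip_paths:
        extracted.extend(unzip_file(zip_path, destination))
    Path(log_path).write_text("\n".join(extracted) + "\n")
    return extracted
```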
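Next, the nearest-neighbor depth assignment from step 2; the column names and depth grid are assumptions for illustration.

```python
# Illustrative sketch of step 2: snap each observation depth to the nearest
# value in a fixed grid of discrete depths. Column names are assumptions.
import numpy as np
import pandas as pd

def add_interpolated_depth(obs: pd.DataFrame, discrete_depths: np.ndarray) -> pd.DataFrame:
    """Return a copy of obs with an added 'interpolated_depth' column."""
    depths = obs["depth"].to_numpy()
    # Index of the nearest discrete depth for every observation
    nearest = np.abs(depths[:, None] - discrete_depths[None, :]).argmin(axis=1)
    out = obs.copy()
    out["interpolated_depth"] = discrete_depths[nearest]
    return out

# Example: a grid of depths every 0.5 m down to 50 m
obs = pd.DataFrame({"depth": [0.2, 1.7, 3.4], "temp": [22.1, 21.5, 18.9]})
print(add_interpolated_depth(obs, np.arange(0, 50.5, 0.5)))
```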
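Step 3's slowness comes from making one elevation request per lake centroid. The sketch below shows the shape of that work; the EPQS endpoint, its parameters, the response field, and the metadata column names are all assumptions here, so check the service's documentation rather than relying on this.

```python
# Illustrative sketch of step 3: look up an elevation for each lake centroid.
# One HTTP request per lake is what makes this rule slow.
# The endpoint URL, query parameters, and JSON field names are assumptions.
import pandas as pd
import requests

EPQS_URL = "https://epqs.nationalmap.gov/v1/json"  # assumed endpoint

def lookup_elevation(lon: float, lat: float) -> float:
    """Return the elevation (meters) at a lon/lat point, per the assumed EPQS API."""
    response = requests.get(
        EPQS_URL, params={"x": lon, "y": lat, "units": "Meters", "wkid": 4326}
    )
    response.raise_for_status()
    return float(response.json()["value"])  # assumed response field

def add_elevation(lake_metadata: pd.DataFrame) -> pd.DataFrame:
    """Add an 'elevation' column by querying each lake's centroid in turn."""
    out = lake_metadata.copy()
    out["elevation"] = [
        lookup_elevation(lon, lat)  # assumed column names below
        for lon, lat in zip(out["centroid_lon"], out["centroid_lat"])
    ]
    return out
```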
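For step 4, here are the two tricky parts: build a complete date-by-depth grid with NaN where there's no observation, then slice it into fixed-length sequences as zero-copy views. Column names, the grid, and the sequence parameters are illustrative.

```python
# Illustrative sketch of step 4. Column names and parameters are assumptions.
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def full_date_depth_grid(obs: pd.DataFrame,
                         dates: pd.DatetimeIndex,
                         depths: np.ndarray) -> np.ndarray:
    """Return a (n_dates, n_depths) temperature array, NaN where unobserved."""
    return (
        obs.pivot_table(index="date", columns="interpolated_depth", values="temp")
        .reindex(index=dates, columns=depths)
        .to_numpy()
    )

def to_sequences(full_array: np.ndarray, sequence_length: int, stride: int) -> np.ndarray:
    """Return a zero-copy view of shape (n_sequences, sequence_length, n_features)."""
    windows = sliding_window_view(full_array, window_shape=sequence_length, axis=0)
    # sliding_window_view puts the window axis last; move it next to the sequence axis
    return np.moveaxis(windows, -1, 1)[::stride]

# Example: 3 years of daily rows, sequences of 200 days starting every 100 days
grid = np.random.rand(1095, 25)
print(to_sequences(grid, sequence_length=200, stride=100).shape)  # (9, 200, 25)
```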
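And for step 5, the checkpoint pattern in Snakemake syntax. The script paths, the sequence-file naming, and the site_id column are simplified assumptions; the real rules live in the repo's .smk files.

```snakemake
# Illustrative Snakefile sketch of the checkpoint pattern in step 5.
# Script paths, file naming, and the 'site_id' column are assumptions.
import pandas as pd

# A checkpoint tells Snakemake to re-evaluate the DAG after this rule runs,
# so downstream rules can depend on files that only exist once it finishes.
checkpoint fetch_all:
    output:
        "1_fetch/out/downloaded_files.txt"
    script:
        "1_fetch/src/fetch_all.py"

def mntoha_lake_sequence_files(wildcards):
    # .get() makes Snakemake wait until fetch_all has completed before this
    # function is evaluated, guaranteeing that lake_metadata.csv exists.
    checkpoints.fetch_all.get()
    lake_metadata = pd.read_csv("1_fetch/out/metadata_mntoha/lake_metadata.csv")
    return [
        f"2_process/out/mntoha_sequences/sequences_{site_id}.npy"
        for site_id in lake_metadata["site_id"]
    ]

rule process_mntoha:
    input:
        mntoha_lake_sequence_files
    output:
        "2_process/out/mntoha_sequences/mntoha_sequences_summary.csv"
    script:
        "2_process/src/save_sequences_summary.py"
```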
jsadler2 commented 2 years ago

@AndyMcAliley - do you think you could upload an image of the DAG? I'm hoping that won't take you long and would make it easier for me to quickly see what your workflow is doing. You can render just the bit in phase 2_ to an svg with:

snakemake 2_process/out/mntoha_sequences/mntoha_sequences_summary.csv -s 2_process.smk  --dag | dot -Tsvg > dag.svg

Then you'd have to convert that to a .png or .jpg with either an online tool (https://svgtopng.com) or something like Inkscape.

The one hitch might come with having the right libraries (like dot and graphviz) so if it's turning into a rabbit hole, don't worry about it.

AndyMcAliley commented 2 years ago

Thanks for the command! Here it is, @jsadler2:

[DAG image]

I limited the downloaded files and the lake_sequences files to 5 apiece to keep the image small and readable.

AndyMcAliley commented 2 years ago

New DAG, obtained by running snakemake --filegraph | dot -Tsvg > filegraph.svg

[filegraph image]

AndyMcAliley commented 2 years ago

I also reordered the rules and functions in 2_process.smk so that their ordering follows the order they're executed in the pipeline. Closing now!