DOI-USGS / lake-temperature-lstm-static

Predict lake temperatures at depth using static lake attributes

Fetch MNTOHA data from ScienceBase using Snakemake #6

Closed · AndyMcAliley closed 2 years ago

AndyMcAliley commented 2 years ago

This is a first pass at 1_fetch, using MNTOHA since that download is straightforward. Eventually we'll use more lakes than just MNTOHA. This PR uses Snakemake to manage the pipeline. Closes #1 and closes #4. #2 is addressed except for ensuring that PyTorch is GPU-enabled.

Snakemake

This is my first time using Snakemake, so I'm probably not using it optimally. Still, here's a walkthrough of how everything is set up. I consulted the Snakemake docs, drb-estuary-salinity-ml, and river-dl along the way.

The Snakefile contains all the rules for what needs to be made. The first rule is always the default rule that runs if you call snakemake from the command line without specifying any particular rule, so by convention that rule is the "all" rule. Here, it's called fetch_all. It ensures that every file we want is, in fact, downloaded to the proper place by requiring all those files as inputs. This rule has no outputs of its own because it relies on other rules to actually do the downloading - it just lists all the files we want as dependencies. It also has one other input: 1_fetch/in/pull_date.txt. That file contains nothing but a date. It's a dummy file that can be altered when we want to trigger a fresh download.

The way fetch_all requires all those files is by referencing the config dictionary, which is loaded from 1_fetch/fetch_config.yaml. There, you'll find every individual file listed, plus the ScienceBase IDs they're downloaded from.
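
Putting those two pieces together, the top of the Snakefile might look something like the sketch below. This is illustrative only: the config key names (metadata_files, obs_files, driver_files) are stand-ins, not necessarily what's in fetch_config.yaml.

```snakemake
# Sketch only: config key names are assumptions, not the exact
# contents of 1_fetch/fetch_config.yaml.
configfile: "1_fetch/fetch_config.yaml"

rule fetch_all:
    input:
        # Dummy file whose contents can be changed to force a re-fetch
        "1_fetch/in/pull_date.txt",
        # Every file named in the config, grouped by destination folder
        expand("1_fetch/out/metadata/{file}", file=config["metadata_files"]),
        expand("1_fetch/out/obs/{file}", file=config["obs_files"]),
        expand("1_fetch/out/driver/{file}", file=config["driver_files"])
```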

Running snakemake -c1 -p fetch_all targets this rule, which downloads everything.

The other rules are "generalized" rules that tell Snakemake how to download any file whose path matches a specific pattern. For instance, say we want to download 1_fetch/out/metadata/lake_metadata.csv. Snakemake notices that this file path matches the wildcard pattern in the output of the fetch_metadata rule (1_fetch/out/metadata/ followed by a filename wildcard), so it uses the fetch_metadata rule to make that file. Likewise, the fetch_obs rule covers any files downloaded to 1_fetch/out/obs/, and the fetch_driver rule covers any files downloaded to 1_fetch/out/driver/. A sketch of one such rule follows; the shell command it contains is explained in the next paragraph.
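
Here's roughly what one generalized rule could look like. The {file} wildcard name and the config key are assumptions, not the repo's exact names:

```snakemake
# Sketch of one generalized rule; fetch_obs and fetch_driver would
# differ only in their output directory and ScienceBase ID.
rule fetch_metadata:
    output:
        "1_fetch/out/metadata/{file}"
    shell:
        "python 1_fetch/src/sb_fetch.py {config[metadata_sb_id]} {output}"
```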

The shell directive is followed by the shell command to run, and that command is expected to produce the files named in the output directive. Here, the shell commands for the three rules all call the same Python script, 1_fetch/src/sb_fetch.py, but with different arguments: the first argument is the ScienceBase ID, and the second is the file path. The ScienceBase ID is drawn from the Snakemake config file, 1_fetch/fetch_config.yaml; each rule has its own ScienceBase ID associated with it.
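
For reference, here's a minimal sketch of what a script like sb_fetch.py could look like, assuming the sciencebasepy client. This is hypothetical, not the actual implementation:

```python
# Hypothetical sketch of 1_fetch/src/sb_fetch.py, not the actual script.
# Assumes sciencebasepy; argument order (SB ID, then file path) follows
# the description above.
import argparse
import os

from sciencebasepy import SbSession


def main():
    parser = argparse.ArgumentParser(
        description="Download one file from a ScienceBase item")
    parser.add_argument("sb_id", help="ScienceBase item ID")
    parser.add_argument("filepath", help="Destination path; the basename "
                        "is the name of the file to fetch from the item")
    args = parser.parse_args()

    destination = os.path.dirname(args.filepath)
    filename = os.path.basename(args.filepath)
    if destination:
        os.makedirs(destination, exist_ok=True)

    sb = SbSession()
    item = sb.get_item(args.sb_id)
    # Find the matching file among the item's attached files and download it
    for file_info in sb.get_item_file_info(item):
        if file_info["name"] == filename:
            sb.download_file(file_info["url"], filename, destination)
            break
    else:
        raise FileNotFoundError(
            f"{filename} not found in ScienceBase item {args.sb_id}")


if __name__ == "__main__":
    main()
```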

This aspect of the pipeline is a bit brittle. Because the download path determines which rule to use, and each rule is associated with one SB ID, every SB ID has to have its own subdirectory in 1_fetch/out/. Not a huge deal, but I suspect there's a more robust way. It may not matter, though, since we'll eventually move away from MNTOHA and probably away from SB downloads entirely. I do want to get my head around Snakemake best practices, though.

The good news is that this method makes Snakemake aware of every file to download, so files can be downloaded in parallel (by raising the core count passed to -c). If one file is missing or a download failed partway through, calling Snakemake again will pick up where it left off, without re-downloading everything from that SB ID.

AndyMcAliley commented 2 years ago

The workflow should be general enough to accommodate future data downloads from other ScienceBase items besides MNTOHA, or from sources other than ScienceBase. I renamed a number of things from "sb" to "mntoha" to make room for those future changes. In 1_fetch/out, I also added _mntoha as a suffix to each folder in anticipation of pulling other data later. That suffix lets Snakemake match each requested file path to the right download rule.
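
So a future layout might route files like this sketch, where the rule, wildcard, and config key names are all hypothetical:

```snakemake
# Sketch: the _mntoha suffix makes the data source part of the output
# pattern, so each source gets its own rule without path collisions.
rule fetch_mntoha_obs:
    output:
        "1_fetch/out/obs_mntoha/{file}"
    shell:
        "python 1_fetch/src/sb_fetch.py {config[obs_mntoha_sb_id]} {output}"

# A rule for a future, non-ScienceBase source could claim a different
# suffix, e.g. 1_fetch/out/obs_othersource/{file}, with its own shell command.
```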