AndyMcAliley (closed 2 years ago)
The workflow should be general enough to accommodate future data downloads from other ScienceBase items besides MNTOHA, or from sources other than ScienceBase. I renamed a number of things from "sb" to "mntoha" to make room for those future changes. In `1_fetch/out`, I also added `_mntoha` as a suffix to each folder in anticipation of pulling other data later. That suffix can help Snakemake figure out how to do the download.
This is a first pass at `1_fetch`, using MNTOHA since that download is straightforward. Eventually we'll use more lakes than MNTOHA. This uses Snakemake to manage the pipeline. Closes #1 and closes #4. #2 is addressed except for ensuring that PyTorch is GPU-enabled.

### Snakemake
This is my first time using Snakemake, so I'm probably not using it optimally. Still, here's a walkthrough of how it's set up. I set this up while consulting the docs, drb-estuary-salinity-ml, and river-dl.

The Snakefile contains all the rules for what needs to be made. The first rule is always the default rule that gets built if you call `snakemake` from the command line without specifying any particular rule, so generally that rule is the "all" rule. Here, it's called `fetch_all`. It ensures that every file we want is, in fact, downloaded to the proper place. It does that by requiring all those files as inputs. This rule has no outputs of its own because it relies on other rules to actually do the downloading; it just lists all the files we want as dependencies. It also has one other input: `1_fetch/in/pull_date.txt`.
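As a concrete sketch, an "all" rule along these lines might look as follows. The config keys (`metadata_files`, `obs_files`) are made up for illustration; the repo's actual config layout may differ:

```
# Snakefile (illustrative sketch, not the repo's actual rule)
configfile: "1_fetch/fetch_config.yaml"

rule fetch_all:
    input:
        # dummy file used to trigger fresh downloads
        "1_fetch/in/pull_date.txt",
        # every file we want, enumerated from the config
        # ("metadata_files" and "obs_files" are hypothetical keys)
        expand("1_fetch/out/metadata/{f}", f=config["metadata_files"]),
        expand("1_fetch/out/obs/{f}", f=config["obs_files"])
```

Because this rule has inputs but no outputs or action of its own, running it simply forces Snakemake to bring every listed file up to date.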
This file just lists a date. It's a dummy file that can be altered when we want to trigger a fresh download.

The way `fetch_all` requires all the files we want is by referencing the config dictionary, which is loaded from `1_fetch/fetch_config.yaml`. There, you'll find every individual file listed, plus ScienceBase IDs.

Running `snakemake -c1 -p fetch_all`
specifies this rule, which downloads everything.

The other rules are "generalized" rules for how to download files with file paths that match a specific pattern. For instance, say we want to download `1_fetch/out/metadata/lake_metadata.csv`. Snakemake notices that this file path matches the pattern `1_fetch/out/metadata/*`, which is the pattern specified in the output of the `fetch_metadata` rule, so Snakemake uses the `fetch_metadata` rule for that file. Likewise, the `fetch_obs` rule is for any files downloaded to `1_fetch/out/obs/`, and the `fetch_driver` rule is for any files downloaded to `1_fetch/out/driver/`.
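To make the pattern matching concrete, one such generalized rule can be sketched like this. The wildcard name and the config key `metadata_sb_id` are illustrative, not copied from the repo:

```
# Snakefile (illustrative sketch of one generalized download rule)
rule fetch_metadata:
    output:
        # any requested path under 1_fetch/out/metadata/ matches this
        "1_fetch/out/metadata/{file}"
    params:
        sb_id=config["metadata_sb_id"]  # hypothetical config key
    shell:
        "python 1_fetch/src/sb_fetch.py {params.sb_id} {output}"
```

When a requested file matches the `{file}` wildcard in the output, Snakemake selects this rule and fills in the wildcard from the requested path.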
The `shell` directive is followed by the shell command to run. That command is expected to produce the output files named in the `output` directive. Here, the `shell` commands for the three rules all call the same Python script, `1_fetch/src/sb_fetch.py`, but with different arguments. The first argument is the ScienceBase ID, and the second argument is the file path. The ScienceBase ID is drawn from the Snakemake config file, found at `1_fetch/fetch_config.yaml`. Each rule has its own ScienceBase ID associated with it.

This aspect of the pipeline is a bit brittle. Since the download path is used to determine which rule to use, and each rule is associated with one SB ID, every SB ID has to have its own subdirectory in `1_fetch/out/`. Not a huge deal, but I suspect there's a more robust way. It may not matter, though, since we'll eventually move away from MNTOHA and probably from SB downloads. I do want to get my head around Snakemake
best practices, though.

The good news is that this method makes Snakemake aware of every file to download. That way, files can be downloaded in parallel. If one file is missing or a download failed partway through, calling Snakemake again will cause it to pick up where it left off, without needing to re-download everything from that SB ID.
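For reference, here is a minimal, hypothetical sketch of the calling convention described above (ScienceBase ID first, file path second). This is not the repo's actual `1_fetch/src/sb_fetch.py`; the download step is stubbed out rather than calling a real ScienceBase client:

```python
# Hypothetical sketch, not the actual 1_fetch/src/sb_fetch.py.
import argparse
import os


def parse_args(argv=None):
    # Mirrors the calling convention described above:
    # first argument is the ScienceBase item ID, second is the file path.
    parser = argparse.ArgumentParser(
        description="Download one file from a ScienceBase item")
    parser.add_argument("sb_id", help="ScienceBase item ID to download from")
    parser.add_argument("out_path",
                        help="destination path; its basename names the file to fetch")
    return parser.parse_args(argv)


def fetch(sb_id, out_path):
    # In the real script this would use a ScienceBase client
    # (e.g., sciencebasepy) to download os.path.basename(out_path)
    # from item sb_id. Stubbed out here.
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    raise NotImplementedError("download step omitted in this sketch")


if __name__ == "__main__":
    args = parse_args()
    fetch(args.sb_id, args.out_path)
```

Because the script takes the destination path as an argument, Snakemake can pass each rule's `{output}` straight through, and the same script serves all three rules.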