@AndyMcAliley - do you think you could upload an image of the DAG? I'm hoping that won't take you long, and it would make it easier for me to quickly see what your workflow is doing. You can render just the bit in phase `2_` to an svg with:

```
snakemake 2_process/out/mntoha_sequences/mntoha_sequences_summary.csv -s 2_process.smk --dag | dot -Tsvg > dag.svg
```
Then you'd have to convert that to a .png or .jpg with either an online tool (https://svgtopng.com) or something like Inkscape.
The one hitch might come with having the right libraries (like `dot` and graphviz), so if it's turning into a rabbit hole, don't worry about it.
Thanks for the command! Here it is, @jsadler2: I limited the downloaded files and the `lake_sequences` files to 5 apiece to keep the image small and readable.
New DAG, obtained by running `snakemake --filegraph | dot -Tsvg > filegraph.svg`.
I also reordered the rules and functions in `2_process.smk` so that they follow the order in which they're executed in the pipeline. Closing now!
`2_process` takes the raw downloaded observations, drivers, and attributes and formats them for model training. Closes #9 and #10.

To run the full pipeline:

```
snakemake --snakefile Snakefile -c4 -p --rerun-incomplete
```

You can choose the number of cores with the `-c4` flag.

There are several steps to get the sequences that comprise the training data. Each step is associated with a Snakemake rule. Below, I've broken apart each step into the Snakemake rule name, notes on the Python part of the implementation, and notes on the Snakemake part of the implementation.
**rule: `unzip_mntoha`**

- `unzip_all` calls `unzip_file`.
- The `get_mntoha_input_files` function provides the list of zip files to unzip (sketched below).
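In case it helps to see the shape of such a rule, here is a minimal Snakemake sketch of the unzip step; the paths, the log-file name, and the body of the helper function are assumptions for illustration, not the actual `2_process.smk` code.

```
# Hypothetical sketch of the unzip step; all paths and names are assumptions.
import glob
import zipfile

def get_mntoha_input_files(wildcards):
    # Stand-in: in the real pipeline, the list of zip files comes from 1_fetch's outputs.
    return sorted(glob.glob("1_fetch/out/*_mntoha/*.zip"))

rule unzip_mntoha:
    input:
        zip_files = get_mntoha_input_files
    output:
        # The log file doubles as a dummy target so downstream rules can depend on
        # "everything is unzipped" without listing every unzipped file.
        log_file = "2_process/tmp/unzip_mntoha.log"
    run:
        with open(output.log_file, "w") as log:
            for zf in input.zip_files:
                with zipfile.ZipFile(zf) as z:
                    z.extractall("2_process/tmp/mntoha")
                log.write(f"unzipped {zf}\n")
```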
**rule: `interpolate_mntoha_obs_depths`**

- `make_obs_interpolated` uses nearest neighbor interpolation to assign each observation to the nearest discrete depth value (see the example below).
- The rule takes the `unzip_mntoha` log file as an input to make sure the data gets unzipped first.
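For illustration only (this is not the actual `make_obs_interpolated` code, and the column names and depth grid are made up), nearest neighbor assignment of observations to discrete depths can be as simple as:

```python
import numpy as np
import pandas as pd

# Made-up discrete depth grid and observations for illustration.
discrete_depths = np.array([0.0, 0.5, 1.0, 2.0, 5.0, 10.0])
obs = pd.DataFrame({
    "site_id": ["lake_a", "lake_a", "lake_b"],
    "depth": [0.3, 4.2, 9.1],
    "temp_C": [22.1, 12.4, 8.0],
})

# Index of the closest discrete depth for each observation...
nearest = np.abs(obs["depth"].to_numpy()[:, None] - discrete_depths[None, :]).argmin(axis=1)
# ...becomes the depth the observation is assigned to.
obs["interpolated_depth"] = discrete_depths[nearest]
print(obs)
```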
**rule: `augment_mntoha_lake_metadata`**

- Unlike `interpolate_mntoha_obs_depths`, `unzip_mntoha`'s log file isn't included as an input here, only `fetch_all`'s output `downloaded_files.txt`. That's because `lake_metadata.csv` isn't zipped on ScienceBase, so this rule doesn't rely on unzipping.
- The input is `1_fetch/out/metadata_mntoha/lake_metadata.csv`. So, it matches the `1_fetch/out/{file_category}_mntoha/{file}` pattern of `fetch_mtoha_data_file` (see the sketch below).
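To make that wiring concrete, here is a rough sketch of how such a rule's inputs might be declared. The `downloaded_files.txt` path, the output path, and the augmentation body are assumptions; the point is only that the metadata file matches the fetch pattern and that the unzip log isn't needed.

```
# Rough sketch; paths and the augmentation logic are assumptions.
rule augment_mntoha_lake_metadata:
    input:
        # Matches the 1_fetch/out/{file_category}_mntoha/{file} pattern,
        # so Snakemake knows which fetch rule can produce it.
        metadata = "1_fetch/out/metadata_mntoha/lake_metadata.csv",
        # Depending on fetch_all's file list is enough; no unzip log is needed
        # because lake_metadata.csv isn't zipped on ScienceBase.
        downloads = "1_fetch/out/downloaded_files.txt"
    output:
        "2_process/out/lake_metadata_augmented.csv"
    run:
        import pandas as pd
        metadata = pd.read_csv(input.metadata)
        # ...add whatever derived columns the training step needs...
        metadata.to_csv(output[0], index=False)
```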
**rule: `mntoha_lake_sequences`**

- `all_dates_depths` does the hard work here, using `pandas`.
- Each lake's data gets split into sequences of length `sequence_length`. The maybe-too-clever-for-its-own-good bit uses `np.lib.stride_tricks` to do this without having to copy the big full array (illustrated below).
- This rule also takes `unzip_mntoha`'s log file as an input.
- I used `script` instead of `run` here, but the rule could be implemented either way. We could choose to standardize and call either `script` or `run` every time, if that would help to simplify things.
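As a generic illustration of the `np.lib.stride_tricks` idea (not the pipeline's code), `sliding_window_view` produces every overlapping window of length `sequence_length` as a view into the original array, so the big array never gets copied:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

sequence_length = 5  # illustrative value

# Fake time series: 12 time steps x 3 driver/feature columns.
data = np.arange(36, dtype=float).reshape(12, 3)

# Every overlapping window of sequence_length time steps, as a view (no copy).
# sliding_window_view appends the window axis last, so reorder to
# (n_sequences, sequence_length, n_features).
windows = sliding_window_view(data, window_shape=sequence_length, axis=0)
sequences = windows.transpose(0, 2, 1)
print(sequences.shape)  # (8, 5, 3)
```

(`sliding_window_view` needs numpy >= 1.20; older code often reaches for `as_strided` from the same module to get the same effect.)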
**rule: `process_mntoha`**

- The inputs are all the sequence files that `mntoha_lake_sequences` should create. That way, when the rule `process_mntoha` is called, it will trigger each lake sequence file to be built.
- `mntoha_lake_sequence_files` reads `lake_metadata.csv` to return that list of every sequence file that should be created; it uses `lake_metadata.csv` to determine the list of lakes.
- `lake_metadata.csv` is itself downloaded during the `fetch_all` rule, so it's not necessarily available when snakemake is first called. Therefore, Snakemake can't create its full dependency graph up front. To deal with this, I turned the rule `fetch_all` into a checkpoint (see the schematic below). See also here and here for small introductions to checkpoints.
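For anyone who hasn't used checkpoints before, the change on the fetch side is essentially just the keyword. This is a schematic only; the output path and body here are placeholders, not the real `fetch_all`:

```
# Schematic only; the real fetch_all lives in 1_fetch and does the actual downloads.
checkpoint fetch_all:
    output:
        "1_fetch/out/downloaded_files.txt"
    run:
        # Download files from ScienceBase, then record what was fetched.
        with open(output[0], "w") as f:
            f.write("placeholder: one line per downloaded file\n")
```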
- `mntoha_lake_sequence_files` calls upon the output of the `fetch_all` checkpoint (see the sketch below). This accomplishes two things:
  1. It makes the `process_mntoha` rule dependent upon `fetch_all`.
  2. It keeps Snakemake from calling the `mntoha_lake_sequence_files` function until `fetch_all` is complete. Since `mntoha_lake_sequence_files` specifies jobs that `process_mntoha` depends on, Snakemake knows to hold off on adding those jobs to the dependency graph until after `fetch_all` has run.
- Like `unzip_mntoha`, the output is a summary text file that acts like a dummy file to ensure that all lakes have been processed through `mntoha_lake_sequences`. `save_sequences_summary` provides that summary text file.
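Tying it together, here is a minimal sketch of how the input function and the aggregating rule can interact with the checkpoint. The sequence-file naming, the `site_id` column, and the summary contents are assumptions; the summary path comes from the command earlier in this thread.

```
import pandas as pd

def mntoha_lake_sequence_files(wildcards):
    # Calling .get() makes process_mntoha depend on the fetch_all checkpoint and
    # delays this function until fetch_all has finished, so lake_metadata.csv exists.
    checkpoints.fetch_all.get()
    metadata = pd.read_csv("1_fetch/out/metadata_mntoha/lake_metadata.csv")
    # One sequence file per lake (file naming and column name are assumptions).
    return [
        f"2_process/out/mntoha_sequences/sequences_{site_id}.npy"
        for site_id in metadata["site_id"]
    ]

rule process_mntoha:
    input:
        mntoha_lake_sequence_files
    output:
        "2_process/out/mntoha_sequences/mntoha_sequences_summary.csv"
    run:
        # The summary acts as a dummy file confirming every lake sequence file was
        # built, in the spirit of save_sequences_summary.
        pd.DataFrame({"sequence_file": list(input)}).to_csv(output[0], index=False)
```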