Use regular expressions to ensure that unzip_archive's output does not match files in subdirectories of 1_fetch/out/ or csv files in 1_fetch/out. Closes #13.

unzip_archive is a Snakemake checkpoint to unzip files from a zipped archive. It is a checkpoint because otherwise Snakemake won't track unzipped files from an archive and will delete or ignore them. Its output is a directory because we don't know how many unzipped files there will be, but we know which directory they'll be in after they are unzipped.
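To make the "known directory, unknown contents" point concrete, here is a minimal plain-Python sketch of the extraction step. It is not the rule's actual code: the function signature and return value are assumptions made for illustration, and it only uses the standard library's zipfile and pathlib modules.

```python
import zipfile
from pathlib import Path


def unzip_archive(zip_path, out_root):
    """Extract zip_path into out_root/<archive name without .zip>.

    Hypothetical helper for illustration; not the Snakemake rule itself.
    """
    zip_path = Path(zip_path)
    # The destination folder is known before extraction:
    # it is named after the archive, minus the .zip extension.
    dest = Path(out_root) / zip_path.stem
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # The names of the extracted files are only known after extraction.
    return dest, sorted(p.name for p in dest.iterdir())
```

The destination can be computed from the archive name alone, while the file list has to be read back from disk afterwards, which is the situation a checkpoint with a directory output is meant to handle.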

The output is the name of the directory that files are extracted to. Regular expressions ensure that files in subdirectories of 1_fetch/out/{file_category} don't get matched, and csvs in 1_fetch/out/ don't get matched.

Syntax explanation

The major change happens in one line of code:

folder = directory("1_fetch/out/{file_category,[^/]+}/{archive_name,[^/]+$(?<!\.csv)}")

First off, ignore the regular expressions and focus on the output string with wildcards only:

"1_fetch/out/{file_category}/{archive_name}"

file_category is the type of files we're unzipping, like "dynamic_mntoha" or "obs_mntoha".
archive_name is the name of the zipped archive itself (without the .zip extension), and also the name of the folder that the archive is being extracted to. For instance, if we unzip the MNTOHA clarity zip archive named clarity_06_N46.00-47.00_W94.50-97.00.zip, we'll extract its contents to 1_fetch/out/dynamic_mntoha/clarity_06_N46.00-47.00_W94.50-97.00. Then, file_category is dynamic_mntoha and archive_name is clarity_06_N46.00-47.00_W94.50-97.00.
directory(...) tells Snakemake to treat this output as a directory and not a file. The Snakemake documentation on directories as outputs has more detail.
folder = directory(...) allows us to reference this output by keyword, as checkpoints.unzip_archive.get().output.folder.
Snakemake allows regular expressions to match wildcards by placing them after a comma, like this: "{wildcard,regex}".
[^/] means any character that is not / (so, not a subdirectory), and + means one or more of the previous. So, {file_category,[^/]+} means that file_category is matched to any string that doesn't have a / in it.
$ anchors the match at the end of the string, and (?<!string) is a negative lookbehind: it fails the match if the characters immediately before the current position are string. So, $(?<!\.csv) means don't match if the final characters in the string are .csv. Therefore, {archive_name,[^/]+$(?<!\.csv)} means that archive_name is matched to any string that doesn't have a / in it, and that doesn't end in .csv.
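The two constraints can be tried out on their own with Python's re module, which is the regex flavor Snakemake uses. This is only a sketch: the helper name matches and the sample strings temperature.csv and sub/dir are made up for illustration.

```python
import re

# Constraint on file_category: one or more characters, none of them "/".
NO_SLASH = r"[^/]+"
# Constraint on archive_name: no "/", must reach the end of the string,
# and the characters just before the end must not be ".csv".
NO_SLASH_NOT_CSV = r"[^/]+$(?<!\.csv)"


def matches(constraint, text):
    """Return True if the whole of text satisfies the constraint."""
    return re.fullmatch(constraint, text) is not None


print(matches(NO_SLASH, "dynamic_mntoha"))  # True
print(matches(NO_SLASH, "sub/dir"))  # False: contains "/"
print(matches(NO_SLASH_NOT_CSV, "clarity_06_N46.00-47.00_W94.50-97.00"))  # True
print(matches(NO_SLASH_NOT_CSV, "temperature.csv"))  # False: ends in .csv
```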

Put all this together and you get:

folder = directory("1_fetch/out/{file_category,[^/]+}/{archive_name,[^/]+$(?<!\.csv)}")
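As a rough check of the full pattern, the sketch below writes it as a plain Python regex with one named group per wildcard. This only approximates how Snakemake resolves wildcards (its real matching has more machinery), and the function name parse_output and the non-matching sample paths are hypothetical.

```python
import re

# Each "{name,regex}" wildcard is written here as a named group.
OUTPUT_PATTERN = re.compile(
    r"1_fetch/out/(?P<file_category>[^/]+)/(?P<archive_name>[^/]+$(?<!\.csv))"
)


def parse_output(path):
    """Return the wildcard values for path, or None if it does not match."""
    m = OUTPUT_PATTERN.fullmatch(path)
    return m.groupdict() if m else None


# The extraction folder from the example above matches.
print(parse_output(
    "1_fetch/out/dynamic_mntoha/clarity_06_N46.00-47.00_W94.50-97.00"
))
# A file inside the extraction folder does not match (extra "/").
print(parse_output("1_fetch/out/dynamic_mntoha/some_archive/some_file.txt"))
# A csv file does not match (ends in .csv).
print(parse_output("1_fetch/out/obs_mntoha/temperature.csv"))
```

The first call returns a dict with file_category and archive_name filled in; the other two return None.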

DOI-USGS / lake-temperature-lstm-static

Stop unzip_archive from matching files #22