Use regular expressions to ensure that unzip_archive's output does not match files in subdirectories of 1_fetch/out/ or csv files in 1_fetch/out. Closes #13.
unzip_archive is a Snakemake checkpoint to unzip files from a zipped archive. It is a checkpoint because otherwise Snakemake won't track unzipped files from an archive and will delete or ignore them. Its output is a directory because we don't know how many unzipped files there will be, but we know which directory they'll be in after they are unzipped.
The output is the name of the directory that files are extracted to. Regular expressions ensure that neither files in subdirectories of 1_fetch/out/{file_category} nor CSV files in 1_fetch/out/ get matched.
First off, ignore the regular expressions and focus on the output string with wildcards only:
"1_fetch/out/{file_category}/{archive_name}"
file_category is the type of files we're unzipping, like "dynamic_mntoha" or "obs_mntoha".
archive_name is the name of the zipped archive itself (without the .zip extension), and also the name of the folder that the archive is being extracted to. For instance, if we unzip the MNTOHA clarity zip archive named clarity_06_N46.00-47.00_W94.50-97.00.zip, we'll extract its contents to 1_fetch/out/dynamic_mntoha/clarity_06_N46.00-47.00_W94.50-97.00. Then, file_category is dynamic_mntoha and archive_name is clarity_06_N46.00-47.00_W94.50-97.00.
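As a rough illustration of the split described above, the sketch below mimics an unconstrained wildcard pattern with Python's re module. It assumes that each unconstrained wildcard behaves like the regex .+ (Snakemake's documented default); the compiled pattern, the helper name parse_unconstrained, and the nested path are hypothetical and are not taken from the Snakefile.

```python
import re

# Approximation of an unconstrained Snakemake pattern: each {wildcard}
# is treated as the regex ".+", which is also allowed to match "/".
UNCONSTRAINED = re.compile(
    r"1_fetch/out/(?P<file_category>.+)/(?P<archive_name>.+)$"
)


def parse_unconstrained(path):
    """Return the wildcard values for path, or None if it does not match."""
    match = UNCONSTRAINED.match(path)
    return match.groupdict() if match else None


# The example from the text splits into the two wildcards as described.
example = "1_fetch/out/dynamic_mntoha/clarity_06_N46.00-47.00_W94.50-97.00"
print(parse_unconstrained(example))

# Hypothetical nested path: without constraints it still matches,
# because ".+" is free to absorb the extra "/".
nested = "1_fetch/out/dynamic_mntoha/clarity_06/extra.txt"
print(parse_unconstrained(nested))
```

The second call shows the kind of unwanted match that the regular expressions discussed next are meant to rule out.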
directory(...) tells Snakemake to treat this output as a directory and not a file. The Snakemake documentation on directory outputs has more detail.
folder = directory(...) allows us to reference this output by keyword, as checkpoints.unzip_archive.get().output.folder.
Snakemake allows regular expressions to match wildcards by placing them after a comma, like this: "{wildcard,regex}".
[^/] means any character that is not / (so, not a subdirectory), and + means one or more of the previous. So, {file_category,[^/]+} means that file_category is matched to any string that doesn't have a / in it.
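The [^/]+ constraint can be checked on its own with Python's re module. This is only a small sketch: the helper name is_single_path_component is made up for illustration, and re.fullmatch stands in for the way a wildcard must account for its whole segment of the path.

```python
import re


def is_single_path_component(value):
    """True if value is one or more characters, none of which is '/'."""
    return re.fullmatch(r"[^/]+", value) is not None


print(is_single_path_component("dynamic_mntoha"))      # no slash: accepted
print(is_single_path_component("dynamic_mntoha/sub"))  # contains a slash: rejected
```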
$ anchors the match at the end of the string, and (?<!string) is a negative lookbehind, which fails the match if the characters immediately before the current position are string. So, $(?<!\.csv) means don't match if the final characters in the string are .csv. Therefore, {archive_name,[^/]+$(?<!\.csv)} means that archive_name is matched to any string that doesn't have a / in it, and that doesn't end in .csv.
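Putting the two constraints into the output string from earlier gives "1_fetch/out/{file_category,[^/]+}/{archive_name,[^/]+$(?<!\.csv)}". The sketch below approximates that pattern as a plain Python regex by turning each "{wildcard,regex}" into a named group. The translation is an assumption about how the pattern behaves, not Snakemake's actual implementation, and the helper matches_output and every path other than the clarity example are hypothetical.

```python
import re

# Hand-written approximation of the constrained output pattern:
# each "{name,regex}" becomes a named group "(?P<name>regex)".
CONSTRAINED = re.compile(
    r"1_fetch/out/(?P<file_category>[^/]+)/(?P<archive_name>[^/]+$(?<!\.csv))$"
)


def matches_output(path):
    """True if path would be accepted as an unzip_archive output directory."""
    return CONSTRAINED.match(path) is not None


candidates = [
    # The example archive directory from the text: accepted.
    "1_fetch/out/dynamic_mntoha/clarity_06_N46.00-47.00_W94.50-97.00",
    # Hypothetical CSV file: rejected by the negative lookbehind.
    "1_fetch/out/obs_mntoha/example.csv",
    # Hypothetical file inside a subdirectory: rejected by [^/]+.
    "1_fetch/out/dynamic_mntoha/clarity_06/extra.txt",
]
for path in candidates:
    print(path, matches_output(path))
```

Only the first path is accepted. The other two fail for the reasons given above: one ends in .csv, and the other needs a / inside archive_name.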