The purpose of `pull_date.txt` is to provide an easy way to force the pipeline to re-download everything. However, when it is changed, only `lake_metadata.csv` is downloaded again; no zip files are re-downloaded. The cause of this bug is related to the use of multiple checkpoints when one of the checkpoints has a `directory` output. The zip files are inputs to the checkpoint `unzip_archive`. I created a MWE to illustrate and isolate the problem:
```python
import os
from zipfile import ZipFile

def get_outputs(wildcards):
    archive_file = checkpoints.get_archive_list.get().output[0]
    # archive_file = 'archives.txt'
    with open(archive_file, 'r') as f:
        archives = f.read().splitlines()
    suffixes = ['1', '2']
    return [f"out/{archive}/{archive}{suffix}.txt"
            for archive in archives for suffix in suffixes]

rule all:
    input: get_outputs

checkpoint get_archive_list:
    input: "date_created.txt"
    output: "archives.txt"
    shell: "echo 'a' > {output}; echo 'b' >> {output}"

rule get_zip_file:
    input: "date_created.txt"
    output: "zip/{archive}.zip"
    run:
        with ZipFile(output[0], 'w') as zf:
            # Add multiple files to the zip archive
            zf.writestr(wildcards.archive + '1.txt', wildcards.archive + '1 text')
            zf.writestr(wildcards.archive + '2.txt', wildcards.archive + '2 text')

checkpoint unzip_archive:
    input: "zip/{archive}.zip"
    output: directory("data/{archive,[^/]+}")
    shell: "unzip {input} -d {output}"

def data_file(wildcards):
    # Trigger checkpoint to unzip data file
    data_file_directory = checkpoints.unzip_archive.get(archive=wildcards.archive).output[0]
    return os.path.join(data_file_directory, wildcards.filename)

rule process_data_file:
    input: data_file
    output: "out/{archive}/{filename}"
    shell: "cp {input} {output}"
```
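As a sanity check on the MWE's zip handling, independent of Snakemake, the `get_zip_file`/`unzip_archive` pair corresponds to this plain-Python round trip (function names here are illustrative, not part of the workflow):

```python
import os
import tempfile
from zipfile import ZipFile

def make_archive(path, archive):
    # Mirrors rule get_zip_file: writes two members per archive name
    with ZipFile(path, 'w') as zf:
        zf.writestr(archive + '1.txt', archive + '1 text')
        zf.writestr(archive + '2.txt', archive + '2 text')

def extract_archive(path, dest):
    # Mirrors checkpoint unzip_archive ("unzip {input} -d {output}");
    # extractall creates missing parent directories itself
    with ZipFile(path) as zf:
        zf.extractall(dest)
    return sorted(os.listdir(dest))

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, 'a.zip')
    make_archive(zip_path, 'a')
    members = extract_archive(zip_path, os.path.join(tmp, 'data', 'a'))
    print(members)  # ['a1.txt', 'a2.txt']
```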
Execute this pipeline; everything behaves as expected.

```
snakemake -c1
```

But force the re-execution of `zip/a.zip`, and dependent jobs do not re-execute.

```
snakemake -c1 -R zip/a.zip
```
In `get_outputs`, if you bypass checkpoint `get_archive_list` by replacing `archive_file = checkpoints.get_archive_list.get().output[0]` with the hardcoded `archive_file = 'archives.txt'` (the commented-out line in the MWE), then re-executing `zip/a.zip` does cause downstream rules to re-execute. Also, a similar workflow with two checkpoints that does not have a directory as an output behaves as expected.
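For reference, the directory-free variant I mean replaces the `unzip_archive` checkpoint with one whose output is a plain file. The exact rules are not shown in the report, so this is only a sketch of the shape such a variant might take (the rule name `list_members` and the `members/` path are illustrative):

```python
# Variant: second checkpoint emits a regular file instead of directory(...)
checkpoint list_members:
    input: "zip/{archive}.zip"
    output: "members/{archive}.txt"
    shell: "unzip -Z1 {input} > {output}"  # -Z1 lists member names, one per line
```

With this shape, forcing `zip/a.zip` correctly invalidates downstream jobs, which is what isolates the bug to `directory()` checkpoint outputs.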