DOI-USGS / lake-temperature-lstm-static

Predict lake temperatures at depth using static lake attributes

Changing `pull_date.txt` does not trigger download #15

Open AndyMcAliley opened 2 years ago

AndyMcAliley commented 2 years ago

The purpose of `pull_date.txt` is to provide an easy way to force the pipeline to re-download everything. However, when it is changed, only `lake_metadata.csv` is downloaded again; none of the zip files are re-downloaded.
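The download rules are supposed to pick up `pull_date.txt` as an input (directly or through upstream rules), so touching it should invalidate everything they produce. Schematically, the wiring is something like the following sketch (rule names, paths, and shell commands here are placeholders, not the repository's actual rules):

```
# Illustrative wiring only; placeholder names and commands, not the actual rules.
rule fetch_lake_metadata:
    input: "pull_date.txt"
    output: "lake_metadata.csv"
    shell: "touch {output}  # placeholder for the real download command"

rule fetch_zip_archive:
    input: "pull_date.txt"
    output: "{archive}.zip"
    shell: "touch {output}  # placeholder for the real download command"
```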

The cause of this bug is related to the use of multiple checkpoints when one of the checkpoints has a directory as an output. The zip files are inputs to the checkpoint `unzip_archive`. I created a MWE (a minimal Snakefile) to illustrate and isolate the problem:

```
import os
from zipfile import ZipFile

def get_outputs(wildcards):
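    # Input function for rule all: read the archive list written by the
    # get_archive_list checkpoint and request two processed files per archive.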
    archive_file = checkpoints.get_archive_list.get().output[0]
    # archive_file = 'archives.txt'
    with open(archive_file, 'r') as f:
        archives_lines = f.read().splitlines()
    archives = [line for line in archives_lines]
    suffixes = ['1', '2']
    return [f"out/{archive}/{archive}{suffix}.txt" for archive in archives for suffix in suffixes]

rule all:
    input: get_outputs

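# Checkpoint that writes the list of archive names ('a' and 'b') to archives.txt.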
checkpoint get_archive_list:
    input: "date_created.txt"
    output: "archives.txt"
    shell: "echo 'a' > {output}; echo 'b' >> {output}"

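# Stand-in for the download step: build a small zip archive containing two text files.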
rule get_zip_file:
    input: "date_created.txt"
    output: "zip/{archive}.zip"
    run: 
        with ZipFile(output[0], 'w') as zf: 
            # Add multiple files to the zip archive
            zf.writestr(wildcards.archive + '1.txt', wildcards.archive + '1 text')
            zf.writestr(wildcards.archive + '2.txt', wildcards.archive + '2 text')

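# The checkpoint in question: its output is a directory rather than a list of files.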
checkpoint unzip_archive:
    input: "zip/{archive}.zip"
    output: directory("data/{archive,[^/]+}")
    shell: "unzip {input} -d {output}"

def data_file(wildcards):
    # Trigger checkpoint to unzip data file
    data_file_directory = checkpoints.unzip_archive.get(archive=wildcards.archive).output[0]
    return os.path.join(data_file_directory, wildcards.filename)

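# Copy each unzipped member file to out/<archive>/<filename>.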
rule process_data_file:
    input: data_file
    output: "out/{archive}/{filename}"
    shell: "cp {input} {output}"
```

Execute this pipeline, and everything behaves as expected:

```
snakemake -c1
```

But force the re-execution of `zip/a.zip`, and dependent jobs do not re-execute:

```
snakemake -c1 -R zip/a.zip
```
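For reference, based on the MWE's rule graph, these are the jobs I would expect to be invalidated along with `zip/a.zip`; a dry run (`-n`) makes it easy to see which of them actually get scheduled:

```
# Expected invalidation chain for archive 'a':
#   zip/a.zip -> data/a (checkpoint unzip_archive) -> out/a/a1.txt, out/a/a2.txt (rule process_data_file)
# Observed: only the forced get_zip_file job runs; the downstream jobs are not re-executed.
snakemake -n -c1 -R zip/a.zip
```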

In `get_outputs`, if you bypass checkpoint `get_archive_list` by replacing

```
archive_file = checkpoints.get_archive_list.get().output[0]
```

with

```
archive_file = 'archives.txt'
```

then re-executing `zip/a.zip` does cause downstream rules to re-execute. Also, a similar workflow with two checkpoints, but without a directory as an output, behaves as expected (see the sketch below).
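Concretely, by a workflow without a directory output I mean something like the following sketch (not necessarily the exact variant I ran), where `unzip_archive` lists its member files explicitly and `data_file` still calls `.get()` so the checkpoint is evaluated:

```
# Sketch: unzip_archive declares its member files explicitly instead of a directory() output.
checkpoint unzip_archive:
    input: "zip/{archive}.zip"
    output:
        "data/{archive}/{archive}1.txt",
        "data/{archive}/{archive}2.txt"
    shell: "unzip -o {input} -d data/{wildcards.archive}"

def data_file(wildcards):
    # Still call .get() so the checkpoint is evaluated before the path is returned
    checkpoints.unzip_archive.get(archive=wildcards.archive)
    return f"data/{wildcards.archive}/{wildcards.filename}"
```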