CDCgov / phoenix

🔥🐦🔥PHoeNIx: A short-read pipeline for healthcare-associated and antimicrobial resistant pathogens
Apache License 2.0

[BUG] - Resume flag restarts pipeline at SPADES, even if pipeline has completed downstream steps #129

Open slsevilla opened 6 months ago

slsevilla commented 6 months ago

Describe the bug
If I have a pipeline error and pass Nextflow the -resume flag, the pipeline restarts at SPADES even if downstream modules have already completed. I think this is because of the way SPADES is being run: since the input is given path(full_outdir), any change to the output directory means this module has to be re-run.
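For context, a minimal sketch of why an input like path(full_outdir) interacts badly with -resume (the input set below is illustrative, not the exact PHOENIX declaration): Nextflow hashes every declared input when deciding whether a cached task can be reused, so staging an output directory whose contents keep changing will keep invalidating the cache.

    nextflow.enable.dsl = 2

    process SPADES {
        input:
        // Illustrative input set only (not the exact PHOENIX declaration):
        // Nextflow hashes every input when deciding whether a cached task can
        // be reused on -resume, so staging full_outdir as an input means any
        // change under the output directory can force this task to re-run.
        tuple val(meta), path(reads), path(unpaired_reads), path(full_outdir)

        output:
        tuple val(meta), path("scaffolds.fasta"), emit: scaffolds

        script:
        """
        spades.py -1 ${reads[0]} -2 ${reads[1]} -o ./
        """
    }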

Impact
When there is an error in one of the last processes, the majority (and most time-consuming) part of the pipeline has to be re-run, even though those steps have already run to completion.

To Reproduce

Expected behavior
I would expect the -resume flag to continue the pipeline from the point at which it failed.

Additional context
I did try to see if I could come up with a quick solution that didn't involve rewriting the SPADES module. If you remove the path(outdir) from the input and edit the code to avoid using the full path variable (I basically stopped the fail log from being created), it runs with a warning, which makes sense given the tuple is now incomplete:

WARN: Input tuple does not match input set cardinality declared by process `PHOENIX_SLIM:PHOENIX_EXTERNAL_SLIM:SPADES_WF:SPADES` -- offending value: [[id:v2], [/output/dir/work/82/e8d8d3abb9c77bd86eb3baad10465c/v2_1.trim.fastq.gz, /output/dir/work/82/e8d8d3abb9c77bd86eb3baad10465c/v2_2.trim.fastq.gz], /output/dir/work/c5/aef45b22ff908136d1fb98e291ce76/v2.singles.fastq.gz, /output/dir/work/fb/b3d5f52649fc64c19d8e2aea70f4e0/v2.kraken2_trimd.top_kraken_hit.txt, /output/dir/work/89/5d37276a9cc8e20e0ac46dd90b15d5/v2_raw_read_counts.txt, /output/dir/work/c1/cf4bb3c83f69750599b08228b27da5/v2_trimmed_read_counts.txt, /output/dir/work/a4/48219be006a02dcd541fb808ab5cb5/v2.kraken2_trimd.summary.txt, /output/dir/]

This isn't ideal, though, since I can see the output directory is important for error logging when SPADES doesn't create the expected outputs. Hoping you might have another, cleaner solution, or maybe a different perspective on why this isn't working as expected for me?
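For completeness, a minimal sketch of what keeping the declaration and the incoming channel in sync might look like if the directory were dropped entirely (the channel names here are illustrative, not the actual PHOENIX code); without a matching change on the channel side, the cardinality warning above is exactly what Nextflow reports:

    // Illustrative only: if path(full_outdir) is removed from the SPADES input
    // declaration, the channel feeding the process also needs its trailing
    // directory element dropped so the tuple shapes match again.
    spades_input_ch
        .map { it[0..-2] }                  // drop the last element (the output dir)
        .set { spades_input_no_outdir_ch }

    SPADES(spades_input_no_outdir_ch)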

jvhagey commented 6 months ago

Hi @slsevilla, thanks for your interest in PHX! We have noticed this but haven't totally dug into it, so thanks for taking the first troubleshooting pass. My hunch is that a hash gets created for the cache when the *.synopsis and *_summaryline_failure.tsv files are initially created, and then when we delete them it makes Nextflow think the process didn't complete fully. My initial thoughts on getting this SPAdes error information a different way either involve a fairly big code rewrite and/or have their own downsides. For example, we could stop running the downstream *.synopsis generation per sample and just give it a directory, but that has the downside of no longer running in parallel and could slow things way down. We will see what we can come up with, but as it's not totally breaking anything, this will end up lower on the priority list tbh.
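One idea that might be worth exploring (a hedged sketch only, not something tested against PHOENIX): Nextflow allows outputs to be declared optional, so if the failure artifacts were marked optional: true, removing them after a successful assembly might not leave the cached task looking incomplete on -resume. The emit names below are hypothetical:

    output:
    // Hypothetical declarations: optional outputs are allowed to be absent
    // without the task being treated as failed.
    tuple val(meta), path("*.synopsis"),                emit: synopsis_failure,    optional: true
    tuple val(meta), path("*_summaryline_failure.tsv"), emit: summaryline_failure, optional: true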

slsevilla commented 5 months ago

I was able to fix this by doing the following:

For anyone looking for the quick and dirty fix, this is it:

# preemptively create _summary_line.csv and .synopsis files in case spades fails (no contigs or scaffolds created), so we can still collect upstream stats.
    ${ica}pipeline_stats_writer_trimd.sh -a ${fastp_raw_qc} -b ${fastp_total_qc} -c ${reads[0]} -d ${reads[1]} -e ${kraken2_trimd_report} -f ${k2_bh_summary} -g ${krona_trimd}

    #get version information
    bspades_version=\$(${ica}beforeSpades.sh -V)
    pipestats_version=\$(${ica}pipeline_stats_writer_trimd.sh -V)
    aspades_version=\$(${ica}afterSpades.sh -V)

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        spades: \$(spades.py --version 2>&1 | sed 's/^.*SPAdes genome assembler v//; s/ .*\$//')
        spades_container: ${container}
        \${bspades_version}
        \${aspades_version}
        \${pipestats_version}
    END_VERSIONS

    # Set default to be that spades fails and doesn't create scaffolds or contigs
    spades_complete=run_failure,no_scaffolds,no_contigs
    echo \$spades_complete | tr -d "\\n" > ${prefix}_spades_outcome.csv

    if [[ -z \$(zcat $unpaired_reads) ]]; then
        spades.py \\
            $args \\
            --threads $task.cpus \\
            --memory $maxmem \\
            $input_reads \\
            --phred-offset $phred_offset \\
            -o ./

    else
        spades.py \\
            $args \\
            --threads $task.cpus \\
            --memory $maxmem \\
            $single_reads \\
            $input_reads \\
            --phred-offset $phred_offset \\
            -o ./
    fi

    mv spades.log ${prefix}.spades.log

    # Overwrite default that spades failed
    # Lets downstream process know that spades completed ok - see spades_failure.nf subworkflow
    spades_complete=run_completed
    echo \$spades_complete | tr -d "\\n" > ${prefix}_spades_outcome.csv

    # Create a summaryline file that will be deleted later if spades is successful; if not, this line shows up in the final Phoenix_output_summary file.
    # Create the '*_spades_outcome.csv' file to state whether spades failed and whether contigs or scaffolds were created. See the spades_failure.nf subworkflow.
    # This file determines whether the downstream processes GENERATE_PIPELINE_STATS_FAILURE and CREATE_SUMMARY_LINE_FAILURE will run (if spades creates contigs but not scaffolds).
    ${ica}afterSpades.sh

This allows me to pick the run back up after SPAdes if the pipeline stops for some reason.
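With that change in place, re-running with -resume should reuse the cached SPAdes task; the entry point, profile, and paths below are placeholders rather than a prescribed PHOENIX command:

    # Placeholder invocation; adjust the entry point, profile, and paths to your setup.
    nextflow run cdcgov/phoenix -entry PHOENIX -profile singularity \
        --input samplesheet.csv --kraken2db /path/to/kraken2_db \
        -resume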

I think (for us) a key to the NF workflows is being able to pick up where something fails or stops, so the downside of the synopsis file failing because of a SPAdes failure is something I'll have to handle when/if it happens (it hasn't yet). I hadn't implemented the 'fairy' rules until this last update, and it's happening there too. That's another huge rewrite to consider...