genepi / imputationserver2


[question] Possible untracked outputs from workflow step #6

Open abought opened 1 year ago

Question

While debugging the workflow one command at a time, I noticed that a command generated additional output files not tracked in the workflow.

When I run in s3 mode, these files don't get sent back to the working directory. Should they be declared as outputs?

Steps to reproduce

Step: IMPUTATIONSERVERINPUT_VALIDATIONINPUT_VALIDATION_VCF

I manually extracted the command and ran it locally:

    java -Xmx8192M -jar ./imputationserver-utils.jar validate --population mixed --phasing eagle --reference example-reference-panel.json --build hg38 --mode imputation --minSamples 20 --maxSamples 25000 --report cloudgene.report.json /datasets/imputation/chr22_1000g_example/chr22.OmniExpress.1K.1000G_b38.chr.vcf.gz
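
One way to see exactly which files a step leaves behind is to snapshot the directory before and after running the command. This is a minimal sketch, assuming bash with `find`, `sort`, `comm`, and `mktemp` available; `new_files_after` is a hypothetical helper name and is not part of the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command inside a directory and print the files
# that exist afterwards but did not exist before.
# Usage: new_files_after <dir> <command> [args...]
new_files_after() {
    local dir=$1; shift
    local before after status
    before=$(mktemp)
    after=$(mktemp)
    # Snapshot the file names before the command runs.
    (cd "$dir" && find . -type f | LC_ALL=C sort) > "$before"
    # Run the command in that directory; send its own output to stderr so
    # stdout carries only the list of new files.
    (cd "$dir" && "$@") >&2
    status=$?
    # Snapshot again once the command has finished.
    (cd "$dir" && find . -type f | LC_ALL=C sort) > "$after"
    # Lines present only in the second snapshot are newly created files.
    LC_ALL=C comm -13 "$before" "$after"
    rm -f "$before" "$after"
    return "$status"
}
```

Passing the validate command above as the arguments (with the container's working directory as `<dir>`) should print any file the step creates beyond its declared output. Files that are modified rather than created are not reported by this sketch.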

Actual declared outputs in workflow

The process declares that the input file is passed, unmodified, as output to the next step:
https://github.com/genepi/nf-imputationserver/blob/cfd9a05dfa5161dae3855f936a275a60351bd00f/modules/local/input_validation/input_validation_vcf.nf#L10-L11

Actual outputs of interest

Additional files (such as the tbi and the report) are generated inside the container; if not exported, they are removed when the container is stopped.

Suggestions

Would Nextflow copy files back to the S3 workdir even if they are undeclared?

Looking at all S3 folders used by the job, the answer appears to be "no": the extra files are not staged to the workdir in the S3 bucket.
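
This check can also be scripted instead of browsing the bucket by hand. This is a sketch, assuming the AWS CLI is installed and that `aws s3 ls --recursive` prints one object per line with the key as the last whitespace-separated field (so keys containing spaces would not be handled); `missing_from_listing` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an `aws s3 ls --recursive` listing on stdin and
# print every expected file name that is absent from the listing.
# Usage: aws s3 ls --recursive "s3://<bucket>/<workdir-prefix>/" \
#            | missing_from_listing <name> [<name>...]
missing_from_listing() {
    local listing name
    # Reduce each listed key to its basename (text after the last "/").
    listing=$(awk '{ n = split($NF, parts, "/"); print parts[n] }')
    for name in "$@"; do
        # -F: literal match, -x: whole line, -q: quiet.
        if ! printf '%s\n' "$listing" | grep -Fxq -- "$name"; then
            printf '%s\n' "$name"
        fi
    done
}
```

The bucket and prefix in the usage comment are placeholders; the names to pass would be whatever files the step is expected to produce, for example the report file named in the `--report` argument.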

Rather than trust one edge case, I next examined .command.run, the staging script that Nextflow generates when using AWS Batch. From the snippet below, it seems that only declared inputs/outputs are staged to or unstaged from the workdir: nxf_stage downloads only the input VCF and the two command scripts, and nxf_unstage uploads nothing. This might explain why other files (like tbi and report) are not found in the workdir after the job completes.

nxf_stage() {
    true
    # stage input files
    downloads=(true)
    rm -f chr22.OmniExpress.1K.1000G_b38.chr.vcf.gz
    rm -f .command.sh
    rm -f .command.run
    downloads+=("nxf_s3_download s3://pouncer-development-nextflow-storage/inputs/chr22.OmniExpress.1K.1000G_b38.chr.vcf.gz chr22.OmniExpress.1K.1000G_b38.chr.vcf.gz")
    downloads+=("nxf_s3_download s3://pouncer-development-nextflow-storage/runs/2023-10-25-21-44-35-xbzru/workdir/67/599c830cb641ceca4d0b01b1808849/.command.sh .command.sh")
    downloads+=("nxf_s3_download s3://pouncer-development-nextflow-storage/runs/2023-10-25-21-44-35-xbzru/workdir/67/599c830cb641ceca4d0b01b1808849/.command.run .command.run")
    nxf_parallel "${downloads[@]}"
}

nxf_unstage() {
    true
    [[ ${nxf_main_ret:=0} != 0 ]] && return
}
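
To repeat this inspection on other tasks without reading each script by hand, the transfer calls can be pulled out of `.command.run`. This is a rough sketch, assuming downloads appear as `nxf_s3_download` calls as in the snippet above, and that uploads would appear as calls to a matching `nxf_s3_upload` helper (an assumption on my part, since no upload call appears in this script); `list_transfers` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every S3 transfer call found in a generated
# task script (defaults to .command.run in the current directory).
# Usage: list_transfers [path/to/.command.run]
list_transfers() {
    local script=${1:-.command.run}
    # -o prints only the matched text: the helper name and its two arguments.
    # Function definitions such as "nxf_s3_download() {" do not match because
    # the pattern requires a space directly after the helper name.
    grep -oE 'nxf_s3_(download|upload) [^" ]+ [^" ]+' "$script"
}
```

If the output contains download lines but no upload lines, as would be the case for the script above, nothing is being copied back to the workdir for that task.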