Question
While debugging the workflow one command at a time, I noticed that a command generated additional output files not tracked in the workflow.
When I run in s3 mode, these files don't get sent back to the working directory. Should they be declared as outputs?
Steps to reproduce
Step: IMPUTATIONSERVER:INPUT_VALIDATION:INPUT_VALIDATION_VCF
Manually extracted and ran the local command:
java -Xmx8192M -jar ./imputationserver-utils.jar validate --population mixed --phasing eagle --reference example-reference-panel.json --build hg38 --mode imputation --minSamples 20 --maxSamples 25000 --report cloudgene.report.json /datasets/imputation/chr22_1000g_example/chr22.OmniExpress.1K.1000G_b38.chr.vcf.gz
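One way to see exactly which undeclared files a step creates is to diff the task directory before and after the command. A hypothetical debugging sketch (not part of the workflow; the `touch` line stands in for the real `java ... validate ...` command above, and the filenames are taken from this issue):

```shell
# Hypothetical debugging aid: snapshot the task directory before and after a
# step to list any undeclared files it created. The 'touch' line stands in
# for the real `java ... validate ...` command from this issue.
taskdir=$(mktemp -d) && cd "$taskdir" || exit 1
find . -type f | sort > /tmp/files.before
touch cloudgene.report.json chr22.OmniExpress.1K.1000G_b38.chr.vcf.gz.tbi
find . -type f | sort > /tmp/files.after
comm -13 /tmp/files.before /tmp/files.after   # prints only the newly created files
```

Anything printed by the final `comm` is a candidate for an output declaration.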
Actual declared outputs in workflow
Declares that the input file is used, unmodified, as output to the next step. https://github.com/genepi/nf-imputationserver/blob/cfd9a05dfa5161dae3855f936a275a60351bd00f/modules/local/input_validation/input_validation_vcf.nf#L10-L11
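A minimal sketch of how the extra files could be declared alongside the existing pass-through output (the glob patterns, `optional:` flags, and `emit:` names here are assumptions for illustration, not the repo's actual code):

```nextflow
// Hypothetical sketch, not the repo's actual code: declaring the extra
// files so Nextflow unstages them to the S3 workdir alongside the VCF.
output:
path "*.vcf.gz", includeInputs: true
path "*.vcf.gz.tbi", optional: true, emit: tabix_index
path "cloudgene.report.json", optional: true, emit: report
```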
Actual outputs of interest
Additional files are generated inside the container; if not exported, they are removed when the container is stopped:
Cloudgene report file, gathered at the end of the workflow: cloudgene.report.json
Tabix index for the dataset file (in the same folder as the original input file, with a .tbi extension)
Suggestions
Consider adding the tabix index and the cloudgene.report.json file to the NF outputs declaration, so they are staged back to S3 at the end of the step (at present, even though the step completes, the S3 bucket workdir does not contain these files).
Also echo the contents of cloudgene.report.json to stdout. This is captured by the AWS Batch -> CloudWatch integration as well as nf command.out. Copying the info to the standard mechanism makes it a little easier to debug missing steps when something goes wrong.
If we are tabix-indexing the input file, are we also running bgzip? If so, then revisit the use of output: path("*.vcf.gz"), includeInputs: true. (We're tracking mutations, not just filenames.)
Would Nextflow capture files back to the S3 workdir, even if undeclared?
Looking at all S3 folders used by the job, the answer appears to be "no": the S3 bucket workdir does not stage extra files.
Rather than trust one edge case, I next evaluated the staging script command.run that Nextflow generates when using AWS Batch. From the snippet definition below, it seems that only declared inputs/outputs are subject to staging/unstaging to the workdir. This might explain why other files (like the tbi and the report) are not found in the workdir after the job completes.
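The observed behaviour can be illustrated with a schematic (this is NOT the actual command.run that Nextflow generates; the bucket path and filenames are invented): unstaging walks the declared output globs and nothing else, so undeclared files never leave the node.

```shell
# Schematic only -- NOT the actual command.run Nextflow generates.
# Unstaging iterates over declared output patterns; files matching no
# declared glob are simply left behind on the node. Paths are invented.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
touch chr22.vcf.gz cloudgene.report.json chr22.vcf.gz.tbi
s3_workdir="s3://example-bucket/work/ab/cd1234"   # assumed example path
for f in *.vcf.gz; do                             # the only declared output glob
  [ -e "$f" ] && echo "would upload: $f -> $s3_workdir/$f"
done
# cloudgene.report.json and the .tbi match no declared glob, so nothing
# ever copies them back to the workdir.
```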