Invalid argument when running proxy container

Hi all,

I'm encountering an issue when using the AWS Batch backend. I'm using the EFS (local) file system for the backend, not S3.

I've got a workflow that downloads fastq files as an initial step. The downloads run as a scatter over an input array of fastq files, and they fail non-deterministically: most shards generally complete, but 20% of the shards might fail in any given scatter.

A completed job has the following outputs in its shard output folder:

download_fastq-0-rc.txt  
download_fastq-0-stderr.log  
download_fastq-0-stdout.log  
input_fastq_specified_R1.fq.gz  
script  
tmp.71626c8d/

When Cromwell submits the job, it auto-generates a script to download the fastq. It's a very simple job, so here's an example script:

#!/bin/bash

cd /EFSROOT/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1
tmpDir=$(mkdir -p "/gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/tmp.bf92fa27" && echo "/gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/tmp.bf92fa27")
chmod 777 "$tmpDir"
export _JAVA_OPTIONS=-Djava.io.tmpdir="$tmpDir"
export TMPDIR="$tmpDir"
export HOME="$HOME"
(
cd /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1

)
outed746149="${tmpDir}/out.$$" erred746149="${tmpDir}/err.$$"
mkfifo "$outed746149" "$erred746149"
trap 'rm "$outed746149" "$erred746149"' EXIT
tee '/gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/download_normal-1-stdout.log' < "$outed746149" &
tee '/gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/download_normal-1-stderr.log' < "$erred746149" >&2 &
(
cd /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1

/usr/bin/aws s3 cp s3://pipeline.poc/sampledata/PSNL/FASTQS/HCC-1187BL-replicate_CAATGAGC-TATCGCAC.merged_R2.fq.gz .
)  > "$outed746149" 2> "$erred746149"
echo $? > /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/download_normal-1-rc.txt.tmp
(
# add a .file in every empty directory to facilitate directory delocalization on the cloud
cd /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1
find . -type d -exec sh -c '[ -z "$(ls -A '"'"'{}'"'"')" ] && touch '"'"'{}'"'"'/.file' \;
)
(
cd /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1
sync

)
mv /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/download_normal-1-rc.txt.tmp /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/download_normal-1-rc.txt

In this example, shard-0 succeeds and shard-1 fails with these error messages, retrieved from the AWS Batch CloudWatch logs:

AWS log of failed container job:

AWS log of failed container job-proxy:

In other examples, both succeed, both fail, or shard-0 fails and shard-1 succeeds. Which shard fails doesn't seem to matter. The error is always the same, and it comes from executing the script inside the container: INVALID ARGUMENT (as shown above)

I don't think it has to do with the nature of the job (downloading a fastq), since the error isn't about the actual command. It seems to be more about how the job's output is passed through the temporary named pipes (the ones the script creates with mkfifo) into the stdout / stderr log files (I think).
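One way to check that guess in isolation would be to repeat the script's mkfifo-plus-background-reader pattern by hand, once on the EFS mount and once on a container-local directory, and see whether only the EFS run fails. The following is a minimal sketch, not anything Cromwell provides: the function name fifo_roundtrip and the test file names are hypothetical, and it assumes bash with mkfifo available in the container.

```shell
#!/bin/bash
# Hypothetical helper (not part of Cromwell): create a named pipe in the given
# directory, push one line through it, and report whether it arrived intact.
fifo_roundtrip() {
    local dir="$1"
    local fifo="${dir}/fifo-test.$$"
    local result

    if ! mkfifo "$fifo"; then
        echo "mkfifo failed: $dir"
        return 1
    fi

    # Background reader, mirroring the `tee ... < "$fifo" &` lines in the generated script.
    cat < "$fifo" > "${fifo}.out" &
    local reader=$!

    # Writer side: blocks until the reader has opened the pipe.
    echo "hello" > "$fifo"
    wait "$reader"

    result=$(cat "${fifo}.out")
    rm -f "$fifo" "${fifo}.out"

    if [ "$result" = "hello" ]; then
        echo "ok: $dir"
    else
        echo "mismatch: $dir"
        return 1
    fi
}
```

Running `fifo_roundtrip /tmp` and then `fifo_roundtrip` on the shard directory under the EFS mount, repeated many times since the failure is intermittent, should show whether the named pipes on EFS are involved at all. If both always report ok, the cause is probably elsewhere.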

If anyone has seen this or has any advice, please help. Thanks

broadinstitute / cromwell

Invalid argument when running proxy container #5421