broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Invalid argument when running proxy container #5421

Open pjongeneel opened 4 years ago

pjongeneel commented 4 years ago

Hi all,

I'm encountering an issue when using the AWS Batch backend. I'm using the EFS (local) file system for the backend, not S3.

I've got a workflow that downloads fastq files as an initial step. These jobs fail non-deterministically a fraction of the time. These jobs are a scatter over an input array of fastq files, and most of them generally complete. However, 20% of the shards might fail in any given scatter.

A complete job will have the following outputs in the shard output folder:

download_fastq-0-rc.txt  
download_fastq-0-stderr.log  
download_fastq-0-stdout.log  
input_fastq_specified_R1.fq.gz  
script  
tmp.71626c8d/

When Cromwell submits the job, it auto-generates a script to download the fastq. It's a very simple job, so here's an example script:

#!/bin/bash

cd /EFSROOT/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1
tmpDir=$(mkdir -p "/gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/tmp.bf92fa27" && echo "/gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/tmp.bf92fa27")
chmod 777 "$tmpDir"
export _JAVA_OPTIONS=-Djava.io.tmpdir="$tmpDir"
export TMPDIR="$tmpDir"
export HOME="$HOME"
(
cd /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1

)
outed746149="${tmpDir}/out.$$" erred746149="${tmpDir}/err.$$"
mkfifo "$outed746149" "$erred746149"
trap 'rm "$outed746149" "$erred746149"' EXIT
tee '/gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/download_normal-1-stdout.log' < "$outed746149" &
tee '/gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/download_normal-1-stderr.log' < "$erred746149" >&2 &
(
cd /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1

/usr/bin/aws s3 cp s3://pipeline.poc/sampledata/PSNL/FASTQS/HCC-1187BL-replicate_CAATGAGC-TATCGCAC.merged_R2.fq.gz .
)  > "$outed746149" 2> "$erred746149"
echo $? > /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/download_normal-1-rc.txt.tmp
(
# add a .file in every empty directory to facilitate directory delocalization on the cloud
cd /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1
find . -type d -exec sh -c '[ -z "$(ls -A '"'"'{}'"'"')" ] && touch '"'"'{}'"'"'/.file' \;
)
(
cd /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1
sync

)
mv /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/download_normal-1-rc.txt.tmp /gstore/cromwell_execution/FE_Somatic_Mutect2/ed746149-883f-4ef1-8b95-d3e9d7cd1423/call-download_normal/shard-1/download_normal-1-rc.txt

In this example, shard-0 succeeds and shard-1 fails with these error messages, retrieved from the AWS Batch CloudWatch logs:

AWS log of failed container job: image

AWS log of failed container job-proxy: image

In other examples, both succeed, both fail, or shard-0 fails and shard-1 succeeds. It doesn't seem to matter which shard it is. The error is always the same, and it comes from executing the script inside the container: INVALID ARGUMENT (as shown above).

I don't think it has to do with the nature of the job (downloading a fastq) since the error isn't regarding the actual command. It's more about the communication of the job to temporary stdout / err files (I think).
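
One way to test that hypothesis in isolation is sketched below. It reproduces the mkfifo + tee pattern from the generated script in a directory of your choice, without any download step. The function name check_fifo_tee and the log file name fifo-test-stdout.log are made up for illustration and are not part of Cromwell. Whether the file system is actually the cause is an assumption this only helps to check.

```shell
#!/bin/bash

# Hypothetical helper (not part of Cromwell): create a FIFO in the given
# directory, stream one line through it with tee, and verify the line
# arrives in the log file, mirroring the pattern in the generated script.
check_fifo_tee() {
    local dir="$1"
    local out="${dir}/out.$$"
    local log="${dir}/fifo-test-stdout.log"

    # Step 1: create the named pipe, as the generated script does.
    mkfifo "$out" || { echo "mkfifo failed in ${dir}" >&2; return 1; }

    # Step 2: start a background reader that copies the pipe into a log file.
    tee "$log" < "$out" > /dev/null &
    local tee_pid=$!

    # Step 3: write into the pipe from a subshell, like the wrapped command.
    ( echo "fifo test" ) > "$out"
    local rc=$?

    # If the writer could not open the pipe, the reader would block forever.
    if [ "$rc" -ne 0 ]; then
        kill "$tee_pid" 2>/dev/null
    fi
    wait "$tee_pid" || rc=$?

    # Step 4: collect the result and clean up.
    local content
    content=$(cat "$log" 2>/dev/null)
    rm -f "$out" "$log"

    [ "$rc" -eq 0 ] && [ "$content" = "fifo test" ]
}
```

Calling it as `check_fifo_tee /path/on/efs && echo OK || echo FAILED` from inside the job container, in a loop and on several shards at once, and then again against a local directory such as /tmp, would show whether the failure follows the file system or not.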

If anyone has seen this or has any advice, please help. Thanks

pjongeneel commented 4 years ago

So, in my testing, this appears to happen only when the scattered job is a download from S3. Is it possible that heavy network congestion could cause this error? The error itself doesn't seem to be associated with or come from the download, but then again I'm not sure what it means.