pjongeneel opened 4 years ago
So, in my testing, this appears to only happen when the scatter is a download-from-S3 job. Could heavy network congestion cause this error? The error itself doesn't seem to be associated with, or to come from, the download, but then again I'm not sure what it means.
Hi all,
I'm encountering an issue when using the AWS Batch backend. I'm using the EFS (local) file system for the backend, not S3.
I've got a workflow that downloads fastq files as an initial step. These jobs are a scatter over an input array of fastq files, and they fail non-deterministically: most shards generally complete, but roughly 20% of them might fail in any given scatter.
A complete job will have the following outputs in the shard output folder:
When cromwell submits the job, it auto-generates a script to download the fastq. It's a very simple job, so here's an example script:
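(The original script isn't reproduced here; as a rough, hypothetical illustration of the shape these generated scripts take, with `echo` standing in for the real S3 download command so the sketch runs anywhere:)

```shell
# Hypothetical sketch (not Cromwell's exact output): generated shard scripts
# generally cd into the shard's execution directory, run the task command with
# stdout/stderr redirected to files, and record the exit status in an rc file.
WORKDIR="$(mktemp -d)"            # stands in for the shard dir on EFS
cd "$WORKDIR"
(
  # the task command -- the real job runs an S3 download here;
  # `echo` stands in so this sketch is self-contained
  echo "downloading fastq..."
) > stdout 2> stderr
echo $? > rc                      # Cromwell polls an rc file like this for job status
```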
In this example, shard-0 succeeds and shard-1 fails with the following error messages, retrieved from the AWS Batch CloudWatch logs:
AWS log of failed container job:
AWS log of failed container job-proxy:
In other examples, both shards succeed, both fail, or shard-0 fails and shard-1 succeeds; there's no apparent pattern. The error is always the same one, raised while executing the script inside the container: INVALID ARGUMENT (as shown above).
I don't think it has to do with the nature of the job (downloading a fastq), since the error doesn't reference the actual command. It seems to be about the job writing to its temporary stdout/stderr files (I think).
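One way such an error can come from the stdout/stderr plumbing rather than the command itself: if the shell can't open the redirection target (say the shard directory on EFS is briefly unavailable), it reports an error and the command never runs at all. A minimal, hypothetical demonstration, where a missing directory stands in for a flaky mount:

```shell
# If the shell cannot open the redirection target, the command never executes --
# the failure belongs to the I/O plumbing, not the command being run.
missing="/tmp/no_such_dir_$$"                 # simulates an unavailable shard dir
if ( echo "never runs" > "$missing/stdout" ) 2>/dev/null; then
  result="redirect succeeded"
else
  result="redirect failed before the command ran"
fi
echo "$result"
```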
If anyone has seen this or has any advice, please help. Thanks