broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
972 stars 354 forks source link

Caching timeout on S3 #5977

Open XLuyu opened 3 years ago

XLuyu commented 3 years ago

Hi,

I got a timeout exception during cache copying on AWS S3. The cache file size is 133GB. Given the file size, more time should be allowed for cache copying. Is there any config option that can tune this? Thank you in advance for any suggestions.

Backend: AWS Batch Cromwell version: 51 Error log:

Failure copying cache results for job BackendJobDescriptorKey_CommandCallNode_PreProcessingForVariantDiscovery_GATK4.SamTo FastqAndBwaMem:0:1 (TimeoutException: The Cache hit copying actor timed out waiting for a response to copy s3://xxxxx/cromwell-execution/Germ line_Somatic_Calling/441619a4-7ca8-490b-bd04-2f9981d3db0f/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/95aed08f-3045-45e4-94c9-ba0230851136 /call-SamToFastqAndBwaMem/shard-0/39T_R.unmerged.bam to s3://xxxxx/cromwell-execution/Germline_Somatic_Calling/c25a8561-808f-4b46-9bd2-ef0488 8c0031/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/8df24f46-2f4f-4557-a662-d630ac443736/call-SamToFastqAndBwaMem/shard-0/cacheCopy/39T_R.u nmerged.bam)

markjschreiber commented 3 years ago

Hi,

In Cromwell 52 we updated the S3 module to perform multithreaded, multipart copies to improve the size of results that may be cached. There are also additional improvements that have recently been merged into dev and should appear in the next release version (or you could build from source)

v52+ requires a new AWS configuration. Instructions are in https://aws-genomics-workflows.s3.amazonaws.com/Installing+the+Genomics+Workflow+Core+and+Cromwell.pdf

On Sat, Oct 24, 2020 at 8:27 PM Luyu notifications@github.com wrote:

Hi,

I got a timeout exception during cache copying on AWS S3. The cache file size is 133GB. Given the file size, more time should be allowed for cache copying. Is there any config option that can tune this? Thank you in advance for any suggestions.

Backend: AWS Batch Cromwell version: 51 Error log:

Failure copying cache results for job BackendJobDescriptorKey_CommandCallNode_PreProcessingForVariantDiscovery_GATK4.SamTo FastqAndBwaMem:0:1 (TimeoutException: The Cache hit copying actor timed out waiting for a response to copy s3://xxxxx/cromwell-execution/Germ

line_Somatic_Calling/441619a4-7ca8-490b-bd04-2f9981d3db0f/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/95aed08f-3045-45e4-94c9-ba0230851136 /call-SamToFastqAndBwaMem/shard-0/39T_R.unmerged.bam to s3://xxxxx/cromwell-execution/Germline_Somatic_Calling/c25a8561-808f-4b46-9bd2-ef0488

8c0031/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/8df24f46-2f4f-4557-a662-d630ac443736/call-SamToFastqAndBwaMem/shard-0/cacheCopy/39T_R.u nmerged.bam)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/cromwell/issues/5977, or unsubscribe https://github.com/notifications/unsubscribe-auth/AF2E6EMWLDPLNV7UM35OWWLSMNWFNANCNFSM4S56ELLQ .

XLuyu commented 3 years ago

I have moved to cromwell v53.1 now. However the caching still doesn't work for me. I consistently get an error like this:

2020-11-07 17:54:51,634 cromwell-system-akka.dispatchers.engine-dispatcher-35 INFO - 0123c178-1d7e-4fbc-94bd-7fc147e5ccfb-EngineJobExecutionActor-GATK4_WGS_ALL_IN_ONE.N_SamToFastqAndBwaMem:0:1 [UUID(0123c178)]: Failure copying cache results for job BackendJobDescriptorKey_CommandCallNode_GATK4_WGS_ALL_IN_ONE.N_SamToFastqAndBwaMem:0:1 (EnhancedCromwellIoException: [Attempted 1 time(s)] - RejectedExecutionException: ) 2020-11-07 17:54:51,635 cromwell-system-akka.dispatchers.engine-dispatcher-35 WARN - 0123c178-1d7e-4fbc-94bd-7fc147e5ccfb-EngineJobExecutionActor-GATK4_WGS_ALL_IN_ONE.N_SamToFastqAndBwaMem:0:1 [UUID(0123c178)]: Invalidating cache entry CallCachingEntryId(347) (Cache entry details: Some(7b292def-1477-4450-988a-e01627d61786:GATK4_WGS_ALL_IN_ONE.N_SamToFastqAndBwaMem:0)) 2020-11-07 17:54:51,673 cromwell-system-akka.dispatchers.backend-dispatcher-7385 WARN - 0123c178-1d7e-4fbc-94bd-7fc147e5ccfb-BackendCacheHitCopyingActor-0123c178:GATK4_WGS_ALL_IN_ONE.CreateSequenceGroupingTSV:-1:1-5 [UUID(0123c178)GATK4_WGS_ALL_IN_ONE.CreateSequenceGroupingTSV:NA:1]: Unrecognized runtime attribute keys: preemptible 2020-11-07 17:54:51,674 cromwell-system-akka.dispatchers.engine-dispatcher-38 INFO - 0123c178-1d7e-4fbc-94bd-7fc147e5ccfb-EngineJobExecutionActor-GATK4_WGS_ALL_IN_ONE.CreateSequenceGroupingTSV:NA:1 [UUID(0123c178)]: Call cache hit process had 0 total hit failures before completing successfully 2020-11-07 17:54:51,674 cromwell-system-akka.dispatchers.engine-dispatcher-33 INFO - 0123c178-1d7e-4fbc-94bd-7fc147e5ccfb-EngineJobExecutionActor-GATK4_WGS_ALL_IN_ONE.N_SamToFastqAndBwaMem:0:1 [UUID(0123c178)]: Could not copy a suitable cache hit for 0123c178:GATK4_WGS_ALL_IN_ONE.N_SamToFastqAndBwaMem:0:1. EJEA attempted to copy 1 cache hits before failing. Of these 1 failed to copy and 0 were already blacklisted from previous attempts). Falling back to running job.

As you can see, some small tasks worked but large tasks failed.

Hi, In Cromwell 52 we updated the S3 module to perform multithreaded, multipart copies to improve the size of results that may be cached. There are also additional improvements that have recently been merged into dev and should appear in the next release version (or you could build from source) v52+ requires a new AWS configuration. Instructions are in https://aws-genomics-workflows.s3.amazonaws.com/Installing+the+Genomics+Workflow+Core+and+Cromwell.pdf On Sat, Oct 24, 2020 at 8:27 PM Luyu @.***> wrote: Hi, I got a timeout exception during cache copying on AWS S3. The cache file size is 133GB. Given the file size, more time should be allowed for cache copying. Is there any config option that can tune this? Thank you in advance for any suggestions. Backend: AWS Batch Cromwell version: 51 Error log: Failure copying cache results for job BackendJobDescriptorKey_CommandCallNode_PreProcessingForVariantDiscovery_GATK4.SamTo FastqAndBwaMem:0:1 (TimeoutException: The Cache hit copying actor timed out waiting for a response to copy s3://xxxxx/cromwell-execution/Germ line_Somatic_Calling/441619a4-7ca8-490b-bd04-2f9981d3db0f/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/95aed08f-3045-45e4-94c9-ba0230851136 /call-SamToFastqAndBwaMem/shard-0/39T_R.unmerged.bam to s3://xxxxx/cromwell-execution/Germline_Somatic_Calling/c25a8561-808f-4b46-9bd2-ef0488 8c0031/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/8df24f46-2f4f-4557-a662-d630ac443736/call-SamToFastqAndBwaMem/shard-0/cacheCopy/39T_R.u nmerged.bam) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#5977>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AF2E6EMWLDPLNV7UM35OWWLSMNWFNANCNFSM4S56ELLQ .

XLuyu commented 3 years ago

Hi,

The improved multipart copying (api: CreateMultipartUpload) doesn't work for me. The cromwell server always checks the existence of the cached file before the copying finishes. In Cromwell v51 and before, some small files <100GB were able to be successfully cached. However, with Cromwell v53, even a 6GB result file got a problem of caching and has to rerun. Is there any way to prevent the timeout of the actor?

Hi, In Cromwell 52 we updated the S3 module to perform multithreaded, multipart copies to improve the size of results that may be cached. There are also additional improvements that have recently been merged into dev and should appear in the next release version (or you could build from source) v52+ requires a new AWS configuration. Instructions are in https://aws-genomics-workflows.s3.amazonaws.com/Installing+the+Genomics+Workflow+Core+and+Cromwell.pdf On Sat, Oct 24, 2020 at 8:27 PM Luyu @.***> wrote: Hi, I got a timeout exception during cache copying on AWS S3. The cache file size is 133GB. Given the file size, more time should be allowed for cache copying. Is there any config option that can tune this? Thank you in advance for any suggestions. Backend: AWS Batch Cromwell version: 51 Error log: Failure copying cache results for job BackendJobDescriptorKey_CommandCallNode_PreProcessingForVariantDiscovery_GATK4.SamTo FastqAndBwaMem:0:1 (TimeoutException: The Cache hit copying actor timed out waiting for a response to copy s3://xxxxx/cromwell-execution/Germ line_Somatic_Calling/441619a4-7ca8-490b-bd04-2f9981d3db0f/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/95aed08f-3045-45e4-94c9-ba0230851136 /call-SamToFastqAndBwaMem/shard-0/39T_R.unmerged.bam to s3://xxxxx/cromwell-execution/Germline_Somatic_Calling/c25a8561-808f-4b46-9bd2-ef0488 8c0031/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/8df24f46-2f4f-4557-a662-d630ac443736/call-SamToFastqAndBwaMem/shard-0/cacheCopy/39T_R.u nmerged.bam) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#5977>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AF2E6EMWLDPLNV7UM35OWWLSMNWFNANCNFSM4S56ELLQ .

markjschreiber commented 3 years ago

This is unusual, I have successfully call cached files of 1 TB in testing so I don’t know if size is the problem.

Does the issue persist after restarting the server? I committed a change to the develop branch a few weeks ago that does a better job of cleaning up the copying resources. If the restart solves the problem then you may want to build from the develop branch until the next release is sent out.

Also, is the bucket containing the source file the same bucket as the workflow bucket? If not, are they in the same region?

On Wed, Nov 11, 2020 at 4:28 AM Luyu notifications@github.com wrote:

Hi,

The improved multipart copying (api: CreateMultipartUpload) doesn't work for me. The cromwell server always checks the existence of the cached file before the copying finishes. In Cromwell v51 and before, some small files <100GB were able to be successfully cached. However, with Cromwell v53, even a 6GB result file got a problem of caching and has to rerun. Is there any way to prevent the timeout of the actor?

Hi, In Cromwell 52 we updated the S3 module to perform multithreaded, multipart copies to improve the size of results that may be cached. There are also additional improvements that have recently been merged into dev and should appear in the next release version (or you could build from source) v52+ requires a new AWS configuration. Instructions are in https://aws-genomics-workflows.s3.amazonaws.com/Installing+the+Genomics+Workflow+Core+and+Cromwell.pdf … <#m3227077625045957240> On Sat, Oct 24, 2020 at 8:27 PM Luyu @.***> wrote: Hi, I got a timeout exception during cache copying on AWS S3. The cache file size is 133GB. Given the file size, more time should be allowed for cache copying. Is there any config option that can tune this? Thank you in advance for any suggestions. Backend: AWS Batch Cromwell version: 51 Error log: Failure copying cache results for job BackendJobDescriptorKey_CommandCallNode_PreProcessingForVariantDiscovery_GATK4.SamTo FastqAndBwaMem:0:1 (TimeoutException: The Cache hit copying actor timed out waiting for a response to copy s3://xxxxx/cromwell-execution/Germ line_Somatic_Calling/441619a4-7ca8-490b-bd04-2f9981d3db0f/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/95aed08f-3045-45e4-94c9-ba0230851136 /call-SamToFastqAndBwaMem/shard-0/39T_R.unmerged.bam to s3://xxxxx/cromwell-execution/Germline_Somatic_Calling/c25a8561-808f-4b46-9bd2-ef0488 8c0031/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/8df24f46-2f4f-4557-a662-d630ac443736/call-SamToFastqAndBwaMem/shard-0/cacheCopy/39T_R.u nmerged.bam) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#5977 https://github.com/broadinstitute/cromwell/issues/5977>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AF2E6EMWLDPLNV7UM35OWWLSMNWFNANCNFSM4S56ELLQ .

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/broadinstitute/cromwell/issues/5977#issuecomment-725311491, or unsubscribe https://github.com/notifications/unsubscribe-auth/AF2E6EJEL23DXZQ4G3JVNQ3SPJKNNANCNFSM4S56ELLQ .

XLuyu commented 3 years ago

This is unusual, I have successfully call cached files of 1 TB in testing so I don’t know if size is the problem. Does the issue persist after restarting the server? I committed a change to the develop branch a few weeks ago that does a better job of cleaning up the copying resources. If the restart solves the problem then you may want to build from the develop branch until the next release is sent out. Also, is the bucket containing the source file the same bucket as the workflow bucket? If not, are they in the same region? On Wed, Nov 11, 2020 at 4:28 AM Luyu @.> wrote: Hi, The improved multipart copying (api: CreateMultipartUpload) doesn't work for me. The cromwell server always checks the existence of the cached file before the copying finishes. In Cromwell v51 and before, some small files <100GB were able to be successfully cached. However, with Cromwell v53, even a 6GB result file got a problem of caching and has to rerun. Is there any way to prevent the timeout of the actor? Hi, In Cromwell 52 we updated the S3 module to perform multithreaded, multipart copies to improve the size of results that may be cached. There are also additional improvements that have recently been merged into dev and should appear in the next release version (or you could build from source) v52+ requires a new AWS configuration. Instructions are in https://aws-genomics-workflows.s3.amazonaws.com/Installing+the+Genomics+Workflow+Core+and+Cromwell.pdf … <#m3227077625045957240> On Sat, Oct 24, 2020 at 8:27 PM Luyu @.> wrote: Hi, I got a timeout exception during cache copying on AWS S3. The cache file size is 133GB. Given the file size, more time should be allowed for cache copying. Is there any config option that can tune this? Thank you in advance for any suggestions. Backend: AWS Batch Cromwell version: 51 Error log: Failure copying cache results for job BackendJobDescriptorKey_CommandCallNode_PreProcessingForVariantDiscovery_GATK4.SamTo FastqAndBwaMem:0:1 (TimeoutException: The Cache hit copying actor timed out waiting for a response to copy s3://xxxxx/cromwell-execution/Germ line_Somatic_Calling/441619a4-7ca8-490b-bd04-2f9981d3db0f/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/95aed08f-3045-45e4-94c9-ba0230851136 /call-SamToFastqAndBwaMem/shard-0/39T_R.unmerged.bam to s3://xxxxx/cromwell-execution/Germline_Somatic_Calling/c25a8561-808f-4b46-9bd2-ef0488 8c0031/call-Tumor_Bam/PreProcessingForVariantDiscovery_GATK4/8df24f46-2f4f-4557-a662-d630ac443736/call-SamToFastqAndBwaMem/shard-0/cacheCopy/39T_R.u nmerged.bam) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#5977 <#5977>>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AF2E6EMWLDPLNV7UM35OWWLSMNWFNANCNFSM4S56ELLQ . — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#5977 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AF2E6EJEL23DXZQ4G3JVNQ3SPJKNNANCNFSM4S56ELLQ .

The weird thing is, even within the exact same attempt, stdout.log, stderr.log, and many side product files were successfully cached and copied to new places on S3, while only bam file failed. The only difference between bam file and other files is file size. I think this observation can exclude a lot of potential reasons.