broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

S3 filesystem retry download #5946

Open rdavison opened 4 years ago

rdavison commented 4 years ago

Originally posted this in the JIRA issue tracker back in August. Reposting here since it didn't get a response over there: https://broadworkbench.atlassian.net/browse/BA-6548

Hello everyone,

I am attempting to use the AWS Batch backend for Cromwell to run a WDL script which runs several sub-jobs in parallel (I believe the correct parlance is a scatter). I noticed that in some of the jobs of the scatter, some reference files failed to download from S3 even though they existed ("Connection reset by peer"). This failure caused the overall job to fail after one hour of running.

I believe this issue was reported and fixed before, around May 2019, but more recently, in June 2020, the AWS Batch backend received a major overhaul (by @markjschreiber, thanks! I'm tagging you because I suspect you might be the resident expert here :) ), and the previous fix (using the ECS proxy image) was reportedly made obsolete.

I also see that the s3fs library appears to be vendored into Cromwell, and after digging around, it appears that one might be able to set retries via an environment variable(?). But even if that were to work, it would be much nicer if it were configurable through Cromwell's config file somehow.

So that brings me to my final question: is there some configuration that allows me to retry failed downloads some number of times before failing the whole job? Or perhaps there is some alternative configuration which I've overlooked that someone could point me to? Thanks!

In addition, I'm just wondering whether there is a service limit I might be running into.
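
(For anyone hitting the same problem: the AWS CLI and SDKs honor a pair of standard retry environment variables. Whether Cromwell's generated job scripts pick these up depends on the backend version, so treat this as a sketch rather than a confirmed fix; the bucket and key below are placeholders.)

```sh
# Standard AWS CLI / SDK retry settings (not Cromwell-specific). Exporting them
# in the container environment may help if input localization goes through the AWS CLI.
export AWS_RETRY_MODE=standard   # "standard" or "adaptive" retry with backoff
export AWS_MAX_ATTEMPTS=5        # total attempts per API call, including the first

# Hypothetical smoke test against a download that previously failed with
# "Connection reset by peer":
aws s3 cp s3://my-bucket/reference.fa /tmp/reference.fa
```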

markjschreiber commented 4 years ago

There is a pull request in for AWS CLI call retries, which will mitigate some of the problem. Currently, full retries of tasks are not supported via the Cromwell server coordinating with the AWS Batch backend.

Having said that, you could identify the AWS Batch job definition and edit it to create a new revision that uses the AWS Batch retry strategy. AWS Batch will then retry any job that doesn't exit cleanly (a non-zero return code, or the container host being terminated) up to a maximum number of attempts. When that happens, the job ID remains the same, so as far as Cromwell knows it is the same job. I haven't had a chance to test this out myself, but it's on my to-do list. Let me know if you try it. If it works, the same approach would allow for recovery in the case of a Spot interruption.

https://docs.aws.amazon.com/batch/latest/userguide/job_retries.html
https://docs.aws.amazon.com/batch/latest/userguide/job_definition_parameters.html#retryStrategy
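
A hedged sketch of that approach with the AWS CLI (the job definition name and the container-properties file are placeholders; Cromwell generates its own definition names, so look them up first):

```sh
# Look up the job definition that Cromwell registered for the failing task.
aws batch describe-job-definitions \
    --job-definition-name my-cromwell-task --status ACTIVE

# Registering under the same name creates a new revision. The retry strategy
# makes AWS Batch retry any attempt that does not exit cleanly (non-zero
# return code, or the container host being terminated) up to 3 attempts,
# while the job ID that Cromwell is polling stays the same.
aws batch register-job-definition \
    --job-definition-name my-cromwell-task \
    --type container \
    --container-properties file://container-props.json \
    --retry-strategy attempts=3
```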

rdavison commented 4 years ago

@markjschreiber

I tested your theory, and while the job was able to complete successfully the second time around (after changing the job definition), the status in the Cromwell database was not updated. Do you reckon it would be possible for me to manually change a record in the database to get Cromwell to continue where it left off, or will I need to resubmit the entire workflow and hope that call caching works?

In this particular workflow, I've observed that call caching works... sometimes(?)... but I was surprised by the number of cache misses, which I'm not really sure how to troubleshoot.

rdavison commented 4 years ago

When does the database get notified of a job's failure?

  • the moment the job fails, or

  • when AWS Batch finally gives up trying to run the job?

I'm asking because, from what I can tell, once a workflow is in a terminal state, some records are deleted from the database, which means it would be impossible to re-run a job that is in a failed state. This is precisely what I tested: I navigated to the failed job in AWS Batch and pressed the "Clone Job" button.

Perhaps a better test would be to literally create a new job definition revision (as you pointed out earlier) to see whether a failed attempt can be rerun without impacting the status of the workflow.

As for my current situation, it seems I'm SOL and just have to bite the bullet, resubmit the entire workflow, and cross my fingers that call caching works. (Just for the record, I installed Cromwell by following the instructions here: https://github.com/broadinstitute/cromwell/issues/5977#issuecomment-716229310 )

markjschreiber commented 4 years ago

Hi Richard,

The Cromwell server is responsible for updating the database. The general flow of information is AWS Batch -> Cromwell AWS Batch Backend Module -> Cromwell Metadata actor -> DB

Cromwell only becomes aware of a failure if the AWS Batch backend module detects a failure in Batch (usually a non-zero return code for the job). I haven't tested it, but I think that if you define a retry strategy in the job definition, Cromwell will not even be aware of the retry unless all of the retries fail.

Any or all of the metadata entries in the database can be deleted if you observe weird caching behavior. You can even drop the whole DB, and the Cromwell server will regenerate it the next time it starts.
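
(A hedged illustration of that last point, assuming a MySQL-backed Cromwell with a database named `cromwell`; both are assumptions. Stop the server first, and note this discards all workflow history, including call-cache entries.)

```sh
# Drop and recreate the (assumed) "cromwell" database; Cromwell rebuilds the
# schema on its next startup, provided its DB user has CREATE privileges.
mysql -u cromwell -p -e 'DROP DATABASE cromwell; CREATE DATABASE cromwell;'
```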

rdavison commented 4 years ago

> If it works the same approach would allow for recovery in the case of Spot interruption

By the way, speaking of this, how would I submit a job to an on-demand compute environment manually? It seems that whenever I submit a workflow to Cromwell, it always runs on a Spot instance.

markjschreiber commented 4 years ago

Currently, the cromwell.conf file specifies the ARN of the queue that jobs are submitted to. You can either change this to a new queue, or change the queue to use (or prioritize) a compute environment that uses on-demand instances.
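
A minimal sketch of the relevant section of cromwell.conf, following the sample AWS backend configuration from the Cromwell docs (the bucket, account number, and queue name below are placeholders):

```hocon
backend {
  default = "AWSBatch"
  providers {
    AWSBatch {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        root = "s3://my-cromwell-bucket/cromwell-execution"  # placeholder bucket
        auth = "default"
        default-runtime-attributes {
          # Point this at a queue whose compute environment provisions
          # on-demand instances (or prioritizes them over Spot).
          queueArn = "arn:aws:batch:us-east-1:111122223333:job-queue/on-demand-queue"
        }
        filesystems {
          s3 { auth = "default" }
        }
      }
    }
  }
}
```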
