Closed vanetten closed 1 year ago
Thanks for trying out the solution and posting the issue. Could you include the logs from the CheckWorkflowTask Lambda function? Also, have you tried re-running it from the beginning? I just want to check whether the error is reproducible or transient.
For long-running jobs, the number of polls causes the Step Functions execution history to reach the hard quota of 25,000 events. For now, a workaround is to increase the wait duration between polls in the "e2e-sfn-stack.yml" template, which reduces the number of events. A long-term fix has been added as a to-do item.
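As a rough way to pick a new wait duration, the arithmetic can be sketched in Python. This is only a back-of-the-envelope estimate, not something taken from the template: the number of history events per polling loop iteration (9 here), the number of events reserved for the non-polling steps (500 here), and the function names are all assumptions for illustration. Check the actual per-iteration count in the execution history of a real run.

```python
import math

# Hard quota quoted in the error message.
HISTORY_EVENT_QUOTA = 25_000


def max_poll_iterations(events_per_iteration, reserved_events=0):
    """How many polling loop iterations fit under the quota.

    events_per_iteration: ASSUMED number of history events one pass through
        the wait/check loop produces; count them in a real execution history.
    reserved_events: ASSUMED budget kept back for the steps before and after
        the polling loop.
    """
    return (HISTORY_EVENT_QUOTA - reserved_events) // events_per_iteration


def min_wait_seconds(expected_runtime_seconds, events_per_iteration,
                     reserved_events=0):
    """Smallest wait between polls that keeps a run of the given length
    under the quota. Ignores the time the check itself takes, so the
    estimate errs on the safe side."""
    iterations = max_poll_iterations(events_per_iteration, reserved_events)
    if iterations <= 0:
        raise ValueError("no event budget left for polling")
    return math.ceil(expected_runtime_seconds / iterations)


if __name__ == "__main__":
    # Hypothetical example: a 36-hour workflow, 9 events per loop iteration,
    # 500 events reserved for everything outside the loop.
    runtime = 36 * 60 * 60
    print(min_wait_seconds(runtime, events_per_iteration=9, reserved_events=500))
```

With these assumed numbers the loop can run at most 2722 times, so a 36-hour workflow needs a wait of at least 48 seconds between polls. Picking a value comfortably above the computed minimum leaves room for runs that take longer than expected.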
I rebuilt my genome from my pair of .fastq.gz files (from Dante, Illumina) against hg38, and it mostly worked. After a little more than a day (the docs estimated 3-4 hours), Step Functions finally failed in the "WaitForOmicsWorkflow" loop with the error message "The execution reached the maximum number of history events (25000)." The Omics workflow says it completed successfully, and I think that's true: my output bucket contains the 7 outputs I would expect (bam, vcf, etc.). Step Functions simply gave up before attempting PostWorkflowIngest. I attempted to use https://github.com/awslabs/aws-sfn-resume-from-any-state.git to restart from where it left off, but it appears that code wasn't expecting the sort of error I got. I imagine I could figure out what the PostWorkflowIngest steps do and run them manually, but I wanted to know if there's some configuration I should tweak to avoid this in the future. I also wanted to let you know the result of my experimental use of your code.