NASA-IMPACT / veda-pforge-job-runner

Apache Beam + EMR Serverless Job Runner for Pangeo Forge Recipes

Intermittent Failures of ConflictException on EMR Serverless SubmitJobRun #58

Closed ranchodeluxe closed 3 months ago

ranchodeluxe commented 3 months ago

Problem

I've been dealing with this bug and thought I had it fixed. It doesn't make much sense, but calls to submit_job_run are failing intermittently with botocore.errorfactory.ConflictException: An error occurred (ConflictException) when calling the StartJobRun operation: Request did not match the original

At first I thought it was because the EMR application wasn't running, so I wrote a stateful exponential retry. That worked fine, but I still have these issues.
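For readers following along, a retry of the kind described above might look roughly like the sketch below. It is not the code from this repo: the function name, the attempt count, and the delays are all made up for illustration. It assumes `client` is a boto3 `emr-serverless` client (or anything with the same `get_application` method) and polls the application state with exponentially growing waits.

```python
import time


def wait_for_application_started(client, application_id, max_attempts=6,
                                 base_delay=1.0, sleep=time.sleep):
    """Poll an EMR Serverless application until it reports STARTED.

    Hypothetical helper: `client` is assumed to be something like
    boto3.client("emr-serverless"). Returns the number of waits performed.
    """
    for attempt in range(max_attempts):
        response = client.get_application(applicationId=application_id)
        state = response["application"]["state"]
        if state == "STARTED":
            return attempt
        if state == "TERMINATED":
            # A terminated application will never start, so stop early.
            raise RuntimeError(f"application {application_id} is TERMINATED")
        # Exponential backoff: base_delay, 2x, 4x, ... between polls.
        sleep(base_delay * 2 ** attempt)
    raise TimeoutError(
        f"application {application_id} not STARTED after {max_attempts} polls"
    )
```

The `sleep` argument is only there so the waiting can be swapped out in tests.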

Ideas

There's nothing great about the error in the boto3 docs, but the Java SDK docs talk about the cluster possibly being in a different state. Since boto3 retries by default, I'm wondering if a call is failing, boto3 is then making multiple calls, and a different error ends up bubbling up.

ranchodeluxe commented 3 months ago

Fixed with an idempotency token.
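The thread doesn't show the actual change, so the following is only a guess at the shape of the fix. boto3's EMR Serverless `start_job_run` accepts a `clientToken` parameter for idempotency. The sketch generates one token per logical job, outside the retry loop, so that every retry sends an identical request. The function name, the retry count, and the set of retryable exceptions are assumptions for illustration.

```python
import uuid


def start_job_run_idempotently(client, application_id, execution_role_arn,
                               job_driver, attempts=3,
                               retryable=(ConnectionError, TimeoutError)):
    """Submit one job run, reusing a single idempotency token across retries.

    Hypothetical helper: `client` is assumed to be something like
    boto3.client("emr-serverless").
    """
    # One token per logical job. Generating it inside the loop would make
    # each retry look like a brand-new job to the service.
    token = str(uuid.uuid4())
    last_error = None
    for _ in range(attempts):
        try:
            return client.start_job_run(
                applicationId=application_id,
                clientToken=token,
                executionRoleArn=execution_role_arn,
                jobDriver=job_driver,
            )
        except retryable as exc:
            # Retry with the same token and the same parameters.
            last_error = exc
    raise last_error
```

If the request parameters change between attempts while the token stays the same, the service can presumably no longer treat the calls as the same request, so the token and the parameters should be kept together for the life of the job.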