Open samaravazquezperez opened 3 years ago
Hi Samara... Can you tell me which run launcher and which run coordinator you have configured on your instance (e.g. in your dagster.yaml)?
An exit code of -9 means the process was killed with SIGKILL, which usually indicates that the box you're running on doesn't have enough resources (e.g. it ran out of memory). One possible explanation is that you're running the default run launcher, which means that all launched jobs execute on the same box. If you're using the default run coordinator, or the queued run coordinator with a max_concurrent_runs value that is too high, then launching a backfill may try to execute more runs simultaneously than your ECS instance can handle.
If this is the case, you could do one of a number of things:
A) Change your configured run launcher to something like the EcsRunLauncher, so that each launched run executes in its own ECS task.
B) Change your run coordinator to the queued run coordinator, and limit the number of concurrent runs to a value based on the memory requirements of the jobs you're running.
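As a rough sketch of what those two options could look like in dagster.yaml: the module paths below are assumptions that depend on your Dagster version (the dagster-aws package must be installed for the ECS launcher), and the limit of 5 is only a placeholder, not a recommendation. Pick a number based on how much memory each run needs and how much the instance has.

```yaml
# Option A (assumed module path): launch each run in its own ECS task
# instead of on the box that hosts Dagit and the daemon.
run_launcher:
  module: dagster_aws.ecs
  class: EcsRunLauncher

# Option B (assumed module path): queue runs and cap how many execute at once.
# The value 5 is a placeholder; size it to the memory your jobs need.
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 5
```

The two options can be used separately or together. Note that the queued run coordinator relies on the Dagster daemon being up to dequeue and launch runs.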
Summary
We have an ETL pipeline which makes API requests to save data files to a database instance. We have set up file partitions, one for each file that is saved. When we try to run multiple backfills, some of them fail unexpectedly with exit code -9. This error has only ever been encountered while backfilling, never from a manual/scheduled/sensor execution, and no other errors are thrown. We have run the same number of backfills locally and in an ECS cluster, and the errors only occur in the ECS instance.
Additional Info about Your Environment
Dagster is currently deployed in ECS