dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Running multiple backfills returns unexpectedly exit with code -9 #5480

Open samaravazquezperez opened 3 years ago

samaravazquezperez commented 3 years ago

Summary

We have an ETL pipeline that makes API requests and saves data files to a database instance. We have set up file partitions, one for each file that is saved. When we try to run multiple backfills, some of them fail with an unexpected exit code -9. This error has only ever been encountered while backfilling, never from a manual, scheduled, or sensor execution, and no other errors are thrown. We have run the same number of backfills locally and in an ECS cluster, and the errors are only thrown in the ECS instance.

Dagit UI/UX Issue Screenshots

image

Additional Info about Your Environment

Dagster is currently deployed in ECS



prha commented 2 years ago

Hi Samara... Can you tell me which run launcher and which run coordinator you have configured on your instance (e.g. in dagster.yaml)?

Exit code -9 usually means the process was killed with SIGKILL, most often because there aren't enough resources on the box you're running on (e.g. it ran out of memory). One possible explanation is that you're using the default run launcher, which means that all launched runs execute on the same box. If you're also using the default run coordinator, or the queued run coordinator with a max_concurrent_runs value that is too high, it could be that launching a backfill tries to execute more runs simultaneously than your ECS instance can handle.
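To see where a code like -9 comes from, here is a minimal Python sketch. It assumes the reported code follows Python's convention for child processes, where a negative exit status -N means the child was terminated by signal N. The helper name `run_and_kill` is made up for illustration, and the SIGKILL is sent by hand to stand in for what an out-of-memory kill would do on a POSIX system.

```python
import signal
import subprocess
import sys


def run_and_kill() -> int:
    """Start a child process, kill it with SIGKILL, and return its exit status."""
    # Child process that would otherwise sleep for a minute.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # Deliver SIGKILL by hand; an out-of-memory kill delivers the same signal.
    child.send_signal(signal.SIGKILL)
    # wait() returns the exit status; a negative value means "killed by signal".
    return child.wait()


exit_code = run_and_kill()
print(exit_code)  # negative signal number, i.e. -9 for SIGKILL on POSIX
```

Because SIGKILL cannot be caught, the killed process gets no chance to log anything, which would be consistent with no other errors showing up in the run.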

If this is the case, you could do one of a couple of things:

A) Change your configured run launcher to something like the EcsRunLauncher, so that each launched run executes in its own ECS task.
B) Change your run coordinator to the queued run coordinator, and limit the number of concurrent runs to some number based on the memory requirements of the jobs you're running.
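For option B, one rough way to pick a limit is to divide the memory left over for runs by the peak memory of a single run. The sketch below is only back-of-the-envelope arithmetic: the function name and all the numbers (8 GiB box, 2 GiB reserved for Dagit and the daemon, 1 GiB per run) are hypothetical and should be replaced with measurements from your own jobs.

```python
def max_concurrent_runs(total_memory_mib: int, reserved_mib: int, per_run_mib: int) -> int:
    """Estimate how many runs fit in memory at once (hypothetical helper)."""
    if per_run_mib <= 0:
        raise ValueError("per_run_mib must be positive")
    # Memory left for runs after setting aside room for everything else on the box.
    usable_mib = total_memory_mib - reserved_mib
    # Round down so the estimate never overcommits; clamp at zero if nothing is left.
    return max(usable_mib // per_run_mib, 0)


# Hypothetical figures: 8 GiB instance, 2 GiB reserved, 1 GiB peak per run.
limit = max_concurrent_runs(total_memory_mib=8192, reserved_mib=2048, per_run_mib=1024)
print(f"max_concurrent_runs: {limit}")
```

The result is a starting value for max_concurrent_runs on the queued run coordinator. A result of 0 means the box is too small to fit even one run alongside the reserved memory.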