dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
Apache License 2.0
11.12k stars 1.39k forks

Clean up pods via TTL Seconds After Finished by configuring HELM deployment instead of individual Jobs. #11041

Open dangal95 opened 1 year ago

dangal95 commented 1 year ago

What's the use case?

The Kubernetes Executor creates a new Pod every time a job runs. The pod takes a while to be deleted after the job finishes, unless you specify a tag on each job in the following way:

            "dagster-k8s/config": {
                "job_spec_config": {
                    "ttl_seconds_after_finished": TIME_IN_SECONDS
                }
            }

The feature I am requesting is a configuration option that is part of the Dagster deployment, rather than of specific jobs, so that all job pods can be cleaned up more quickly.
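
For reference, here is a minimal sketch of how the per-job tag shown above might be built in Python. The helper name `k8s_ttl_tags` is hypothetical and not part of Dagster; only the tag keys are taken from the snippet above.

```python
from typing import Any, Dict


def k8s_ttl_tags(ttl_seconds: int) -> Dict[str, Any]:
    """Build the per-job tag from the snippet above.

    Hypothetical helper for illustration; it is not a Dagster API.
    """
    if ttl_seconds < 0:
        raise ValueError("ttl_seconds must be non-negative")
    return {
        "dagster-k8s/config": {
            "job_spec_config": {
                # Seconds Kubernetes waits after the Job finishes before deleting it.
                "ttl_seconds_after_finished": ttl_seconds,
            }
        }
    }
```

The resulting dict would be passed as the `tags` argument of a job definition, for example `@job(tags=k8s_ttl_tags(300))`. This still has to be repeated on every job, which is the duplication a deployment-level setting would avoid.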

Ideas of implementation

Add this functionality as a configuration in the values.yaml file used to deploy Dagster on Kubernetes.

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

rexledesma commented 1 year ago

You can set this outside of Dagster, by setting up a mutating admission webhook in your cluster to specify the TTL of created or finished Kubernetes jobs: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#ttl-after-finished-controller.
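
As a rough illustration of the webhook approach, the sketch below shows only the mutation logic: given an `AdmissionReview` request for a Job, it returns a response carrying a JSONPatch that sets `spec.ttlSecondsAfterFinished` when the Job has none. The function name `mutate_job` and the default of 300 seconds are assumptions made for this example; serving it over HTTPS and registering a `MutatingWebhookConfiguration` are left out.

```python
import base64
import json
from typing import Any, Dict

DEFAULT_TTL_SECONDS = 300  # assumed value, chosen only for illustration


def mutate_job(review: Dict[str, Any], ttl_seconds: int = DEFAULT_TTL_SECONDS) -> Dict[str, Any]:
    """Return an AdmissionReview response that adds a TTL to Jobs lacking one."""
    request = review["request"]
    job = request.get("object") or {}
    spec = job.get("spec") or {}

    # Always allow the request; the webhook only mutates, it never rejects.
    response: Dict[str, Any] = {"uid": request["uid"], "allowed": True}

    if job.get("kind") == "Job" and "ttlSecondsAfterFinished" not in spec:
        patch = [
            {"op": "add", "path": "/spec/ttlSecondsAfterFinished", "value": ttl_seconds}
        ]
        response["patchType"] = "JSONPatch"
        # The API server expects the patch as base64-encoded JSON.
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Jobs that already carry a TTL, such as those tagged per job in Dagster, are left untouched, so a per-job setting would still take precedence over the cluster-wide default.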

eminizer commented 3 months ago

Hey, looks like it's been a while on this, but I found it through a Google search and wanted to add my two cents: I think this would be a useful feature. In another project I work with Airflow to handle orchestration, and our deployment there has a "regular cleanup" CronJob that runs on a configurable schedule to delete old pods.

TTL isn't enabled on all clusters, and it also isn't reliable for orchestration-specific use cases (such as deleting only "successfully completed" pods and leaving "errored" pods around to be inspected, which Airflow does). So I personally think this is both possible to implement (in principle?) and warranted as a feature request.

I also can't add the custom "job_spec_config" recommended in the Dagster docs to solve this in our deployment, since we're running with the CeleryK8sRunLauncher, which still doesn't pass some configs through.

stevenmurphy12 commented 2 months ago

+1. We are running Airbyte as part of our data stack, and it has a periodic process to clean up pods. With Dagster, by contrast, the pods linger until (presumably) the cluster decides to clean them up (in the region of 24 hours).