dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Clean up pods via TTL Seconds After Finished by configuring HELM deployment instead of individual Jobs. #11041

Open dangal95 opened 1 year ago

dangal95 commented 1 year ago

What's the use case?

The Kubernetes Executor creates a new pod every time a job runs. The pod takes a while to get deleted after the job finishes, unless you specify a tag on each job in the following way:

# Illustrative sketch: "my_job" is a placeholder name and TIME_IN_SECONDS is whatever TTL you want.
from dagster import job

@job(
    tags={
        "dagster-k8s/config": {
            "job_spec_config": {
                "ttl_seconds_after_finished": TIME_IN_SECONDS,
            }
        }
    },
)
def my_job():
    ...

The feature I am requesting is a configuration option that is part of the Dagster deployment itself, rather than set on individual jobs, so that all job pods can be cleaned up more quickly.

Ideas of implementation

Add this functionality as a configuration in the values.yaml file used to deploy Dagster on Kubernetes.
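For illustration, a deployment-wide default could look something like the sketch below in the chart's values.yaml (the runK8sConfig / jobSpecConfig key names are just a guess at how it might be exposed, not an existing chart option):

runLauncher:
  type: K8sRunLauncher
  config:
    k8sRunLauncher:
      runK8sConfig:                        # hypothetical section name
        jobSpecConfig:
          ttlSecondsAfterFinished: 3600    # applied to every run's Job by default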

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

rexledesma commented 1 year ago

You can set this outside of Dagster by setting up a mutating admission webhook in your cluster that specifies the TTL of created or finished Kubernetes Jobs: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#ttl-after-finished-controller.
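As a sketch of that approach, a policy engine such as Kyverno can provide the mutating webhook for you (the policy name, namespace, and TTL value below are placeholders, not something Dagster ships):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-ttl-to-dagster-jobs
spec:
  rules:
    - name: set-ttl-seconds-after-finished
      match:
        any:
          - resources:
              kinds:
                - Job
              namespaces:
                - dagster                          # placeholder: the namespace your runs launch into
      mutate:
        patchStrategicMerge:
          spec:
            +(ttlSecondsAfterFinished): 3600       # only added if the Job doesn't already set a TTL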

eminizer commented 3 months ago

Hey, it looks like it's been a while on this, but I found it through a Google search and wanted to add my two cents: I think this would be a useful feature. On another project I work with Airflow for orchestration, and our deployment there has a "regular cleanup" CronJob that runs on a configurable schedule to delete old pods.

TTL isn't enabled on all clusters, and it also isn't flexible enough for orchestration-specific use cases (like only deleting successfully completed pods and leaving errored pods around to be inspected, which Airflow does), so I personally think this is both possible to implement (in principle?) and warranted as a feature request.
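To make the idea concrete, the kind of cleanup CronJob I mean looks roughly like this (a sketch, not something Dagster provides; the namespace, schedule, service account, and the dagster/run-id label selector are assumptions about your setup):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: dagster-pod-cleanup
  namespace: dagster
spec:
  schedule: "0 * * * *"                    # hourly; make this configurable
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-cleanup  # needs RBAC allowing pods list/delete
          restartPolicy: Never
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                # delete only successfully completed run pods; failed pods stay around for inspection
                - kubectl delete pods --field-selector=status.phase==Succeeded -l dagster/run-id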

I also can't add the custom "job_spec_config" tag recommended in the Dagster docs to work around this in our deployment, since we're running the CeleryK8sRunLauncher, which still doesn't pass some of these configs through.

stevenmurphy12 commented 2 months ago

+1. We are running Airbyte as part of our data stack, and it has a periodic process to clean up pods, whereas with Dagster the pods linger until (presumably) the cluster decides to clean them up (somewhere in the region of 24 hours).