dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
11.14k stars 1.4k forks source link

retry_on_asset_or_op_failure = false is not working in k8s deployment #23687

Closed ian-scale closed 2 weeks ago

ian-scale commented 4 weeks ago

Dagster version

1.7.9

What's the issue?

I am running dagster open source, in my deployment yaml I have

dagsterDaemon:
  # https://docs.dagster.io/deployment/run-retries#combining-op-and-run-retries
  runRetries:
    enabled: true
    maxRetries: 3
    retry_on_asset_or_op_failure: false

but this is not working correctly in my k8s deployment. Trying to explicitly add tags to the job is also not working.

What did you expect to happen?

I'd expect that the job does not retry, but currently it is being retried 3 times once the op fails.

How to reproduce?

Here's a trivially simple example of a job that this doesn't work for:

@op
def fails():
  raise Failure()

@job
def fail_job():
  fails()

this shouldn't be retried, but it is being retried. I think the problem is that the tag is not being applied to the job yaml.

Deployment type

Dagster Helm chart

Deployment details

helm on k8s

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a šŸ‘! We factor engagement into prioritization.

edgan8 commented 4 weeks ago

In particular, I think the helm chart might not be pulling in the retry on asset or op failure param https://github.com/dagster-io/dagster/blob/deed3d4b6954b9b59b89416fb3eb1008d950c962/helm/dagster/templates/configmap-instance.yaml#L105

gasgallo commented 4 weeks ago

We're also affected by this issue.

The docs https://docs.dagster.io/deployment/run-retries#combining-op-and-run-retries and, specifically, the retry_on_asset_or_op_failure key in the Helm chart doesn't have any effect. Setting it or not, it doesn't affect the deployment configuration.

ian-scale commented 2 weeks ago

@garethbrickman Do you think this could be prioritized / we could get a potential timeline on the fix?

gibsondan commented 2 weeks ago

Hi @ian-scale - we'll add this to the Helm chart in a more directly supported way. In the meantime I would expect this workaround to allow you to apply the setting (by using a field called additionalInstanceConfig that lets you apply arbitrary dagster.yaml configuration in the Helm chart - setting it to enabled: false under dagsterDaemon prevents retries from being enabled in the usual way, but including it in additionalInstanceConfig ensures that it will be set after all and allows you to use this additional field

dagsterDaemon:
  runRetries:
    enabled: false

additionalInstanceConfig:
  run_retries:
    enabled: true
    max_retries: 3
    retry_on_asset_or_op_failure: false    
gibsondan commented 2 weeks ago

https://github.com/dagster-io/dagster/pull/24028 will add this to the Helm chart in a more supported way.

gasgallo commented 2 weeks ago

@gibsondan I've tried your workaround and the config now shows the correct values, but, even though retry_on_asset_or_op_failure: false, my run is retried anyways.

This is the asset I'm using for testing:

@asset(
    group_name="main",
    partitions_def=PARTITION_DEF,
)
def hello_world():
    logging.info("hello_world")
    raise RuntimeError("Testing retry...")

Any idea why the run is retried even though the failure comes from the asset?

gibsondan commented 2 weeks ago

@gasgallo the main thing i would check is that you're running dagster version 1.6.7 or later (both the version that your code is using and that your daemon/helm chart is using), as per the note at the bottom of the docs: https://docs.dagster.io/deployment/run-retries#combining-op-and-run-retries

gibsondan commented 2 weeks ago

Full Helm chart support will be added in the 1.8.5 release next week.

gasgallo commented 2 weeks ago

@gasgallo the main thing i would check is that you're running dagster version 1.6.7 or later (both the version that your code is using and that your daemon/helm chart is using), as per the note at the bottom of the docs: https://docs.dagster.io/deployment/run-retries#combining-op-and-run-retries

Ah thank you, my webserver was already up to date, but my code location wasn't. It works fine now!