ian-scale closed 2 weeks ago
In particular, I think the Helm chart might not be pulling in the retry_on_asset_or_op_failure parameter: https://github.com/dagster-io/dagster/blob/deed3d4b6954b9b59b89416fb3eb1008d950c962/helm/dagster/templates/configmap-instance.yaml#L105
We're also affected by this issue.
The setting described in the docs (https://docs.dagster.io/deployment/run-retries#combining-op-and-run-retries), specifically the retry_on_asset_or_op_failure key, has no effect in the Helm chart. Whether or not it is set, the deployment configuration does not change.
@garethbrickman Do you think this could be prioritized / we could get a potential timeline on the fix?
Hi @ian-scale - we'll add this to the Helm chart in a more directly supported way. In the meantime, I would expect this workaround to let you apply the setting. It uses a field called additionalInstanceConfig, which lets you apply arbitrary dagster.yaml configuration in the Helm chart. Setting runRetries to enabled: false under dagsterDaemon prevents retries from being enabled in the usual way, but including run_retries in additionalInstanceConfig ensures that it is set after all and lets you use this additional field:
dagsterDaemon:
  runRetries:
    enabled: false
additionalInstanceConfig:
  run_retries:
    enabled: true
    max_retries: 3
    retry_on_asset_or_op_failure: false
https://github.com/dagster-io/dagster/pull/24028 will add this to the Helm chart in a more supported way.
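To make the intent of the setting concrete, here is a minimal sketch of the decision the configuration above is meant to express. It is a simplified model for illustration, not Dagster's implementation: the names `FailureReason`, `RunRetrySettings`, and `should_retry` are hypothetical, and only the three configuration keys come from the thread.

```python
from dataclasses import dataclass
from enum import Enum


class FailureReason(Enum):
    """Hypothetical, simplified failure categories for this sketch."""

    STEP_FAILURE = "step_failure"  # an op or asset raised an exception
    OTHER = "other"  # anything else, e.g. the run worker crashed


@dataclass(frozen=True)
class RunRetrySettings:
    """Mirrors the three run_retries keys used in the workaround."""

    enabled: bool
    max_retries: int
    retry_on_asset_or_op_failure: bool


def should_retry(
    settings: RunRetrySettings, reason: FailureReason, retries_so_far: int
) -> bool:
    """Return True if a failed run would be retried under this model."""
    if not settings.enabled:
        return False
    if retries_so_far >= settings.max_retries:
        return False
    # With retry_on_asset_or_op_failure disabled, failures that come from
    # the op or asset itself are not retried; other failures still are.
    if (
        reason is FailureReason.STEP_FAILURE
        and not settings.retry_on_asset_or_op_failure
    ):
        return False
    return True


# Values taken from the workaround above.
workaround = RunRetrySettings(
    enabled=True, max_retries=3, retry_on_asset_or_op_failure=False
)
```

Under this model, `should_retry(workaround, FailureReason.STEP_FAILURE, 0)` is false, while a failure of any other kind is still retried until `max_retries` is reached.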
@gibsondan I've tried your workaround and the config now shows the correct values, but even with retry_on_asset_or_op_failure: false, my run is retried anyway.
This is the asset I'm using for testing:
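import logging

from dagster import asset


# PARTITION_DEF is defined elsewhere in the code location.
@asset(
    group_name="main",
    partitions_def=PARTITION_DEF,
)
def hello_world():
    logging.info("hello_world")
    raise RuntimeError("Testing retry...")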
Any idea why the run is retried even though the failure comes from the asset?
@gasgallo The main thing I would check is that you're running Dagster version 1.6.7 or later (both the version that your code is using and the version that your daemon/Helm chart is using), as per the note at the bottom of the docs: https://docs.dagster.io/deployment/run-retries#combining-op-and-run-retries
Full Helm chart support will be added in the 1.8.5 release next week.
Ah thank you, my webserver was already up to date, but my code location wasn't. It works fine now!
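The version mismatch described above can be caught with a small check like the following sketch. It assumes you have already collected the Dagster version string of each component (for example from `dagster.__version__` inside each image); the component names and the helper names `parse_version` and `outdated_components` are hypothetical. Only the 1.6.7 minimum comes from the thread.

```python
from __future__ import annotations

import re

# Minimum version mentioned in the thread for this setting to take effect.
MIN_VERSION = (1, 6, 7)


def parse_version(version: str) -> tuple[int, ...]:
    """Parse a plain release version such as '1.7.9' into a tuple of ints."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def outdated_components(versions: dict[str, str]) -> list[str]:
    """Return the sorted names of components older than MIN_VERSION."""
    return sorted(
        name
        for name, version in versions.items()
        if parse_version(version) < MIN_VERSION
    )


# Hypothetical deployment: the webserver and daemon are current,
# but the code location image still ships an older release.
deployment = {
    "webserver": "1.7.9",
    "daemon": "1.7.9",
    "code_location": "1.6.3",
}
stale = outdated_components(deployment)
```

Here `stale` contains only `"code_location"`, which matches the situation resolved in this thread: every component, including the user code, needs to be at or above the minimum version.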
Dagster version
1.7.9
What's the issue?
I am running Dagster open source. In my deployment YAML I have
but this is not working correctly in my k8s deployment. Explicitly adding tags to the job does not work either.
What did you expect to happen?
I'd expect that the job does not retry, but currently it is being retried 3 times once the op fails.
How to reproduce?
Here's a trivially simple example of a job that this doesn't work for:
This shouldn't be retried, but it is. I think the problem is that the tag is not being applied to the job YAML.
Deployment type
Dagster Helm chart
Deployment details
helm on k8s
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.