dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

LeaseLost Error in Azure Data Lake Storage Gen2 IO Manager During Step Output Handling #20403

Open tomas-gajarsky opened 3 months ago

tomas-gajarsky commented 3 months ago

Dagster version

1.4.17

What's the issue?

LeaseLost error when writing outputs to Azure Blob Storage using Dagster Azure ADLS2 IO Manager.

During the execution of a Dagster pipeline, specifically at the step "voucher_type_etl__normalize_boxes_to_new_image_size," we encountered an HttpResponseError: LeaseLost when attempting to write the step output to Azure Blob Storage. This error is unexpected as we have other pipelines successfully writing to Azure Blob Storage without any issues. The error details suggest a lease management problem with the Azure Data Lake Storage Gen2 resources.

What did you expect to happen?

We expected the pipeline's output to be written successfully to Azure Blob Storage, similar to how other pipelines are functioning without encountering the LeaseLost error.

How to reproduce?

Currently, specific steps to reproduce the issue are not available for sharing.

Deployment type

Other

Deployment details

The issue has been observed in both our Dagster Cloud deployment and a standalone deployment using Docker Compose on a virtual machine. We are using Dagster with the dagster_azure plugin to integrate with Azure services.

Additional information

The issue appears to be isolated to a specific pipeline when attempting to write outputs to Azure Blob Storage, resulting in a LeaseLost error from the Azure Data Lake Storage Gen2 service. Other pipelines within the same deployment environment do not exhibit this problem, suggesting a potential issue with how the affected pipeline manages or acquires leases for blob storage resources.

Error log:

dagster._core.errors.DagsterExecutionHandleOutputError: Error occurred while handling output "result" of step "voucher_type_etl__normalize_boxes_to_new_image_size":
...
azure.core.exceptions.HttpResponseError: (LeaseLost) A lease ID was specified, but the lease for the resource has expired.
RequestId:18077b19-c01f-0039-4d14-6e881a000000
Time:2024-03-04T09:18:30.4947974Z
...

mlarose commented 3 months ago

Hi @tomas-gajarsky, thanks for reporting this. Is there any difference between the failing pipeline and the successful ones in how they interact with Azure? Could it be that the successful pipelines finish writing before the 1-minute lease expires, while the failing one does not?

At a glance there is effectively no logic or support for automatic lease renewal before or after expiration.
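For illustration only, automatic renewal could be sketched as a background heartbeat that renews the lease while the write is in progress. This is a minimal sketch, not the IO manager's code: `auto_renewed_lease` and `renew_every` are hypothetical names, and `lease_client` is assumed to be any object with `acquire(lease_duration=...)`, `renew()`, and `release()` methods, which is the shape of the Azure SDK's `DataLakeLeaseClient`.

```python
import threading
from contextlib import contextmanager


@contextmanager
def auto_renewed_lease(lease_client, duration=60, renew_every=None):
    """Hold a lease for the duration of the block, renewing it in the background.

    Hypothetical helper: `lease_client` only needs acquire/renew/release methods.
    """
    # Renew at half the lease duration unless the caller picks an interval.
    interval = renew_every if renew_every is not None else duration / 2
    stop = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout and True once `stop` is set,
        # so the loop renews every `interval` seconds until asked to stop.
        while not stop.wait(interval):
            lease_client.renew()

    lease_client.acquire(lease_duration=duration)
    worker = threading.Thread(target=heartbeat, daemon=True)
    worker.start()
    try:
        yield lease_client
    finally:
        # Stop renewing first, then release, so renew never races with release.
        stop.set()
        worker.join()
        lease_client.release()
```

A long upload would then run inside `with auto_renewed_lease(lease_client): ...`. Note that if `renew()` raises inside the heartbeat thread, the thread exits silently and the lease can still expire, so a real implementation would need to surface that failure.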

Anything else that can help reproduce and diagnose this issue?

tomas-gajarsky commented 3 months ago

Hi @mlarose,

Upon further investigation, I've found that both pipelines interact with Azure in the same manner. The difference leading to the observed issue might be the execution time, which is notably longer when dealing with large assets. This extended execution time makes the current lease duration of 60 seconds potentially insufficient.

I believe that your observation about the absence of logic or support for automatic lease renewal is correct. Currently there does not seem to be a way to adjust the lease duration, for example through an environment variable, because it appears to be a hardcoded global variable in the IO manager. However, the Azure documentation states that the lease duration can be set to any value between 15 and 60 seconds, or to -1 for an infinite lease.

Given this, and considering that the lease is explicitly released in the finally block of the try statement, it might be prudent to explore setting the lease duration to -1 in the io manager by default, or at least make it configurable. This change could potentially mitigate the issues we're encountering with lease expirations during the handling of large assets.
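A rough sketch of what a configurable duration could look like, under stated assumptions: the environment variable name `ADLS2_LEASE_DURATION` and the helpers `resolve_lease_duration` and `held_lease` are hypothetical and do not exist in `dagster_azure`. `lease_client` is assumed to be any object exposing `acquire(lease_duration=...)` and `release()`, as the Azure SDK lease clients do.

```python
import os
from contextlib import contextmanager

INFINITE_LEASE = -1  # Azure's value for a lease that never expires


def resolve_lease_duration(env_var="ADLS2_LEASE_DURATION", default=INFINITE_LEASE):
    """Read the lease duration from a (hypothetical) environment variable.

    Azure accepts -1 (infinite) or a value from 15 to 60 seconds.
    """
    raw = os.environ.get(env_var)
    duration = default if raw is None else int(raw)
    if duration != INFINITE_LEASE and not 15 <= duration <= 60:
        raise ValueError(
            f"lease duration must be -1 or between 15 and 60 seconds, got {duration}"
        )
    return duration


@contextmanager
def held_lease(lease_client, duration):
    """Acquire the lease, and always release it when the block exits."""
    lease_client.acquire(lease_duration=duration)
    try:
        yield lease_client
    finally:
        # With an infinite lease, this release is the only thing that frees
        # the resource, so it must run even when the write fails.
        lease_client.release()
```

One trade-off with defaulting to -1: if the process is killed before the `finally` block runs, the lease stays held until it is broken manually, whereas a 60-second lease would expire on its own.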

Thank you for your guidance and support on this matter.