delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Max retry exceeded when using DeltaTable with Azure Blob Storage #2669

Open erickfigueiredo opened 4 months ago

erickfigueiredo commented 4 months ago

Environment

Delta-rs version: 0.16.0

Environment:


Bug

What happened: I'm running into an issue when using the deltalake library to save data to or load data from Azure Blob Storage. Intermittently, I get the following error:

DatasetError: Failed while saving data to data set CustomDeltaTableDataset(file_example).
Failed to parse parquet: Parquet error: AsyncChunkReader::get_bytes error:
Generic MicrosoftAzure error: Error after 10 retries in 2.196683949s, max_retries:10, 
retry_timeout:180s, source:error sending request for url 
(https://<address>/file.parquet):
 error trying to connect: dns error: failed to lookup address information: Name or service not known

What you expected to happen: I expected to load the data from the Delta table and convert it to a Pandas DataFrame without any errors.

How to reproduce it:

from deltalake import DeltaTable

# Placeholders in angle brackets must be replaced with real values
datalake_info = {
    'account_name': '<account>',
    'client_id': '<cli_id>',
    'tenant_id': '<tenant_id>',
    'client_secret': '<secret>',
    'timeout': '100000s'
}

# Load data from the delta table
dt = DeltaTable("abfs://<azure_address>", storage_options=datalake_info)

More details: I was looking for a parameter like max_retries but couldn't find anything related. Does anyone know a solution or workaround for this issue? I didn't find an approach in the docs: https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html
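Since the failure is intermittent (a DNS lookup that sometimes fails), one possible stopgap is to retry at the application level. The following is only a minimal sketch: `with_retries` is a hypothetical helper, not part of deltalake or object_store, and the attempt count and delays are arbitrary assumptions.

```python
import time


def with_retries(action, attempts=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call action() until it succeeds, backing off exponentially between failures.

    Hypothetical helper for illustration; not part of deltalake or object_store.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except retry_on:
            # Out of attempts: re-raise the last error unchanged
            if attempt == attempts - 1:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... before retrying
            sleep(base_delay * 2 ** attempt)
```

With the reproduction above, usage might look like dt = with_retries(lambda: DeltaTable("abfs://<azure_address>", storage_options=datalake_info)). Because the error in the traceback is raised while reading Parquet bytes, the call that actually reads the data (for example the conversion to Pandas) would probably need the same wrapping.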

erickfigueiredo commented 4 months ago

Has anyone ever faced this problem?

martindut commented 4 months ago

I'm also getting this error lately: Generic MicrosoftAzure error: Error after 10 retries in 3.296507315s, max_retries:10, retry_timeout:180s, source:HTTP status server error (503 Service Unavailable) for url (https://onelake.blob.fabric.microsoft.com/xxxxxxxxxxxxxx/Tables/dddddddddddd/_delta_log/_last_checkpoint).

The path I am using is: abfss://<ws-name>@onelake.dfs.fabric.microsoft.com/<lakehousename>.Lakehouse/Tables/

djouallah commented 2 months ago

This should work with version 0.18.2 and above:

from azure.identity import ClientSecretCredential
from deltalake import DeltaTable

# Authenticate as a service principal and request a storage-scoped token
credential = ClientSecretCredential(
    client_id="appId",
    client_secret="secret",
    tenant_id="tenantId",
)
access_token = credential.get_token("https://storage.azure.com/.default").token

# Pass the bearer token and enable the Fabric (OneLake) endpoint
storage_options = {"bearer_token": access_token, "use_fabric_endpoint": "true"}
scada = DeltaTable(
    "abfss://workspace@onelake.dfs.fabric.microsoft.com/Lakehousename.Lakehouse/Tables/xxxxx",
    storage_options=storage_options,
)