delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
Apache License 2.0

Getting "error sending request for url" AzureError when writing very large deltatable to Azure Gen 2 #2639

Open Josh-Hiz opened 3 days ago

Josh-Hiz commented 3 days ago


Delta-rs version: This happens on both 0.18.1 and 0.16.1; I haven't tested any other versions.

Environment: Python 3.11


What happened:

When writing an extremely large Delta table (30,000 partition folders in total) to Azure Gen 2, I keep getting the following error:

OSError: Generic MicrosoftAzure error: Error after 0 retries in 30.0027383s, max_retries:10, retry_timeout:180s, source:error sending request for url (url_here): operation timed out

This error happens regardless of the engine (I tested both Rust and PyArrow) and regardless of the deltalake version (I tried 0.18.1 and 0.16.1). I run the following call after creating an extremely large DataFrame via pd.concat:

                data=data, # Extremely large pandas dataframe

My Delta table contains millions of rows, but that should not be a problem to write, so I am not sure why I am getting this error at all. Writing very small tables (thousands of rows) works fine. What could be the cause, and what is the solution?

Everything else works, including the concatenation; it only errors when I try to write the DataFrame.

What you expected to happen:

For the write to be successful regardless of how long it takes.

How to reproduce it:

You will most likely need an extremely large Delta table with millions of rows (GBs of data) and then try writing it to Azure Gen 2.

Note that the error message above is from the PyArrow engine. The Rust engine's error is similar, except that it performs 10 retries; I don't know why Azure won't even retry in the PyArrow case. I am using the abfss scheme in my URL when writing to Azure Gen 2.

Josh-Hiz commented 3 days ago

After further investigation, whether this error occurs actually depends on what I partition the table by. Why is that? One partition scheme I chose produced 267 partitions in total; the next one produced over 30k. Why does my choice of partition_by affect this? The write should be error-free regardless of the number of partitions or how long I have to wait.
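For anyone trying to reproduce this, one way to see how many partition folders a given partition_by choice would produce is to count the distinct value combinations before writing. This is only a minimal sketch using the standard library; the column names and sample rows are made up for illustration and are not from the actual table:

```python
from typing import Any, Iterable, Mapping, Sequence


def count_partitions(
    rows: Iterable[Mapping[str, Any]], partition_by: Sequence[str]
) -> int:
    """Count distinct combinations of the partition column values.

    In a Hive-style partitioned table, each distinct combination
    corresponds to one partition folder.
    """
    return len({tuple(row[col] for col in partition_by) for row in rows})


# Hypothetical sample data; "region" and "event_id" are illustrative names.
rows = [
    {"region": "eu", "event_id": 1},
    {"region": "eu", "event_id": 2},
    {"region": "us", "event_id": 3},
    {"region": "us", "event_id": 4},
]

print(count_partitions(rows, ["region"]))    # low-cardinality column
print(count_partitions(rows, ["event_id"]))  # high-cardinality column
```

With a pandas DataFrame, `len(df[partition_cols].drop_duplicates())` should give the same kind of count, where `partition_cols` is whatever list is passed as partition_by.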

ion-elgreco commented 3 days ago

You can pass in storage_options, {"timeout": "120s"}
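A rough sketch of how that might look in a write call. The helper names `build_storage_options` and `write_table` are hypothetical, and the URI, data, and partition columns are placeholders for whatever the real call uses; only the `timeout` key comes from the suggestion above:

```python
from typing import Any, Dict, List, Optional


def build_storage_options(
    timeout: str = "120s", extra: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return a storage_options dict with the request timeout set.

    `extra` is a hypothetical hook for any other options (for example
    credentials) that the real call already passes.
    """
    options = {"timeout": timeout}
    options.update(extra or {})
    return options


def write_table(
    uri: str, data: Any, partition_by: List[str], storage_options: Dict[str, str]
) -> None:
    # Imported lazily so the helper above works without deltalake installed.
    from deltalake import write_deltalake

    write_deltalake(
        uri,
        data,
        partition_by=partition_by,
        storage_options=storage_options,
    )


options = build_storage_options("120s")
print(options)
```

The roughly 30 s elapsed time in the reported error suggests a per-request timeout was hit, so a larger value here gives each request more time; whether it is enough for a write with tens of thousands of partitions is a separate question.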

Josh-Hiz commented 3 days ago

> You can pass in storage_options, {"timeout": "120s"}

The error still persists depending on the partition chosen.

Josh-Hiz commented 2 days ago

@ion-elgreco After looking further into the data, I do not believe the data itself has anything to do with this error. The number of partitions might be the issue, though. Can deltalake even handle 30k+ partitions? If so, is it possible that write_deltalake is trying to write to a partition before the partition folder has even been created?

Josh-Hiz commented 2 days ago

@ion-elgreco Would it be an issue if I partition by a timestamp column?
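For context on why this matters: partitioning on a raw timestamp tends to produce close to one partition per distinct timestamp value, which is one way a table can end up with tens of thousands of folders. A common alternative is to derive a coarser column, such as the calendar date, and partition on that. A minimal sketch with the standard library; the sample timestamps and the derived date column are illustrative assumptions, not taken from the actual data:

```python
from datetime import datetime, timedelta

# Hypothetical data: one event per minute over two full days.
start = datetime(2024, 1, 1)
timestamps = [start + timedelta(minutes=i) for i in range(2 * 24 * 60)]

# Partitioning on the raw timestamp: one partition per distinct value.
raw_partitions = set(timestamps)

# Partitioning on a derived date column: one partition per calendar day.
date_partitions = {ts.date() for ts in timestamps}

print(len(raw_partitions))   # 2880
print(len(date_partitions))  # 2
```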

Josh-Hiz commented 1 day ago

Another error that frequently occurs when operating on large data with Azure:

ValueError: Incorrect array length for StructArray field "column_name", expected 40000 got 39999