Open Josh-Hiz opened 3 days ago
After investigating further: this error actually depends on the partitioning scheme I choose. Why is that? One scheme I tried produced 267 partitions in total; the next produced over 30,000. Why does my choice of partition_by affect this? The write should be error-free regardless of the number of partitions or how long I need to wait.
You can pass in storage_options: {"timeout": "120s"}
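A minimal sketch of wiring that timeout through to the write, assuming the deltalake Python package; the table URI and credential keys below are placeholders, not values from this thread:

```python
# Sketch of passing a request timeout to the Azure object store via
# storage_options; URI and credential keys are placeholders.
storage_options = {
    "timeout": "120s",  # per-request timeout for the storage backend
    # "azure_storage_account_name": "...",  # plus your usual credentials
}

def write_with_timeout(df, table_uri):
    # assumes `pip install deltalake`; imported lazily so the sketch
    # can be read without the package installed
    from deltalake import write_deltalake
    write_deltalake(table_uri, df, storage_options=storage_options)

print(storage_options["timeout"])
```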
The error still persists depending on the partition scheme chosen.
After looking further into the data, I do not believe the data itself has anything to do with this error @ion-elgreco. However, the number of partitions might be the issue: with 30k+ partitions, can deltalake even handle that? If so, is it possible that write_deltalake is trying to write to a partition before the partition folder has even been created?
@ion-elgreco Would it be an issue if I make partitions from a timestamp column?
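One thing worth noting here: raw timestamps tend to be near-unique, so partitioning on them directly can explode the partition count. A small sketch of deriving a coarser partition key instead (the day granularity and function name are illustrative assumptions, not from this thread):

```python
from datetime import datetime

def to_partition_key(ts: datetime) -> str:
    # Partition by calendar day instead of the full timestamp, so the
    # partition column has bounded cardinality.
    return ts.strftime("%Y-%m-%d")

print(to_partition_key(datetime(2024, 6, 1, 13, 45, 7)))  # 2024-06-01
```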
ValueError: Incorrect array length for StructArray field "column_name", expected 40000 got 39999
This is another error that frequently happens when operating on large data with Azure.
Environment
Delta-rs version: This happens on both 0.18.1 and 0.16.1; I haven't tested any other versions.
Environment: Python 3.11
Bug
What happened:
When writing an extremely large Delta table (30,000 partition folders in total) to Azure Gen 2, I keep getting the following error:
This error happens regardless of engine (I tested both Rust and PyArrow) and regardless of deltalake version (I tried 0.18.1 and 0.16.1). I run the write call after creating an extremely large dataframe via pd.concat. My Delta table contains millions of rows, but that should not be an issue when writing to the Delta lake, so I am not sure why I am getting this error at all. Writing very small tables (thousands of rows) works fine. What could be the cause, and what is the solution?
Everything works, including concatenation, it just errors when I try writing the DF.
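The original call is not shown in the issue, but the described flow (concatenate many frames, then write) looks roughly like this sketch; the URI, column names, partition column, and mode are all assumptions for illustration:

```python
import pandas as pd

# Concatenate many frames into one large dataframe, as described above
# (three tiny frames here stand in for the real, much larger inputs).
frames = [pd.DataFrame({"id": [i], "part": [i % 3]}) for i in range(3)]
df = pd.concat(frames, ignore_index=True)

def do_write(df):
    # assumes `pip install deltalake`; the abfss URI is a placeholder
    from deltalake import write_deltalake
    write_deltalake(
        "abfss://container@account.dfs.core.windows.net/table",
        df,
        mode="append",
        partition_by=["part"],
    )

print(len(df))  # 3
```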
What you expected to happen:
For the write to be successful regardless of how long it takes.
How to reproduce it:
You most likely need an extremely large Delta table of millions of rows (GBs of data) and then attempt a write to Azure Gen 2.
It is important to note that the error message I gave is from PyArrow; Rust is similar, except it performs 10 retries. I don't know why Azure won't even retry. I am using abfss in my URL when writing to Azure Gen 2.
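A hedged repro sketch along these lines: generate a frame whose partition column has ~30k distinct values, then attempt the write. The abfss URI and column names are placeholders, and the row counts here are scaled down from the millions described above:

```python
import pandas as pd

# Build a frame with very high partition-column cardinality (~30k values),
# mirroring the 30k-partition scheme that triggers the error.
n_partitions = 30_000
df = pd.DataFrame({
    "part": [i % n_partitions for i in range(60_000)],
    "value": range(60_000),
})

def reproduce(df):
    # assumes `pip install deltalake`; the abfss URI is a placeholder
    from deltalake import write_deltalake
    write_deltalake(
        "abfss://container@account.dfs.core.windows.net/table",
        df,
        mode="overwrite",
        partition_by=["part"],
    )

print(df["part"].nunique())  # 30000
```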