Open shazamkash opened 1 year ago
@shazamkash - Thanks for reporting this!
From the response you showed, it seems like we are running into some sort of throttling on the storage side, though I'm not quite sure why. Could you see what happens if you configure the pyarrow S3 filesystem and pass that to to_pyarrow_dataset? See https://delta-io.github.io/delta-rs/python/usage.html#custom-storage-backends.
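For reference, a minimal untested sketch of that pattern (credentials, endpoint, and paths are placeholders; the SubTreeFileSystem wrapper follows the linked docs, so the relative file names in the Delta log resolve against the table root):

from pyarrow import fs
import deltalake as dl

# Hypothetical placeholders for the table URI and endpoint.
table_uri = "s3://my-bucket/path/to/table"
dt = dl.DeltaTable(table_uri)  # storage_options for the Delta log omitted here

# Configure the pyarrow S3 filesystem explicitly; a non-AWS store
# needs endpoint_override.
s3 = fs.S3FileSystem(
    access_key="...",
    secret_key="...",
    endpoint_override="https://my-endpoint.example.net",
)
# Scope the filesystem to the table root, since the paths stored in
# the Delta log are relative to it.
filesystem = fs.SubTreeFileSystem("my-bucket/path/to/table", s3)
dataset = dt.to_pyarrow_dataset(filesystem=filesystem)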
@roeap
I tried what you suggested; please find the code and errors below.
Code:
from pyarrow import fs
import deltalake as dl

storage_options = {
    "AWS_ACCESS_KEY_ID": f"{credentials.access_key}",
    "AWS_SECRET_ACCESS_KEY": f"{credentials.secret_key}",
    "AWS_ENDPOINT_URL": "https://xxx.yyy.zzz.net",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
}

table_uri = "s3a://delta-lake-bronze/xxx/yyy/data_3_gb"
dt = dl.DeltaTable(table_uri=table_uri, storage_options=storage_options)

s3 = fs.S3FileSystem(
    access_key=f"{credentials.access_key}",
    secret_key=f"{credentials.secret_key}",
    endpoint_override="https://xxx.yyy.zzz.net",
)

# Succeeds, but dataset.to_table() fails (traceback below)
dataset = dt.to_pyarrow_dataset(filesystem=s3)
# Fails with the same error
df = dt.to_pandas(filesystem=s3)
Error from dataset.to_table():
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
Cell In[28], line 2
1 dataset = dt.to_pyarrow_dataset(filesystem=s3)
----> 2 dataset.to_table()
File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()
File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()
File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()
OSError: Not a regular file: '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet'
Error from dt.to_pandas():
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
Cell In[26], line 1
----> 1 dt.to_pandas(filesystem=s3)
File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:418, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
404 def to_pandas(
405 self,
406 partitions: Optional[List[Tuple[str, str, Any]]] = None,
407 columns: Optional[List[str]] = None,
408 filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
409 ) -> "pandas.DataFrame":
410 """
411 Build a pandas dataframe using data from the DeltaTable.
412
(...)
416 :return: a pandas dataframe
417 """
--> 418 return self.to_pyarrow_table(
419 partitions=partitions, columns=columns, filesystem=filesystem
420 ).to_pandas()
File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:400, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
386 def to_pyarrow_table(
387 self,
388 partitions: Optional[List[Tuple[str, str, Any]]] = None,
389 columns: Optional[List[str]] = None,
390 filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
391 ) -> pyarrow.Table:
392 """
393 Build a PyArrow Table using data from the DeltaTable.
394
(...)
398 :return: the PyArrow table
399 """
--> 400 return self.to_pyarrow_dataset(
401 partitions=partitions, filesystem=filesystem
402 ).to_table(columns=columns)
File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()
File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()
File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()
OSError: Not a regular file: '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet'
Here is the list of files, which I can get by running the following code (and this works as well):
dataset = dt.to_pyarrow_dataset(filesystem=s3)
dataset.files
List of files:
['0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet',
'0-ccc89437-58a8-44a4-aad2-17ffce7dd929-1.parquet',
'0-ccc89437-58a8-44a4-aad2-17ffce7dd929-3.parquet',
'0-ccc89437-58a8-44a4-aad2-17ffce7dd929-2.parquet']
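Note that these file names are relative to the table root. A raw S3FileSystem resolves them against the bucket root, which would explain the "Not a regular file" error above. A possible fix (an untested sketch reusing the s3 filesystem from the code above; the base path is the bucket plus table prefix from the table URI):

# Scope the pyarrow filesystem to the table root so the relative
# file names from the Delta log resolve to real objects.
table_fs = fs.SubTreeFileSystem("delta-lake-bronze/xxx/yyy/data_3_gb", s3)
dataset = dt.to_pyarrow_dataset(filesystem=table_fs)
table = dataset.to_table()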
@roeap
Another thing I noticed: this only happens with data that is "big", from a few hundred MB to a few GB, and split across multiple parquet files. I can read tables that are very small, a few tens of MB saved in a single file.
Any help would be appreciated, because I have read the same data before with an older version of delta-rs and it worked fine back then. Unfortunately, I no longer remember the exact delta-rs version.
Also, here is the full error that I was able to capture now:
---------------------------------------------------------------------------
PyDeltaTableError Traceback (most recent call last)
Cell In[6], line 1
----> 1 dt.to_pandas()
File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:418, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
404 def to_pandas(
405 self,
406 partitions: Optional[List[Tuple[str, str, Any]]] = None,
407 columns: Optional[List[str]] = None,
408 filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
409 ) -> "pandas.DataFrame":
410 """
411 Build a pandas dataframe using data from the DeltaTable.
412
(...)
416 :return: a pandas dataframe
417 """
--> 418 return self.to_pyarrow_table(
419 partitions=partitions, columns=columns, filesystem=filesystem
420 ).to_pandas()
File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:400, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
386 def to_pyarrow_table(
387 self,
388 partitions: Optional[List[Tuple[str, str, Any]]] = None,
389 columns: Optional[List[str]] = None,
390 filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
391 ) -> pyarrow.Table:
392 """
393 Build a PyArrow Table using data from the DeltaTable.
394
(...)
398 :return: the PyArrow table
399 """
--> 400 return self.to_pyarrow_dataset(
401 partitions=partitions, filesystem=filesystem
402 ).to_table(columns=columns)
File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()
File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()
File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
PyDeltaTableError: Generic S3 error: Error performing get request xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet: response error "<html><body><h1>429 Too Many Requests</h1>
You have sent too many requests in a given amount of time.
</body></html>
", after 0 retries: HTTP status client error (429 Too Many Requests) for url (https://xxx.yyy.zzz.net/delta-lake-bronze/xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet)
Can I take it?
@tsafacjo - certainly :)
Environment
Delta-rs version: 0.8.1
Binding: Python
Environment: Docker container; Python 3.10.7; OS: Debian GNU/Linux 11 (bullseye); S3: non-AWS (Ceph based)
Bug
What happened: When reading a delta table, the table itself is found and loaded fine. But converting that table to pandas, or converting the pyarrow dataset to a table, fails with the same error below.
I have tried reading the same table with PySpark and it works fine. The parquet data is about 1 GB compressed and 3 GB uncompressed in size. Furthermore, the table was written to the delta lake using the same delta-rs version.
Error:
How to reproduce it: My Code:
More details: I am not sure if this information helps, but I get the same error when reading with Polars.