delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/

Generic S3 error: Converting table to pandas and pyarrow table fails. #1256

Open · shazamkash opened 1 year ago

shazamkash commented 1 year ago

Environment

Delta-rs version: 0.8.1

Binding: Python

Environment:
- Docker container
- Python: 3.10.7
- OS: Debian GNU/Linux 11 (bullseye)
- S3: Non-AWS (Ceph based)


Bug

What happened: Opening the Delta table works fine and the table exists, but converting it to pandas (or converting the pyarrow dataset to a table) fails with the same error below.

I have tried reading the same table with PySpark and it works fine. The parquet data is about 1 GB compressed and 3 GB uncompressed. The table was also written to the lake with this same delta-rs version.
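For comparison, the working PySpark read is roughly the following (a sketch; the Spark session setup and the S3A endpoint/credential configuration for the Ceph gateway are omitted):

# Sketch of the working PySpark read; Spark session setup and S3A
# endpoint/credential configuration for the Ceph gateway are omitted.
df = spark.read.format("delta").load("s3a://delta-lake-bronze/xxx/yyy/data_3_gb")
df.toPandas()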

Error:

---------------------------------------------------------------------------
PyDeltaTableError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 dt.to_pandas()

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:418, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
    404 def to_pandas(
    405     self,
    406     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    407     columns: Optional[List[str]] = None,
    408     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    409 ) -> "pandas.DataFrame":
    410     """
    411     Build a pandas dataframe using data from the DeltaTable.
    412 
   (...)
    416     :return: a pandas dataframe
    417     """
--> 418     return self.to_pyarrow_table(
    419         partitions=partitions, columns=columns, filesystem=filesystem
    420     ).to_pandas()

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:400, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    386 def to_pyarrow_table(
    387     self,
    388     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    389     columns: Optional[List[str]] = None,
    390     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    391 ) -> pyarrow.Table:
    392     """
    393     Build a PyArrow Table using data from the DeltaTable.
    394 
   (...)
    398     :return: the PyArrow table
    399     """
--> 400     return self.to_pyarrow_dataset(
    401         partitions=partitions, filesystem=filesystem
    402     ).to_table(columns=columns)

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

PyDeltaTableError: Generic S3 error: Error performing get request xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet: response error "<html><body><h1>429 Too Many Requests</h1>
You have sent too many requests in a given amount of time.
</body></html>
", after 0 retries: HTTP status client error (429 Too Many Requests) for url (https://xxx.yyy.zzz.net/delta-lake-bronze/xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet)

How to reproduce it:

import deltalake as dl

storage_options = {"AWS_ACCESS_KEY_ID": credentials.access_key,
                   "AWS_SECRET_ACCESS_KEY": credentials.secret_key,
                   "AWS_ENDPOINT_URL": "https://xxx.yyy.zzz.net",
                   "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
                  }

table_uri = "s3a://delta-lake-bronze/xxx/yyy/data_3_gb"
dt = dl.DeltaTable(table_uri=table_uri, storage_options=storage_options)

# Converting to pandas fails
dt.to_pandas()

# Converting the pyarrow dataset to a table fails as well
dataset = dt.to_pyarrow_dataset()
dataset.to_table()

More details: I am not sure if this helps, but I get the same error when reading with Polars.
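For reference, the Polars read is roughly the following (a sketch, assuming the same table_uri and storage_options as above; Polars reads Delta through delta-rs, which would explain the identical error):

import polars as pl

# Polars delegates Delta reads to delta-rs, so the same storage options
# apply and the same 429 surfaces here.
df = pl.read_delta(table_uri, storage_options=storage_options)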

roeap commented 1 year ago

@shazamkash - Thanks for reporting this!

From the response you showed, it seems like we are running into some sort of throttling on the storage side, though I'm not quite sure why. Could you see what happens if you configure the pyarrow S3 filesystem and pass that to to_pyarrow_dataset? https://delta-io.github.io/delta-rs/python/usage.html#custom-storage-backends.
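Roughly what I have in mind, following the linked docs (a sketch; credentials and endpoint are placeholders, and the SubTreeFileSystem root assumes your bucket/prefix layout):

from pyarrow import fs
from deltalake import DeltaTable

# Placeholder credentials and endpoint.
s3 = fs.S3FileSystem(access_key="...",
                     secret_key="...",
                     endpoint_override="https://xxx.yyy.zzz.net")

# The docs root the pyarrow filesystem at the table path with a
# SubTreeFileSystem, since the file paths delta-rs hands to pyarrow
# are relative to the table root.
filesystem = fs.SubTreeFileSystem("delta-lake-bronze/xxx/yyy/data_3_gb", s3)

# storage_options as in your snippet above.
dt = DeltaTable("s3a://delta-lake-bronze/xxx/yyy/data_3_gb",
                storage_options=storage_options)
table = dt.to_pyarrow_dataset(filesystem=filesystem).to_table()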

shazamkash commented 1 year ago

@roeap

I tried what you suggested; the code and errors are below.

Code:

from pyarrow import fs
import deltalake as dl

storage_options = {"AWS_ACCESS_KEY_ID": credentials.access_key,
                   "AWS_SECRET_ACCESS_KEY": credentials.secret_key,
                   "AWS_ENDPOINT_URL": "https://xxx.yyy.zzz.net",
                   "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
                  }

table_uri = "s3a://delta-lake-bronze/xxx/yyy/data_3_gb"
dt = dl.DeltaTable(table_uri=table_uri, storage_options=storage_options)

s3 = fs.S3FileSystem(access_key=credentials.access_key,
                     secret_key=credentials.secret_key,
                     endpoint_override="https://xxx.yyy.zzz.net")

# Creating the dataset works, but materializing it fails
dataset = dt.to_pyarrow_dataset(filesystem=s3)
dataset.to_table()

# Fails as well
df = dt.to_pandas(filesystem=s3)

Error from dataset.to_table():

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[28], line 2
      1 dataset = dt.to_pyarrow_dataset(filesystem=s3)
----> 2 dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()

OSError: Not a regular file: '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet'

Error from dt.to_pandas():

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[26], line 1
----> 1 dt.to_pandas(filesystem=s3)

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:418, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
    404 def to_pandas(
    405     self,
    406     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    407     columns: Optional[List[str]] = None,
    408     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    409 ) -> "pandas.DataFrame":
    410     """
    411     Build a pandas dataframe using data from the DeltaTable.
    412 
   (...)
    416     :return: a pandas dataframe
    417     """
--> 418     return self.to_pyarrow_table(
    419         partitions=partitions, columns=columns, filesystem=filesystem
    420     ).to_pandas()

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:400, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    386 def to_pyarrow_table(
    387     self,
    388     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    389     columns: Optional[List[str]] = None,
    390     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    391 ) -> pyarrow.Table:
    392     """
    393     Build a PyArrow Table using data from the DeltaTable.
    394 
   (...)
    398     :return: the PyArrow table
    399     """
--> 400     return self.to_pyarrow_dataset(
    401         partitions=partitions, filesystem=filesystem
    402     ).to_table(columns=columns)

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()

OSError: Not a regular file: '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet'

Listing the dataset's files with the following code does work, however:

dataset = dt.to_pyarrow_dataset(filesystem=s3)
dataset.files

List of files:

['0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet',
 '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-1.parquet',
 '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-3.parquet',
 '0-ccc89437-58a8-44a4-aad2-17ffce7dd929-2.parquet']
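Worth noting: those entries are bare file names relative to the table root, which lines up with the OSError: Not a regular file above, since a raw S3FileSystem resolves them against its own root. Presumably, wrapping it in a SubTreeFileSystem rooted at the table prefix, as in the docs linked above, would let them resolve:

from pyarrow import fs

# Presumed fix: root the filesystem at the table prefix so that the
# relative file names in dataset.files resolve against the table root.
subtree = fs.SubTreeFileSystem("delta-lake-bronze/xxx/yyy/data_3_gb", s3)
dataset = dt.to_pyarrow_dataset(filesystem=subtree)
dataset.to_table()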

shazamkash commented 1 year ago

@roeap

Another thing I noticed: this only happens with data that is "big" (a few hundred MB to a few GB) and split into multiple parquet files. Tables that are very small (a few tens of MB, saved as a single file) read fine.

Any help would be appreciated. I have read the same data with an older version of delta-rs before and it worked fine back then; unfortunately I no longer remember the exact version.

Also, here is the full error I was able to capture now:

---------------------------------------------------------------------------
PyDeltaTableError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 dt.to_pandas()

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:418, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
    404 def to_pandas(
    405     self,
    406     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    407     columns: Optional[List[str]] = None,
    408     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    409 ) -> "pandas.DataFrame":
    410     """
    411     Build a pandas dataframe using data from the DeltaTable.
    412 
   (...)
    416     :return: a pandas dataframe
    417     """
--> 418     return self.to_pyarrow_table(
    419         partitions=partitions, columns=columns, filesystem=filesystem
    420     ).to_pandas()

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:400, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    386 def to_pyarrow_table(
    387     self,
    388     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    389     columns: Optional[List[str]] = None,
    390     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    391 ) -> pyarrow.Table:
    392     """
    393     Build a PyArrow Table using data from the DeltaTable.
    394 
   (...)
    398     :return: the PyArrow table
    399     """
--> 400     return self.to_pyarrow_dataset(
    401         partitions=partitions, filesystem=filesystem
    402     ).to_table(columns=columns)

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

PyDeltaTableError: Generic S3 error: Error performing get request xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet: response error "<html><body><h1>429 Too Many Requests</h1>
You have sent too many requests in a given amount of time.
</body></html>
", after 0 retries: HTTP status client error (429 Too Many Requests) for url (https://xxx.yyy.zzz.net/delta-lake-bronze/xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet)

tsafacjo commented 1 year ago

Can I take it?

roeap commented 1 year ago

@tsafacjo - certainly :)