delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

`to_pyarrow_table()` on a table in S3 kept getting "Generic S3 error: error decoding response body" #2595

Open k-ye opened 2 weeks ago

k-ye commented 2 weeks ago

Environment

Delta-rs version: deltalake==0.18.1

Binding: Python

Environment:


Bug

What happened:

I'm trying to load a table from S3, but I keep getting this error: OSError: Generic S3 error: error decoding response body

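import time

from deltalake import DeltaTable

# table_uri and storage_options are defined earlier in the script (not shown)
table = DeltaTable(table_uri, storage_options=storage_options)
print(f"version: {table.version()}")
print(f"schema: {table.schema()}")
print(table.files())

ts = time.time()
df = table.to_pyarrow_table()

version: 0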
schema: Schema([Field(id, PrimitiveType("string"), nullable=True), Field(path, PrimitiveType("string"), nullable=True)])
['0-e03dac34-16a0-4b6e-82c8-fd1098d1bf45-0.parquet']
Traceback (most recent call last):
  File "test.py", line 32, in <module>
    df = table.to_pyarrow_table()
  File "***/lib/python3.10/site-packages/deltalake/table.py", line 1161, in to_pyarrow_table
    return self.to_pyarrow_dataset(
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OSError: Generic S3 error: error decoding response body

The stack trace shows that the error is actually raised inside pyarrow. I'm not sure whether it is possible to tweak pyarrow's S3 behavior from deltalake.
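
One thing that may be worth trying, as a rough sketch rather than a confirmed fix: if the error comes from a request timing out while the Parquet file body is being streamed, raising the HTTP client timeouts through storage_options might help. This assumes that deltalake forwards the `timeout` and `connect_timeout` keys to the underlying object_store HTTP client. The helper name `with_timeouts` and the default values are hypothetical and chosen only for illustration.

```python
def with_timeouts(storage_options, timeout="120s", connect_timeout="30s"):
    """Return a copy of storage_options with client timeouts filled in.

    Assumption: the "timeout" and "connect_timeout" keys are passed through
    to the object_store HTTP client. Keys the caller already set are kept.
    """
    merged = dict(storage_options)  # copy, so the caller's dict is untouched
    merged.setdefault("timeout", timeout)
    merged.setdefault("connect_timeout", connect_timeout)
    return merged


# Intended usage (needs the deltalake package and S3 access):
#   from deltalake import DeltaTable
#   table = DeltaTable(table_uri, storage_options=with_timeouts(storage_options))
#   df = table.to_pyarrow_table()
```

If the error persists with much larger timeouts, the cause is probably something other than a slow response body.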

What you expected to happen:

I can get the pyarrow table.

How to reproduce it:

More details:

I have verified the integrity of this table with these methods:

  1. Cloning the table locally and loading it from there: to_pyarrow_table() runs fine.
  2. Reading the S3 table with duckdb (and its delta extension): this works fine, too.
k-ye commented 2 weeks ago

Seems related to https://github.com/delta-io/delta-rs/issues/2301 and https://github.com/delta-io/delta-rs/issues/2592