delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

load_cdf() issue: Generic S3 error: request or response body error: operation timed out #2549

Closed jiaw314 closed 1 month ago

jiaw314 commented 1 month ago

Environment

Delta-rs version: 0.17.4

Binding: Python 3.11.6

Environment: MacBook Pro M1


Bug

What happened: The load_cdf() method works for nearly all of our Delta tables on AWS S3, but it panics with the following error on a few of them:

thread '<unnamed>' panicked at python/src/lib.rs:611:18:
called Result::unwrap() on an Err value: ArrowError(ExternalError(General("ParquetObjectReader::get_byte_ranges error: Generic S3 error: request or response body error: operation timed out")), None)
stack backtrace:
   0: 0x3028a50e4 - _BrotliDecoderVersion
   1: 0x3028c8e50 - _BrotliDecoderVersion
   2: 0x3028a1ee0 - _BrotliDecoderVersion
   3: 0x3028a4f18 - _BrotliDecoderVersion
   4: 0x3028a66bc - _BrotliDecoderVersion
   5: 0x3028a6404 - _BrotliDecoderVersion
   6: 0x3028a6af8 - _BrotliDecoderVersion
   7: 0x3028a69ec - _BrotliDecoderVersion
   8: 0x3028a5568 - _BrotliDecoderVersion
   9: 0x3028a6774 - _BrotliDecoderVersion
  10: 0x30299fb60 - _BrotliDecoderVersion
  11: 0x30299ff14 - _BrotliDecoderVersion
  12: 0x3001f9998 - _PyInitinternal
  13: 0x30012bc1c - <unknown>
  14: 0x3001341f4 - <unknown>
  15: 0x300113ce0 - <unknown>
  16: 0x30012e7d4 - <unknown>
  17: 0x101237f1c - _method_vectorcall_VARARGS_KEYWORDS
  18: 0x101303d5c - PyEval_EvalFrameDefault
  19: 0x1012f9444 - _PyEval_EvalCode
  20: 0x10134ea18 - _run_eval_code_obj
  21: 0x10134e97c - _run_mod
  22: 0x10134e7bc - _pyrun_file
  23: 0x10134e20c - __PyRun_SimpleFileObject
  24: 0x10134db9c - __PyRun_AnyFileObject
  25: 0x101369f70 - _pymain_run_file_obj
  26: 0x1013698b0 - _pymain_run_file
  27: 0x101369190 - _Py_RunMain
  28: 0x10136a2c8 - _Py_BytesMain
Traceback (most recent call last):
  File "/Users/jiawang/Desktop/Environments/deltars_test/backfill&continuous_batch_pandas_catalog_v2.py", line 127, in <module>
    dt.load_cdf(starting_version=delta_max_version).read_all()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jiawang/Desktop/Environments/deltars_test/lib/python3.11/site-packages/deltalake/table.py", line 694, in load_cdf
    return self._table.load_cdf(
           ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called Result::unwrap() on an Err value: ArrowError(ExternalError(General("ParquetObjectReader::get_byte_ranges error: Generic S3 error: request or response body error: operation timed out")), None)

What you expected to happen: I expect to get the change data feed for the latest version of the delta table when I call load_cdf().

How to reproduce it: Call load_cdf() on a very large Delta table?

More details:

ion-elgreco commented 1 month ago

You can increase the timeout; see https://github.com/delta-io/delta-rs/issues/2537#issuecomment-2129285237
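
For anyone landing here, a minimal sketch of what raising the timeout might look like from Python. It assumes the timeout is set through storage_options using the "timeout" and "connect_timeout" keys of the underlying object store client; the helper names (with_timeouts, read_cdf) and the "300s" / "30s" values are hypothetical and only for illustration, so check the linked comment for the exact setting that was recommended.

```python
def with_timeouts(storage_options=None, timeout="300s", connect_timeout="30s"):
    """Return a copy of storage_options with client timeouts filled in.

    Assumption: the "timeout" and "connect_timeout" keys are passed through
    to the S3 client. Values already set by the caller are left untouched.
    """
    options = dict(storage_options or {})
    options.setdefault("timeout", timeout)
    options.setdefault("connect_timeout", connect_timeout)
    return options


def read_cdf(table_uri, starting_version, storage_options=None):
    """Load the change data feed with longer client timeouts (hypothetical helper)."""
    # Imported lazily so with_timeouts() can be used without deltalake installed.
    from deltalake import DeltaTable

    dt = DeltaTable(table_uri, storage_options=with_timeouts(storage_options))
    # Same call as in the report above, on a table opened with the new options.
    return dt.load_cdf(starting_version=starting_version).read_all()
```

Calling it would look like read_cdf("s3://my-bucket/my-table", starting_version=delta_max_version), where the bucket path is a placeholder. If the default values are not enough for a very large table, pass a larger timeout explicitly through storage_options.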