earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

Intermittent `streaming bytes from object store error` when reading Zarr array from object store #363

Open dreamtalen opened 4 days ago

dreamtalen commented 4 days ago

Hi icechunk team,

When attempting to read a large Zarr array through Icechunk, I encounter an intermittent error: `ValueError: store error: unsuccessful repository operation: error contacting storage error streaming bytes from object store streaming error`.

This issue can be reproduced simply with `zarr_array[:]`, and it occurs more frequently as the array size increases. We suspect it is caused by occasional streaming interruptions on the object store side; a retry at the fetch level would likely resolve it.
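For context, here is roughly the read path that triggers it. The bucket, prefix, endpoint, and array name are hypothetical placeholders, and the store-opening calls follow the icechunk-python API as I understand it, so they may differ between releases:

```python
# Hypothetical reproduction sketch; all names and credentials are placeholders.
import icechunk
import zarr

# Storage config pointing at an S3-compatible object store (SwiftStack here).
storage = icechunk.StorageConfig.s3_from_config(
    bucket="my-bucket",                        # placeholder
    prefix="my-prefix",                        # placeholder
    endpoint_url="https://swift.example.com",  # placeholder SwiftStack endpoint
    credentials=icechunk.S3Credentials(
        access_key_id="...",
        secret_access_key="...",
    ),
)
store = icechunk.IcechunkStore.open_existing(storage=storage, mode="r")

# Open the repository contents with zarr and read the whole array.
group = zarr.open_group(store=store, mode="r")
zarr_array = group["my_array"]  # hypothetical array name
data = zarr_array[:]            # intermittently raises the ValueError above
```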

Below is the full traceback for reference:

File "/home/yongmingd/benchmark-zarr3/read_icechunk.py", line 12, in read_chunks
    data = zarr_array[:]
           ~~~~~~~~~~^^^
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/core/array.py", line 1709, in __getitem__
    return self.get_orthogonal_selection(pure_selection, fields=fields)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/_compat.py", line 43, in inner_f
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/core/array.py", line 2151, in get_orthogonal_selection
    return sync(
           ^^^^^
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/core/sync.py", line 141, in sync
    raise return_result
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/core/sync.py", line 100, in _runner
    return await coro
           ^^^^^^^^^^
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/core/array.py", line 958, in _get_selection
    await self.codec_pipeline.read(
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/codecs/pipeline.py", line 440, in read
    await concurrent_map(
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/core/common.py", line 65, in concurrent_map
    return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/core/common.py", line 63, in run
    return await func(*item)
           ^^^^^^^^^^^^^^^^^
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/codecs/pipeline.py", line 262, in read_batch
    chunk_bytes_batch = await concurrent_map(
                        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/core/common.py", line 65, in concurrent_map
    return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/core/common.py", line 63, in run
    return await func(*item)
           ^^^^^^^^^^^^^^^^^
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/storage/common.py", line 71, in get
    return await self.store.get(self.path, prototype=prototype, byte_range=byte_range)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yongmingd/icechunk/icechunk-python/python/icechunk/__init__.py", line 487, in get
    result = await self._store.get(key, byte_range)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: store error: unsuccessful repository operation: `error contacting storage error streaming bytes from object store streaming error`
Thread did not finish cleanly; forcefully closing the event loop.
Exception ignored in atexit callback: <function cleanup_resources at 0x7f5760ed7ec0>
Traceback (most recent call last):
  File "/home/yongmingd/zarr-v3-benchmark/zarr-python/src/zarr/core/sync.py", line 84, in cleanup_resources
    loop[0].close()
  File "/home/yongmingd/miniconda3/envs/zarr3/lib/python3.12/asyncio/unix_events.py", line 68, in close
    super().close()
  File "/home/yongmingd/miniconda3/envs/zarr3/lib/python3.12/asyncio/selector_events.py", line 101, in close
    raise RuntimeError("Cannot close a running event loop")
RuntimeError: Cannot close a running event loop
paraseba commented 4 days ago

Thanks for reporting this @dreamtalen. We should be retrying chunk fetch operations up to 3 times. To help us reproduce, could you give us more information about your environment? What object store are you using, what is your latency to the region the data lives in, and what array size do you need to reproduce the issue consistently? Anything else you can think of that could help us reproduce would be appreciated; we haven't seen this issue before.

dreamtalen commented 3 days ago

Hi @paraseba,

We’re currently using SwiftStack, an S3-compatible object store.

I can reproduce this issue ~90% of the time with a 64 GB array. While I don’t have a specific latency number, my client and the object storage server are in the same cluster, so latency should be quite low.

Additionally, I haven’t seen any retry attempts in the log; it appears that the operation fails on the first attempt.
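Until that's sorted out, a client-side retry around the read could serve as a stopgap. A minimal sketch of such a wrapper (this is generic retry logic, not icechunk's internal retry path; the attempt count and backoff are arbitrary):

```python
import time

def read_with_retries(zarr_array, selection=..., attempts=3, backoff=1.0):
    """Retry a zarr read that intermittently fails with a streaming error."""
    for attempt in range(1, attempts + 1):
        try:
            return zarr_array[selection]
        except ValueError as err:  # icechunk surfaces store errors as ValueError
            # Re-raise anything that isn't the streaming error, and the last attempt.
            if attempt == attempts or "streaming bytes from object store" not in str(err):
                raise
            time.sleep(backoff * attempt)  # linear backoff before the next attempt

data = read_with_retries(zarr_array)  # in place of zarr_array[:]
```

This retries the entire (multi-gigabyte) read rather than just the failing chunk fetch, so it's coarse; retrying individual chunk requests inside icechunk would be the proper fix.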

paraseba commented 3 days ago

@dreamtalen I wonder if there is some small difference or incompatibility in the error code we get from SwiftStack that doesn't trigger a retry in our S3 library. Do you get any more information about the response to the failing request if you set `RUST_LOG=debug` in the environment?
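In case it helps: `RUST_LOG` has to be in the environment before icechunk's native code initializes its logging, so the safest route is exporting it in the shell before starting Python (e.g. `RUST_LOG=debug python read_icechunk.py`). Setting it from Python should also work, assuming the logger is initialized when the module is first imported:

```python
import os

# Must be set before icechunk's Rust logger initializes; this assumes logging
# is configured at first import of the module.
os.environ["RUST_LOG"] = "debug"

import icechunk  # noqa: E402  (intentionally imported after setting the variable)
```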

dreamtalen commented 3 days ago

Hi @paraseba, I gave it a try, but I couldn’t reproduce the error today. I agree it might be due to a difference in SwiftStack. If you’re not able to reproduce it with AWS S3, feel free to close the issue.

rabernat commented 2 days ago

@dreamtalen - Let's definitely keep this issue open until we are certain it is resolved! We are 100% committed to supporting your use case. It's just a little hard for us to debug directly.