Open dreamtalen opened 4 days ago
Thanks for reporting this, @dreamtalen. We should be retrying chunk fetch operations up to 3 times. To help us reproduce this, could you give us more information about your environment? Which object store are you using, what latency do you see to the region your data is in, and what array size do you need to reproduce the issue consistently? Anything else you can think of that could help us reproduce it would be appreciated; we haven't seen this issue before.
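For readers unfamiliar with the pattern, a bounded retry loop with exponential backoff can be sketched as below. This is a generic Python illustration, not Icechunk's actual retry code. The function name `retry_call`, the backoff delays, and the reading of "up to 3 times" as three attempts in total are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_call(operation: Callable[[], T], max_attempts: int = 3,
               base_delay: float = 0.1) -> T:
    """Call `operation`, retrying on any exception, up to `max_attempts` calls in total.

    Hypothetical helper: after the n-th failure it sleeps base_delay * 2**(n - 1)
    seconds (exponential backoff). If every attempt fails, the last exception
    is re-raised unchanged.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

A failure that is not surfaced as a retryable error (for example, because the store's response is classified differently, as suggested later in this thread) would bypass a loop like this entirely, which matches the "fails on the first attempt" observation.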
Hi @paraseba,
We’re currently using SwiftStack, which is an S3-compatible object store.
I can reproduce this issue about 90% of the time with a 64 GB array. While I don’t have specific latency numbers, both my client and the object storage server are located in the same cluster, so latency should be quite low.
Additionally, I haven’t seen any retry attempts in the log; the operation appears to fail on the first attempt.
@dreamtalen I wonder if there is some small difference or incompatibility in the error code we get from SwiftStack that doesn't trigger a retry in our S3 library. Do you get any more information about the response to the failing request if you set RUST_LOG=debug in the environment?
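The suggestion above amounts to running the reproduction with `RUST_LOG=debug` in the process environment. A minimal Python sketch follows; the helper name `env_with_rust_log` and the reproduction script passed on the command line are hypothetical. Setting the variable in the environment of a freshly launched process means it is present from startup, so it does not depend on when the library reads it.

```python
import os
import subprocess
import sys


def env_with_rust_log(level: str = "debug") -> dict:
    """Return a copy of the current environment with RUST_LOG set to `level`.

    The parent's os.environ is left untouched.
    """
    env = dict(os.environ)
    env["RUST_LOG"] = level
    return env


if __name__ == "__main__":
    # Hypothetical usage: `python run_debug.py repro.py`, where repro.py is your
    # own script that reads the array; redirect the output to a file to keep it.
    if len(sys.argv) > 1:
        subprocess.run([sys.executable, sys.argv[1]], env=env_with_rust_log(), check=False)
```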
Hi @paraseba, I gave it a try, but I couldn’t reproduce the error today. I agree it might be due to a difference in SwiftStack. If you’re not able to reproduce it with AWS S3, feel free to close the issue.
@dreamtalen - Let's definitely keep this issue open until we are certain it is resolved! We are 100% committed to supporting your use case. It's just a little hard for us to debug directly.
Hi icechunk team,
When attempting to read a large Zarr array using Icechunk, I encounter an intermittent error:
ValueError: store error: unsuccessful repository operation: error contacting storage error streaming bytes from object store streaming error
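Until retries cover this error, one user-side mitigation is to read the array in slabs along its first axis and retry only the slab that fails, rather than re-running a full `zarr_array[:]` read. The sketch below is an assumption, not an Icechunk feature: `read_in_slabs` and `is_streaming_error` are hypothetical helpers, and treating a `ValueError` whose message contains "streaming error" as transient is a heuristic based only on the message in this report.

```python
import time
from typing import Any, Iterator, Tuple

TRANSIENT_MARKER = "streaming error"  # substring of the error text in this report


def is_streaming_error(exc: BaseException) -> bool:
    """Heuristic: treat ValueErrors that mention a streaming error as transient."""
    return isinstance(exc, ValueError) and TRANSIENT_MARKER in str(exc)


def read_in_slabs(array: Any, slab_size: int, max_attempts: int = 3,
                  delay: float = 0.5) -> Iterator[Tuple[int, int, Any]]:
    """Yield (start, stop, data) for consecutive slabs of `array` along axis 0.

    Each slab is read with array[start:stop]. A slab that fails with a streaming
    error is re-read, up to max_attempts reads in total, sleeping `delay` seconds
    between reads; any other error propagates immediately.
    """
    n = array.shape[0]
    for start in range(0, n, slab_size):
        stop = min(start + slab_size, n)
        for attempt in range(1, max_attempts + 1):
            try:
                data = array[start:stop]
                break
            except ValueError as exc:
                if not is_streaming_error(exc) or attempt == max_attempts:
                    raise
                time.sleep(delay)
        yield start, stop, data
```

With NumPy available, `out = np.empty(zarr_array.shape, dtype=zarr_array.dtype)` followed by `for start, stop, data in read_in_slabs(zarr_array, zarr_array.chunks[0]): out[start:stop] = data` rebuilds the full array. Choosing `slab_size` as a multiple of `zarr_array.chunks[0]` keeps each slab aligned to whole chunks, so a retry re-fetches only one slab's chunks. Whether a fresh read succeeds after an interruption from SwiftStack has not been verified.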
This issue can be reproduced simply by reading the whole array with `zarr_array[:]`, and it occurs more frequently as the array size increases. We think it is caused by occasional streaming interruptions from the object store server, so a retry would help here. Below is the full traceback for reference: