apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.32k stars, 682 forks

`error decoding response body` after upgrade to object store 0.10 #5882

Open ion-elgreco opened 3 weeks ago

ion-elgreco commented 3 weeks ago

Describe the bug: We bumped object_store to 0.10 in delta-rs, and we are already seeing a couple of reports of the following error: `error decoding response body`. It happens on both Azure and S3.

See https://github.com/delta-io/delta-rs/issues/2595 and https://github.com/delta-io/delta-rs/issues/2592

To Reproduce: It seems to occur when reading tables or performing operations on them.

Expected behavior: The response body should decode without errors.

Additional context

@thomasfrederikhoeck @k-ye

tustvold commented 2 weeks ago

I think we would need a reproducer to action this. The linked issues don't even clearly implicate object_store.

Xuanwo commented 2 weeks ago

Please also print the source of the error using its Debug output. Usually this is caused by a connection reset or a similar network-related error.

ion-elgreco commented 2 weeks ago

@thomasfrederikhoeck @k-ye can you provide additional details, please?

thomasfrederikhoeck commented 1 week ago

@Xuanwo I would love to be of more help, but I don't know how to do this in delta-rs (and in turn object_store). Setting the timeout to 300s didn't help.

@ion-elgreco Can you point me in the direction of how I can provide better logs?

Xuanwo commented 1 week ago

> @Xuanwo I would love to be of more help, but I don't know how to do this in delta-rs (and in turn object_store). Setting the timeout to 300s didn't help.

Hi, if you can consistently reproduce this issue, please change the following places:

https://github.com/delta-io/delta-rs/blob/d17ed97b5bda0cadbc0df959f8fb38e275570c87/python/src/error.rs#L41-L51

fn object_store_to_py(err: ObjectStoreError) -> PyErr {
    match err {
        ObjectStoreError::NotFound { .. } => PyFileNotFoundError::new_err(err.to_string()),
        ObjectStoreError::Generic { source, .. }
            if source.to_string().contains("AWS_S3_ALLOW_UNSAFE_RENAME") =>
        {
            DeltaProtocolError::new_err(source.to_string())
        }
        _ => PyIOError::new_err(err.to_string()),
    }
}

Don't use err.to_string(); print its Debug message instead.
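To illustrate why the Debug output is more useful here, the following is a minimal, standard-library-only sketch. `StoreError` and `TimedOut` are hypothetical stand-ins and not the real object_store or reqwest types; the point is only that `to_string()` (Display) typically shows the outermost message, while `{:#?}` (Debug) also prints the nested source.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical stand-in for a low-level transport failure.
#[derive(Debug)]
struct TimedOut;

impl fmt::Display for TimedOut {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "operation timed out")
    }
}

impl Error for TimedOut {}

// Hypothetical wrapper error that hides its cause in Display.
#[derive(Debug)]
struct StoreError {
    store: &'static str,
    source: Box<dyn Error>,
}

impl fmt::Display for StoreError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Only the outer message is shown; the cause is not mentioned.
        write!(f, "Generic {} error: error decoding response body", self.store)
    }
}

impl Error for StoreError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

// What `err.to_string()` reports.
fn display_message(err: &StoreError) -> String {
    err.to_string()
}

// What a pretty Debug print reports, including the nested source field.
fn debug_message(err: &StoreError) -> String {
    format!("{err:#?}")
}

fn main() {
    let err = StoreError {
        store: "MicrosoftAzure",
        source: Box::new(TimedOut),
    };
    println!("Display: {}", display_message(&err));
    println!("Debug: {}", debug_message(&err));
}
```

In a conversion function like the one above, the equivalent change would be to build the Python error message from `format!("{err:?}")` rather than `err.to_string()`, at least temporarily while debugging.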

thomasfrederikhoeck commented 1 week ago

@Xuanwo Ah thanks!! I get the following consistently:

Generic {
    store: "MicrosoftAzure",
    source: reqwest::Error {
        kind: Decode,
        source: reqwest::Error {
            kind: Body,
            source: TimedOut,
        },
    },
}
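The nesting in this output mirrors the chain of `source()` links between the errors. As a rough sketch of how that chain could be flattened into a single log line, the snippet below walks `std::error::Error::source()` until it reaches the root cause. The `Layer` type and `error_chain` helper are hypothetical and only imitate the Decode -> Body -> TimedOut shape; they are not reqwest or object_store APIs.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical error layer with an optional underlying cause.
#[derive(Debug)]
struct Layer {
    kind: &'static str,
    source: Option<Box<dyn Error>>,
}

impl fmt::Display for Layer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.kind)
    }
}

impl Error for Layer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source.as_deref()
    }
}

// Collect the Display message of the error and of every cause beneath it,
// outermost first.
fn error_chain(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut messages = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        messages.push(cause.to_string());
        current = cause.source();
    }
    messages
}

fn main() {
    // Build a chain shaped like the Debug output: Decode -> Body -> TimedOut.
    let timed_out = Layer { kind: "TimedOut", source: None };
    let body = Layer { kind: "Body", source: Some(Box::new(timed_out)) };
    let decode = Layer { kind: "Decode", source: Some(Box::new(body)) };

    println!("{}", error_chain(&decode).join(" -> "));
}
```

Read this way, the innermost entry (`TimedOut`) is the root cause, and the outer layers only describe where in the request it surfaced.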

I also tried bumping the timeout to 600s. I still get _internal.DeltaError: Failed to parse parquet: Parquet error: Z-order failed while scanning data: ArrowError(ExternalError(General("ParquetObjectReader::get_byte_ranges error: Generic MicrosoftAzure error: error decoding response body")), None) but I never hit the debug print in this case. I am however seeing a lot of

[2024-06-24T21:09:33Z INFO  object_store::client::retry] Encountered transport error backing off for 0.1 seconds, retry 1 of 10: error sending request for url (REDACTED)
[2024-06-24T21:13:00Z DEBUG hyper_util::client::legacy::client] client connection error: error shutting down connection

Xuanwo commented 1 week ago

> [2024-06-24T21:09:33Z INFO object_store::client::retry] Encountered transport error backing off for 0.1 seconds, retry 1 of 10: error sending request for url (REDACTED)

I suspect there's an issue with the network connection between your environment and Azure.

Could you provide more details about your setup?

thomasfrederikhoeck commented 1 week ago

@Xuanwo It might be network related, but I have a feeling that it is related to how object_store or delta-rs handles throughput that is lower than within an Azure data center (some connections going stale while waiting for something else).

The benchmark took 1+ hours with no failure while the delta-rs call fails within a few minutes.