apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.62k stars 802 forks source link

Object store AWS/S3 client and gRPC misconfiguration #6760

Open crepererum opened 1 day ago

crepererum commented 1 day ago

Describe the bug :warning: I understand that this is a very niche issue, but I thought I share this trap with others.

If you accidentally point an S3 object_store client to an gRPC endpoint, it will happily read empty objects for most paths (i.e. all paths that are not covered by the gRPC endpoint). This can become quite a debug nightmare.

To Reproduce Set up a gRPC server, e.g. w/ tonic. The code example for the client sets this up to port 1234.

Then configure an S3 Client to point at it. It's important that you

let store = object_store::aws::AmazonS3Builder::new()
    .with_bucket_name("dummy")
    .with_client_options(
        object_store::ClientOptions::new()
            .with_allow_http(true)
            .with_http2_only(),
    )
    .with_endpoint("http://localhost:1234")
    .with_skip_signature(true)
    .build()
    .unwrap();

Expected behavior I was naively expecting the client to error.

Additional context gRPC for some bizarre reasons decides to not use the HTTP status code at all but instead a custom response header grpc-status. In our case, this is set to 12 for UNIMPLEMENTED, see https://grpc.github.io/grpc/core/md_doc_statuscodes.html .

The response body for UNIMPLEMENTED is empty. The content-length response header is set to 0 (that's required by the object_store client).

:arrow_right: So I think what we could do as some kind of safeguard would be to check the grpc-status response header and bail out if it is set.

Xuanwo commented 1 day ago

Interesting, so the gRPC server will return an empty body with a 200 status code, which coincidentally is a valid S3 response?

crepererum commented 1 day ago

S3 responses for GET operations are just the pure data, there's NO wrapping of the bytes into any object like JSON or XML. All metadata is transmitted via headers, which our client doesn't need (also to support alternative S3 implementations more easily).

tustvold commented 1 day ago

IMO this isn't something we should look to handle, if you point it at an HTTP server that returns 200 OK, how is it to know that this is wrong? Sure we could handle grpc-status, but it seems rather odd to me to be inspecting a protocol specific header.