ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0

Pipeline fails when reading gzipped files in GCS #786

Open benjamin-awd opened 1 week ago

benjamin-awd commented 1 week ago

Came across a weird issue while trying to read gzipped NDJSON files from GCS: the pipeline fails with

{"message":"Error: Generic GCS error: Header: Content-Length Header missing from response. Retrying..."},"target":"arroyo_storage"}

Sending a HEAD request to the object confirms that the Content-Length header is indeed missing:

curl -I \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://storage.googleapis.com/$BUCKET_PATH/1731283272-ac141e8d-2350-47d0-9b97-74faf53bb407.log.gz"

HTTP/2 200
cache-control: private, max-age=0
last-modified: Mon, 11 Nov 2024 00:01:12 GMT
vary: Accept-Encoding
x-goog-generation: 1731283272904269
x-goog-metageneration: 1
x-goog-stored-content-encoding: gzip
x-goog-stored-content-length: 4719827
content-type: application/x-ndjson
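
A plausible explanation is GCS's decompressive transcoding: the object was uploaded with Content-Encoding: gzip (note x-goog-stored-content-encoding: gzip above), and when the client doesn't send Accept-Encoding: gzip, GCS decompresses the body on the fly, can't know the final size up front, and so omits Content-Length. If that's what's happening here, asking for the stored bytes directly should bring the header back (same request as above, plus an Accept-Encoding header):

curl -I \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Accept-Encoding: gzip" \
  "https://storage.googleapis.com/$BUCKET_PATH/1731283272-ac141e8d-2350-47d0-9b97-74faf53bb407.log.gz"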

I managed to work around the issue by patching the upstream object_store crate to fall back to x-goog-stored-content-length, but I'm not sure that's the right way to go about it:

// In object_store's GCS response handling: fall back to the
// x-goog-stored-content-length header when Content-Length is absent.
let content_length = headers
    .get(CONTENT_LENGTH)
    .or_else(|| headers.get("x-goog-stored-content-length"))
    .context(MissingContentLengthSnafu)?;
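
For anyone who wants to poke at the fallback in isolation, here's a minimal self-contained sketch of the same logic using the http crate's header types (object_size is a hypothetical helper for illustration, not the actual object_store API):

// Sketch: prefer Content-Length, then fall back to the GCS-specific
// x-goog-stored-content-length header (hypothetical helper, not the
// real object_store API).
use http::header::{HeaderMap, HeaderValue, CONTENT_LENGTH};

fn object_size(headers: &HeaderMap) -> Option<u64> {
    headers
        .get(CONTENT_LENGTH)
        .or_else(|| headers.get("x-goog-stored-content-length"))
        .and_then(|v| v.to_str().ok())
        .and_then(|s| s.parse::<u64>().ok())
}

fn main() {
    // Simulate the HEAD response above: no Content-Length, only the
    // stored (compressed) length reported by GCS.
    let mut headers = HeaderMap::new();
    headers.insert(
        "x-goog-stored-content-length",
        HeaderValue::from_static("4719827"),
    );
    assert_eq!(object_size(&headers), Some(4_719_827));
    println!("resolved size: {:?}", object_size(&headers));
}

One caveat with this fallback: x-goog-stored-content-length reports the size of the stored (compressed) object, so if GCS is transcoding the body on the fly it won't match the number of bytes actually returned.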

Is there something in Arroyo that can handle this?

(Could potentially be a similar issue to https://github.com/apache/libcloud/issues/1544.)