crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.45k stars 1.62k forks source link

Large (~13.8kb) HTTP gzip responses aren't decompressed #11252

Open syeopite opened 3 years ago

syeopite commented 3 years ago

Bug Report

If I have this server:

require "http/server"
require "compress/gzip"

server = HTTP::Server.new do |context|
  rand_ = Random.new(1234567899).hex(7071)
  io = IO::Memory.new(rand_)

  Compress::Gzip::Writer.open(context.response) do |gzip|
    IO.copy(io, gzip)
  end

  context.response.headers["content-encoding"] = "gzip"
  context.response.content_type = "text/plain"
end

address = server.bind_tcp 8080
puts "Listening on http://#{address}"
server.listen

And a client requesting it like so:


require "http"
require "json"
require "uri"

client = HTTP::Client.new(URI.parse("http://localhost:8080"))
client.compress = true
results = client.get("/")

puts results.body

I'd expect Crystal to decompress the response automatically, and give me the resulting data. While this is the behavior for smaller responses, it seems like anything larger than ~13.8kb causes this step to be skipped. The above code would just return random gibberish binary data, instead of the long hex string.

ysbaddaden commented 3 years ago

I reported a similar issue to @straight-shoota yesterday where HTTP:Client response's body is binary. I have yet to determine whether it's gibberish or if it didn't deflate the gzip encoding. It's very intermittent, and happens for small data (1kb) for a specific HTTP domain (api.airtable.com). Other domains used are fine.

straight-shoota commented 3 years ago

The issue here is that you write the content to the HTTP response before setting the Content-Encoding header. That means the content is sent without that header and the client has no idea that it is supposed to decompress it. You can test that with any HTTP client.

It works with smaller content sizes because the data is buffered.

You have to move the configuration lines setting content encoding and content type above the IO copy.

@ysbaddaden This appears to be unrelated to your problem.

straight-shoota commented 3 years ago

This is not a bug, but I believe we can improve the developer experience by trying to raise an error on incorrect usage.

Response#content_type= could easily raise an error if headers have already been written (we already have check_headers for that).

This is not easy to implement with setting arbitrary headers (such as in response.headers["content-encoding"] = "gzip") because the mutation happens in the Headers object which has no reference to the request and thus can't know whether it still accepts headers.

Maybe we could consider adding set_header and get_header methods to Response which could check for headers written like content_type.

syeopite commented 3 years ago

The issue here is that you write the content to the HTTP response before setting the Content-Encoding header. That means the content is sent without that header and the client has no idea that it is supposed to decompress it. You can test that with any HTTP client.

Oh thanks! But, the server code above was actually an attempt to replicate the issue I was getting locally, and it looks like I messed up on that. Oops 😅. From your explanation, the actual problem I'm getting might be more related to what @ysbaddaden is experiencing.

But the gist of it is the same; HTTP::Client doesn't seem to decompress correctly. Here's a example source code that relies on the website I was querying:

require "http"
require "json"
require "uri"

client = HTTP::Client.new(URI.parse("https://www.youtube.com"))
client.compress = true

post_data = {"context" => {"client" => {"hl" => "en", "gl" => "US", "clientName" => "WEB",
                                        "clientVersion" => "2.20210721.00.00", "clientScreen" => "WATCH_FULL_SCREEN"}},
"continuation" => "4qmFsgJIEhhVQ1h1cVNCbEhBRTZYdy15ZUpBMFR1bncaLEVnWjJhV1JsYjNNd0FqZ0JZQUZxQUxnQkFDQUE2Z01JUTJkU1JGRlZSVGs9",
}.to_json

yt_headers = HTTP::Headers{
  "Content-Type"    => "application/json",
  "Accept-Encoding" => "gzip",
}

results = client.post("/youtubei/v1/browse?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8", body: post_data,
  headers: yt_headers)

# Should be string but instead it's some random binary data
results.body

This returns gibberish binary data, but replicating the same query on curl does result in the correct (and decompressed) JSON response.

straight-shoota commented 3 years ago

You're actively opting out of the automatic decompression by setting the Accept-Encoding header. This is documented here: https://crystal-lang.org/api/1.1.1/HTTP/Client.html#compression

You don't need to assign client.compress = true because that's the default. But you must not set a custom Accept-Encoding header or the client won't automatically decompress.

SamantazFox commented 3 years ago

You're actively opting out of the automatic decompression by setting the Accept-Encoding header. This is documented here: https://crystal-lang.org/api/1.1.1/HTTP/Client.html#compression

Though, it is not explained that passing the Accept-Encoding header disables the automatic decompression!

straight-shoota commented 3 years ago

If [...] no ˋAccept-Encodingˋ header is explicitly specified, ...

BrucePerens commented 2 years ago

I am hitting this or #11354, In the body the first bytes in the returned string are reported as "\u001F\x8B". I'm a lttle confused about whether this is reporting an extra zero byte or not. The Conttent-Encoding header says "gzip". It's intermittent - does not always occurr on the same server response.