cppalliance / http_proto

HTTP/1 parsing and serialization algorithms using C++11
https://develop.http-proto.cpp.al/
Boost Software License 1.0

`Transfer-Encoding` and `Content-Encoding` #109

Open ashtum opened 2 months ago

ashtum commented 2 months ago

To implement automatic decoding in the parser, we first need to detect the encoding of the body. This task is complicated by the existence of two headers that determine the encoding: Content-Encoding and Transfer-Encoding. While both influence the decoding process, they serve different purposes. The Transfer-Encoding header, in particular, is designed for use by proxies, as it is a hop-by-hop header applied to a message between two nodes rather than to the resource itself. Consequently, each segment of a multi-node connection may use a different Transfer-Encoding value.
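As a rough illustration of the detection step, here is a minimal sketch that keeps the two headers separate, since one is hop-by-hop and the other describes the resource. The names `split_codings`, `body_codings`, and `detect_codings` are hypothetical and are not part of http_proto. The sketch assumes the raw field values have already been extracted, and it ignores coding parameters and quoted strings.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a comma-separated coding list such as
// "gzip, chunked" into lowercase tokens. Simplification: all spaces
// and tabs are dropped, and parameters are not interpreted.
std::vector<std::string>
split_codings(std::string const& value)
{
    std::vector<std::string> out;
    std::string tok;
    for (std::size_t i = 0; i <= value.size(); ++i)
    {
        if (i == value.size() || value[i] == ',')
        {
            if (!tok.empty())
                out.push_back(tok);
            tok.clear();
        }
        else if (value[i] != ' ' && value[i] != '\t')
        {
            tok.push_back(static_cast<char>(
                std::tolower(static_cast<unsigned char>(value[i]))));
        }
    }
    return out;
}

// Hypothetical result type: the two lists are kept apart on purpose.
struct body_codings
{
    std::vector<std::string> transfer; // from Transfer-Encoding (hop-by-hop)
    std::vector<std::string> content;  // from Content-Encoding (end-to-end)
};

body_codings
detect_codings(
    std::string const& transfer_encoding,
    std::string const& content_encoding)
{
    body_codings bc;
    bc.transfer = split_codings(transfer_encoding);
    bc.content = split_codings(content_encoding);
    return bc;
}
```

For example, `detect_codings("gzip, Chunked", "deflate")` would yield `transfer == {"gzip", "chunked"}` and `content == {"deflate"}`.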

Here is what RFC 7230 says about Transfer-Encoding:

   Transfer-Encoding is primarily intended to accurately
   delimit a dynamically generated payload and to distinguish payload
   encodings that are only applied for transport efficiency or security
   from those that are characteristics of the selected resource.

   A recipient MUST be able to parse the chunked transfer coding
   (Section 4.1) because it plays a crucial role in framing messages
   when the payload body size is not known in advance.  A sender MUST
   NOT apply chunked more than once to a message body (i.e., chunking an
   already chunked message is not allowed).  If any transfer coding
   other than chunked is applied to a request payload body, the sender
   MUST apply chunked as the final transfer coding to ensure that the
   message is properly framed.  If any transfer coding other than
   chunked is applied to a response payload body, the sender MUST either
   apply chunked as the final transfer coding or terminate the message
   by closing the connection.

   For example,

     Transfer-Encoding: gzip, chunked

   indicates that the payload body has been compressed using the gzip
   coding and then chunked using the chunked coding while forming the
   message body.

   Unlike Content-Encoding (Section 3.1.2.1 of [RFC7231]),
   Transfer-Encoding is a property of the message, not of the
   representation, and any recipient along the request/response chain
   MAY decode the received transfer coding(s) or apply additional
   transfer coding(s) to the message body, assuming that corresponding
   changes are made to the Transfer-Encoding field-value.  Additional
   information about the encoding parameters can be provided by other
   header fields not defined by this specification.
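
The framing rules in the quoted text could be checked roughly as follows. This is only a sketch under my own assumptions: `framing` and `classify_transfer_codings` are hypothetical names, not http_proto API, and the input is assumed to be the Transfer-Encoding list in header order, already lowercased.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class framing
{
    none,        // no transfer codings listed
    chunked,     // chunked is the final coding, body is self-delimiting
    until_close, // response only: body ends when the connection closes
    invalid      // violates the rules quoted above
};

// Hypothetical check of a Transfer-Encoding coding list.
framing
classify_transfer_codings(
    std::vector<std::string> const& codings,
    bool is_request)
{
    if (codings.empty())
        return framing::none;

    std::size_t n = 0;
    for (auto const& c : codings)
        if (c == "chunked")
            ++n;
    if (n > 1)
        return framing::invalid; // chunked applied more than once

    if (codings.back() == "chunked")
        return framing::chunked;

    // Chunked is not the final coding: a request cannot be framed,
    // a response is terminated by closing the connection.
    return is_request ? framing::invalid : framing::until_close;
}
```

With this sketch, `Transfer-Encoding: gzip, chunked` classifies as `framing::chunked` for both requests and responses, so the parser would remove the chunked framing first and only then decompress.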

However, from what I can find online, it seems that in practice only the chunked Transfer-Encoding is commonly implemented by servers and client tools:

Another complicating factor is the potential for Content-Encoding to contain multiple encoding methods. The header lists them in the order in which they were applied, so they must be decoded in the reverse order (last listed first), but our current design only supports a single decoder (filter):

Content-Encoding: deflate, gzip
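
To make the ordering concrete: RFC 7231 has the sender list content codings in the order they were applied, so a recipient undoes them from the last listed to the first. A minimal sketch, where `decode_order` is a hypothetical helper and not something that exists in http_proto today:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: given codings in header order (the order in
// which they were applied), return the order in which the decoders
// would have to run.
std::vector<std::string>
decode_order(std::vector<std::string> applied)
{
    std::reverse(applied.begin(), applied.end());
    return applied;
}
```

For the header above, `decode_order({"deflate", "gzip"})` gives `{"gzip", "deflate"}`, meaning a chain of two filters would be needed rather than the single one the current design allows.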

I couldn't find sufficient evidence to determine whether multiple encoding methods are commonly used in practice. The closest related discussion I found is: how to disable Nginx double gzip encoding.

Assuming that multiple encodings in Content-Encoding are rarely encountered, the following approach could be considered for implementation:

vinniefalco commented 2 months ago

Currently, the decision to encode or decode is a manual process delegated to the user. For now I think this is fine, as it lets us develop the rest of the code, which is more complicated.