commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

WAT extraction: handle duplicate HTTP response headers #18

Open cldellow opened 5 years ago

cldellow commented 5 years ago

See discussion in https://github.com/commoncrawl/ia-web-commons/pull/16#issuecomment-513296836

The HTTP protocol (RFC 2616, section 4.2) permits multiple headers with the same name to be returned in a response:

Multiple message-header fields with the same field-name MAY be present in a message if and only if the entire field-value for that header field is defined as a comma-separated list [i.e., #(values)]. It MUST be possible to combine the multiple header fields into one "field-name: field-value" pair, without changing the semantics of the message, by appending each subsequent field-value to the first, each separated by a comma. The order in which header fields with the same field-name are received is therefore significant to the interpretation of the combined field value, and thus a proxy MUST NOT change the order of these field values when a message is forwarded.

This happens commonly: for example, the Cache-Control, Vary and Set-Cookie headers are often present more than once.
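To make the combining rule quoted above concrete, here is a minimal Python sketch. The function name `combine_headers` and the list-of-pairs input are assumptions for illustration only; nothing here is taken from ia-web-commons.

```python
def combine_headers(headers):
    """Fold repeated headers into one comma-separated value per field name.

    `headers` is a list of (name, value) pairs in the order received.
    Field names are case-insensitive, so they are lower-cased for grouping.
    """
    grouped = {}
    for name, value in headers:
        # dicts preserve insertion order, so the first-seen order of names
        # and the arrival order of values are both kept.
        grouped.setdefault(name.lower(), []).append(value.strip())
    return {name: ", ".join(values) for name, values in grouped.items()}


received = [
    ("Cache-Control", "no-cache"),
    ("Vary", "Accept-Encoding"),
    ("Cache-Control", "no-store"),
]
combined = combine_headers(received)
```

Here `combined["cache-control"]` is `"no-cache, no-store"`: the later value is appended to the first instead of replacing it.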

WAT extracts don't currently capture multiple headers due to https://github.com/commoncrawl/ia-web-commons/blob/c7e79be728f795e7a4bc0cd34475fcfef529838a/src/main/java/org/archive/resource/http/HTTPResponseResource.java#L64, which only captures the last header of a given name.

The JSON spec does not forbid duplicate keys in an object (RFC 8259 says only that names SHOULD be unique), so the WAT extracts could capture the underlying response more faithfully by overriding JSONObject.encode on MetaData to emit one key per header occurrence.
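As a rough illustration of the output shape this would produce, the sketch below serializes a list of header pairs into a JSON object that repeats keys. It is written in Python rather than against the Java MetaData class, and `encode_with_duplicates` is a hypothetical helper, not an existing API.

```python
import json


def encode_with_duplicates(pairs):
    """Serialize (name, value) pairs as one JSON object, repeating keys as needed.

    json.dumps is used only to quote and escape each string; the object
    itself is assembled by hand because a dict cannot hold duplicate keys.
    """
    members = (f"{json.dumps(name)}:{json.dumps(value)}" for name, value in pairs)
    return "{" + ",".join(members) + "}"


encoded = encode_with_duplicates([("Vary", "Accept"), ("Vary", "User-Agent")])
```

With this input, `encoded` is `{"Vary":"Accept","Vary":"User-Agent"}`, which keeps both header occurrences in their original order.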

This change will be invisible to users whose JSON parsers bind to an object; for example, JavaScript's JSON.parse('{"a": "a", "a": "b"}') returns {"a": "b"}. An end user who wants to access the entire set of headers will likely have to use a slightly less user-friendly interface of their JSON library.
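For a sense of what that less friendly interface might look like, here is a sketch using the `object_pairs_hook` parameter of `json.loads` in Python's standard library. The sample document and the helper `collect_pairs` are made up for illustration and are not real WAT output.

```python
import json

doc = '{"Vary": "Accept", "Vary": "User-Agent", "Server": "nginx"}'

# Default behaviour: the parser binds to a dict, so the last duplicate wins.
last_wins = json.loads(doc)


def collect_pairs(pairs):
    """Group every value under its key so that duplicates are not lost."""
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name, []).append(value)
    return grouped


# The hook receives each object's members as an ordered list of pairs.
# Note that it is applied to every JSON object in the document, nested ones too.
all_values = json.loads(doc, object_pairs_hook=collect_pairs)
```

`last_wins["Vary"]` is `"User-Agent"`, while `all_values["Vary"]` is `["Accept", "User-Agent"]`. The cost is that every value becomes a list, which is the usability trade-off mentioned above.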