Open cldellow opened 5 years ago
After a closer look: allowing repeated keys/names in MetaData would require changing the underlying JSON implementation. That may not even be worth exploring, given that it would hardly make it into the upstream code, cf. iipc/webarchive-commons#84.
Opened iipc/webarchive-commons#98 for a solution that would break backward compatibility, because the values of repeated headers are stored as a JSONArray.
See discussion in https://github.com/commoncrawl/ia-web-commons/pull/16#issuecomment-513296836
The HTTP protocol permits multiple headers with the same name to be returned in a response.
This happens commonly; for example, the `Cache-Control`, `Vary` and `Set-Cookie` headers are often present more than once.

WAT extracts don't currently capture multiple headers due to https://github.com/commoncrawl/ia-web-commons/blob/c7e79be728f795e7a4bc0cd34475fcfef529838a/src/main/java/org/archive/resource/http/HTTPResponseResource.java#L64, which only captures the last header of a given name.
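
To illustrate the "last header wins" effect, here is a minimal Python sketch. It is not the code in `HTTPResponseResource.java`; the header values and the function names `last_wins` and `keep_all` are made up for illustration. It only shows how storing headers in a map keyed by name drops earlier values, and how collecting values per name (roughly the array-valued approach) keeps them.

```python
# Hypothetical response headers as (name, value) pairs, in wire order.
raw_headers = [
    ("Set-Cookie", "a=1"),
    ("Vary", "Accept-Encoding"),
    ("Set-Cookie", "b=2"),
]


def last_wins(headers):
    """Store each header under its name; a repeated name overwrites the earlier value."""
    out = {}
    for name, value in headers:
        out[name] = value
    return out


def keep_all(headers):
    """Collect every value per name in a list, so repeated headers survive."""
    out = {}
    for name, value in headers:
        out.setdefault(name, []).append(value)
    return out


print(last_wins(raw_headers))
print(keep_all(raw_headers))
```

With `last_wins`, the first `Set-Cookie` value is lost; with `keep_all`, every value becomes a list, which changes the shape of the output for existing consumers.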
The JSON spec permits an object to have duplicate keys, so the WAT extracts could capture the underlying response more faithfully by overriding `JSONObject.encode` on `MetaData` to handle this scenario.

This change will be invisible to users using JSON parsers that bind to an object (e.g. JavaScript's `JSON.parse('{"a": "a", "a": "b"}') => {"a": "b"}`). If an end-user wants to access the entire set of headers, they'll likely have to use a slightly less user-friendly interface to their JSON library.
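
As a rough sketch of what this could look like from a consumer's side, the Python example below writes an object with a repeated key and reads it back in two ways. The helper `encode_pairs` and the sample header values are hypothetical and only stand in for whatever the WAT writer would emit; `object_pairs_hook` is a real parameter of Python's standard `json.loads`.

```python
import json


def encode_pairs(pairs):
    """Serialize (key, value) pairs as a JSON object, keeping repeated keys (illustrative helper)."""
    members = (f"{json.dumps(k)}: {json.dumps(v)}" for k, v in pairs)
    return "{" + ", ".join(members) + "}"


# Hypothetical headers with a repeated name.
doc = encode_pairs([
    ("Set-Cookie", "a=1"),
    ("Vary", "Accept-Encoding"),
    ("Set-Cookie", "b=2"),
])

# Default binding to a dict: the last value for a repeated key wins.
as_dict = json.loads(doc)

# object_pairs_hook receives every (key, value) pair in document order,
# so repeated keys are preserved.
as_pairs = json.loads(doc, object_pairs_hook=list)

print(as_dict)
print(as_pairs)
```

The default call behaves like the `JSON.parse` example, while the `object_pairs_hook` variant is the "less user-friendly" path: the caller gets a list of pairs instead of a mapping and has to group values by name.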