dart-lang / http_parser

A platform-independent Dart package for parsing and serializing HTTP formats.
https://pub.dev/packages/http_parser
BSD 3-Clause "New" or "Revised" License

Default charset for "application/json" incorrect #39

Open isben opened 3 years ago

isben commented 3 years ago

When a response is received with the Content-Type header set to "application/json" (without specifying the charset), the parser incorrectly assumes the response is encoded as ISO-8859-1 (Latin-1). The comments in the code refer to RFC-2616 and incorrectly conclude that the default encoding must be Latin-1. While the RFC does mention a Latin-1 default encoding, it does so only for text responses. No default is assumed for any other media type.

On the other hand, the IANA document describing the "application/json" media type (https://www.iana.org/assignments/media-types/application/json) says in its final note: "No 'charset' parameter is defined for this registration. Adding one really has no effect on compliant recipients." It is therefore correct not to add "charset=utf-8" to the "application/json" Content-Type header.

Finally, RFC-8259 says explicitly that JSON text should always be treated as Unicode, commonly encoded as UTF-8.

Hence, from all of the above, the correct default charset for the "application/json" media type must be "utf-8", whether or not a charset is present in the Content-Type header. The same reasoning likely applies to other media types, but I didn't research any further.
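The proposed behaviour can be sketched in a few lines of Dart. This is only an illustration under stated assumptions: `encodingForBody` is a hypothetical helper, not part of http_parser, and it takes the media type and charset as plain strings instead of parsing a real header. The Latin-1 fallback mirrors the current default described in this report.

```dart
import 'dart:convert';

/// Hypothetical helper, not part of http_parser: picks the decoder for a body
/// given its media type (e.g. "application/json") and optional charset value.
Encoding encodingForBody(String mimeType, String? charset) {
  final type = mimeType.trim().toLowerCase();

  // Per the reasoning above, JSON is decoded as UTF-8 whether or not a
  // charset parameter is present.
  if (type == 'application/json') return utf8;

  // Otherwise honor an explicit charset that dart:convert recognizes and
  // fall back to Latin-1, the default this report describes.
  return Encoding.getByName(charset) ?? latin1;
}
```

With this sketch, `encodingForBody('application/json', null).decode(bytes)` would decode a JSON body as UTF-8 even when the server sends no charset.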

kevmoo commented 3 years ago

Interesting idea! PR welcome!

wanjm commented 3 years ago

Yes, I think http_parser should give the correct default charset.

lrhn commented 2 years ago

What the "correct default charset" would be is a very complex topic. RFC 6657 updates RFC 2046 and ... possibly RFC 2616, I'm not even sure.

In RFC 6657, there is no default for text/* in general. A text subtype should either not support a charset parameter in the cases where the encoding is embedded in the data (like text/html where you can specify the charset in an HTML header tag), or it must mandate the charset parameter. The only exception is text/plain which keeps US-ASCII as default.

So, the proper default is no default, except for text/plain, and you should always have to explicitly specify an encoding if one isn't specified in a charset parameter.

For non-text/* media types, we don't have any reasonable chance of knowing all the special cases. We might be able to recognize text/json and text/...+json and treat them specially (default to UTF-8).

Or we could introduce an open registry of mediatype/subtype ↦ encoding mappings which people can add to, and which starts with text/plain ↦ ascii, text/html ↦ latin1 and text/(json|.*\+json)$ ↦ utf8. Then protocols needing other encodings can add them and get the defaults they want. (Basically, a "configure the default encoding on the side" registry which the HTTP implementation interacts with to get its defaults, rather than hardwiring anything into the HTTP implementation itself.)
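A minimal sketch of what such a registry could look like, using only `dart:convert`. Every name here (`DefaultEncodingRegistry`, `register`, `lookup`, `seededRegistry`) is hypothetical and does not exist in http_parser. Two things are assumptions on my part: that the text/html starting entry means Latin-1, and that later registrations should take precedence over earlier ones.

```dart
import 'dart:convert';

/// Hypothetical registry of media-type patterns to default encodings.
/// None of these names exist in http_parser; they only illustrate the idea.
class DefaultEncodingRegistry {
  final List<MapEntry<RegExp, Encoding>> _entries = [];

  /// Registers [encoding] as the default for media types matching [pattern].
  /// New entries go to the front, so later registrations win (an assumption).
  void register(RegExp pattern, Encoding encoding) {
    _entries.insert(0, MapEntry(pattern, encoding));
  }

  /// Returns the default for [mimeType], or null when nothing is registered.
  Encoding? lookup(String mimeType) {
    final type = mimeType.toLowerCase();
    for (final entry in _entries) {
      if (entry.key.hasMatch(type)) return entry.value;
    }
    return null;
  }
}

/// Builds a registry seeded with the three starting entries from the comment.
/// Reading the text/html entry as Latin-1 is an assumption.
DefaultEncodingRegistry seededRegistry() {
  return DefaultEncodingRegistry()
    ..register(RegExp(r'^text/plain$'), ascii)
    ..register(RegExp(r'^text/html$'), latin1)
    ..register(RegExp(r'^text/(json|.*\+json)$'), utf8);
}
```

A protocol that wants UTF-8 for application/json could then call `seededRegistry()..register(RegExp(r'^application/json$'), utf8)`. A `null` result from `lookup` means "no default", leaving the caller to specify an encoding explicitly.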

(We could also recognize text/xml and text/...+xml, but there is no default; you're supposed to look for a BOM or encoding header inside the XML if there is no charset parameter.)
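As a rough illustration of the BOM half of that lookup, here is a hedged Dart sketch. `encodingNameFromBom` is a hypothetical helper, not an http_parser API. It covers only the UTF-8 and UTF-16 byte-order marks and does not read the XML encoding declaration, which a real implementation would also need to do.

```dart
/// Hypothetical helper: returns the encoding name implied by a leading
/// byte-order mark, or null when the bytes carry no recognized BOM.
/// Only UTF-8 and UTF-16 are handled; UTF-32 BOMs are not distinguished.
String? encodingNameFromBom(List<int> bytes) {
  // True when [bytes] begins with every byte of [prefix], in order.
  bool startsWith(List<int> prefix) {
    if (bytes.length < prefix.length) return false;
    for (var i = 0; i < prefix.length; i++) {
      if (bytes[i] != prefix[i]) return false;
    }
    return true;
  }

  if (startsWith(const [0xEF, 0xBB, 0xBF])) return 'utf-8';
  if (startsWith(const [0xFE, 0xFF])) return 'utf-16be';
  if (startsWith(const [0xFF, 0xFE])) return 'utf-16le';
  return null;
}
```

A `null` result here means the caller would still have to inspect the XML declaration, or require an explicit encoding.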