dart-lang / http

A composable API for making HTTP requests in Dart.
https://pub.dev/packages/http
BSD 3-Clause "New" or "Revised" License

Decode with utf8 by default for non-text (or all?) content types #175

Open cdvv7788 opened 6 years ago

cdvv7788 commented 6 years ago

I am requesting some information from a server, which returns the following (using Postman):

Headers:
Allow → GET, HEAD, OPTIONS
CF-RAY → 439e4801cf3db955-MIA
Connection → keep-alive
Content-Encoding → gzip
Content-Type → application/json

Body: { "name": "SARA LUCIA OSSA PEÑA", }

But, using http, I am getting the following (using a simple request get('url')): { "name": "SARA LUCIA OSSA PEÑA", }

To fix this, I had to do something like: UTF8.decode(response.bodyBytes)

This works as expected, and the information is retrieved fine. This, however, is a pain to set up (and inconsistent with post, where utf8 is used as the default encoding).
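A minimal sketch of why the workaround helps, assuming the server sends UTF-8 bytes without a charset parameter. It uses only dart:convert; the helper names are made up for illustration and are not part of the http package.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes bytes the way a latin1 fallback would,
/// one character per byte.
String decodeAsLatin1(List<int> bytes) => latin1.decode(bytes);

/// Hypothetical helper: the workaround, decoding the same bytes as UTF-8.
String decodeAsUtf8(List<int> bytes) => utf8.decode(bytes);

void main() {
  // UTF-8 bytes for a body like the one in this report.
  final bodyBytes = utf8.encode('{"name": "SARA LUCIA OSSA PEÑA"}');

  // Ñ takes more than one byte in UTF-8, so reading one character per byte
  // splits it into unrelated characters.
  print(decodeAsLatin1(bodyBytes).contains('Ñ')); // false
  print(decodeAsUtf8(bodyBytes).contains('Ñ')); // true
}
```

With a real response, bodyBytes would come from response.bodyBytes rather than from utf8.encode.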

Is there a better way to handle this? An argument to get to force the encoding? Shouldn't application/json assume utf8 by default?

I came up with the solution after reading https://pub.dartlang.org/documentation/http/latest/http/Response-class.html and the description of the body property. It is probably decoding the body with the wrong encoding.

Anyway, thanks for the hard work. Awesome library.

zoechi commented 6 years ago

Good comments about this in https://stackoverflow.com/questions/9254891/what-does-content-type-application-json-charset-utf-8-really-mean

I don't think it's the HTTP client's job to do the decoding. The header is no guarantee that the content will be of that type. You can always build your own wrapper that does the UTF8-decoding for you so you don't have to repeat yourself.

cdvv7788 commented 6 years ago

@zoechi The client has to try to decode using a best-effort approach. The http client is already decoding, but using the wrong encoding. There are 2 attributes in the response, body and bodyBytes. In the same link you sent, it is mentioned:

Designating the encoding is somewhat redundant for JSON, since the default (only?) encoding for JSON is UTF-8

Knowing that the content type is json should be enough for the client to interpret it, or at least to avoid interpreting it with the wrong encoding. This is not an edge case: json is probably the most popular format for api communication at the moment. You are right on the wrapper part, but if the headers are coherent, the client should be able to handle the body properly too.

ghost commented 5 years ago

So, the situation is a bit murky, so bear with me..

The first layer in any of this is HTTP. HTTP, AFAICT, defines the default character set to be ISO-8859-1, which is why the Content-Type charset parameter exists as an explicit override. In Content-Type all parameters (charset included) are optional, but specific types may define their own required parameters.

JSON, in turn, explicitly does not define the charset parameter. JSON originally declared that it "shall be encoded in Unicode" with the default byte encoding being UTF-8. But that's a rather vague declaration, which is perhaps why they later amended it to be a requirement that JSON be encoded in UTF-8.

Next, the Dart http API defines (as mentioned above) two ways of interacting with the response data:

body → String The body of the response as a string. This is converted from bodyBytes using the charset parameter of the Content-Type header field, if available. If it's unavailable or if the encoding name is unknown, latin1 is used by default, as per RFC 2616.

bodyBytes → Uint8List The bytes comprising the body of this response.

So as discussed above, HTTP in absence of a defined charset is assumed to be encoded in ISO-8859-1 (Latin-1). And body from its description is consistent with this behaviour. If the server response sets the Content-Type header to application/json; charset=utf-8 the body should work as expected.
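The documented rule can be sketched roughly as follows. This is an illustration under assumptions, not the http package's actual code: encodingFromContentType is a hypothetical helper, and the header parsing is deliberately simplified.

```dart
import 'dart:convert';

/// Hypothetical helper, not the http package's implementation: picks an
/// encoding from a raw Content-Type value, falling back to latin1 when the
/// charset parameter is missing or names an unknown encoding.
Encoding encodingFromContentType(String? contentType) {
  if (contentType == null) return latin1;
  // Parameters follow the media type, separated by ';'.
  for (final part in contentType.split(';').skip(1)) {
    final pieces = part.split('=');
    if (pieces.length != 2) continue;
    if (pieces[0].trim().toLowerCase() != 'charset') continue;
    // Strip optional quotes around the value.
    final name = pieces[1].trim().replaceAll('"', '');
    return Encoding.getByName(name) ?? latin1;
  }
  return latin1;
}

void main() {
  // No charset: the latin1 fallback applies.
  print(encodingFromContentType('application/json') == latin1);
  // Explicit charset: UTF-8 is used.
  print(encodingFromContentType('application/json; charset=utf-8') == utf8);
}
```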

The problem, of course, is that there are servers out there that do not set charset for JSON. That is valid, but it falls into a bit of a grey area between the two specs:

A "smart" HTTP client could choose to follow the JSON definition closer than the HTTP definition and simply say any application/json is by default UTF-8 - technically violating the HTTP standard. However, the most robust solution is ultimately for the server to explicitly state the charset which is valid according to both standards.
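Such a "smart" policy might look like the sketch below. pickEncoding is a hypothetical name, and the inputs are assumed to be the already-parsed media type and charset parameter; nothing here is taken from the http package.

```dart
import 'dart:convert';

/// Hypothetical policy: an explicit, known charset wins; otherwise
/// application/json is read as UTF-8 and everything else as latin1.
Encoding pickEncoding({required String mimeType, String? charset}) {
  final explicit = Encoding.getByName(charset);
  if (explicit != null) return explicit;
  return mimeType.toLowerCase() == 'application/json' ? utf8 : latin1;
}

void main() {
  // A JSON body sent as UTF-8 without a charset parameter.
  final bytes = utf8.encode('{"name": "PEÑA"}');
  final encoding = pickEncoding(mimeType: 'application/json');
  print(jsonDecode(encoding.decode(bytes))['name']); // PEÑA
}
```

An explicit charset still takes precedence here, so a server that declares one is read the way it asks to be read.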

As for this bug, I'm inclined to say that http is working as intended, though the standards are a bit at odds with each other on this one. @cdvv7788, if you are able to, you could add charset to your Content-Type on the server, which should fix your issue. Alternatively, if you're stuck with your server as-is, I recommend you try something like the httpserver example:

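  // Requires: import 'dart:convert'; import 'dart:io';
  // _host, path and jsonData are defined elsewhere in the example.
  HttpClientRequest request = await HttpClient().post(_host, 4049, path)
    ..headers.contentType = ContentType.json
    ..write(jsonEncode(jsonData));
  HttpClientResponse response = await request.close();
  // Decode the response bytes explicitly as UTF-8.
  await response.transform(utf8.decoder).forEach(print);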

Hope it helps.

I'll close this issue assuming all open questions are resolved.

cdvv7788 commented 5 years ago

Got it. Thanks for this.

tomchristie commented 5 years ago

So as discussed above, HTTP in absence of a defined charset is assumed to be encoded in ISO-8859-1

Note that applies to "text" media types. JSON is "application/json".

Correct clients should treat JSON (or any other non-text media type) responses as bytestrings, rather than text (i.e., use response.bodyBytes).

ghost commented 5 years ago

@tomchristie, right. I also elaborated a bit on this in #186, but basically saying the same but with more words. :)

fabiocarneiro commented 5 years ago

I still believe this is bad behavior, and I added comments to #186.

renatoathaydes commented 4 years ago

This error also happens when the content-type is text/html (even when the HTML content says it's encoded as utf-8), and image/svg+xml (which also declares utf-8 in the content), for example. The fact that HTTP establishes a default encoding that's not UTF-8 is a sign of its age: today, I doubt you could do better than use UTF-8 as default for any text you get online.

gsouf commented 4 years ago

According to the standard for json, you are not actually allowed to use latin1 for the encoding of the contents. JSON content must be encoded as unicode, be it UTF-8, UTF-16, or UTF-32 (big or little endian). (https://stackoverflow.com/questions/9254891/what-does-content-type-application-json-charset-utf-8-really-mean)

I'm stuck with a server I don't control, and it does not return a header specifying that the json content is utf8. That is implicitly expected.

renatoathaydes commented 4 years ago

@cskau-g the interpretation that HTTP uses ISO for text content is outdated and that requirement has been removed from the HTTP spec:

Appendix B of RFC-7231:

 The default charset of ISO-8859-1 for text media types has been
   removed; the default is now whatever the media type definition says.

Furthermore, the relevant part of the current spec does not mention at all a default charset to be applied to textual representations for any media-type:

https://tools.ietf.org/html/rfc7231#section-3.1.1.2

The JSON RFC, meanwhile, determines that the charset when used in conjunction with application/json should have no effect:

Note:  No "charset" parameter is defined for this registration.
      Adding one really has no effect on compliant recipients.

It has also been amended to make UTF-8 mandatory in the case of data transmitted over a network, which is the primary use-case for HTTP:

https://tools.ietf.org/html/rfc8259#appendix-A

 Section 8.1 was changed to require the use of UTF-8 when
      transmitted over a network.

Section 8.1:

JSON text exchanged between systems that are not part of a closed
   ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8
   when transmitting JSON text.  However, the vast majority of JSON-
   based software implementations have chosen to use the UTF-8 encoding,
   to the extent that it is the only encoding that achieves
   interoperability.

Hopefully, this is enough to show that the currently most widely used data exchange format on the internet is not handled correctly by the Dart http package. Please consider changing this behavior, as leaving it as is will only hurt Dart's standing for no good reason.

natebosch commented 3 years ago

Reopening to track - I do think we should consider changing the defaults since most users are likely to benefit.

Note that the expected pattern to use today when you know the result is json is jsonDecode(utf8.decode(response.bodyBytes)). We should consider changing it so that jsonDecode(response.body) works as well.
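One way to avoid repeating that pattern is a small wrapper. The sketch below assumes the body really is UTF-8 JSON; decodeJsonBytes is a hypothetical helper, not something the http package provides.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes response bytes as UTF-8 and parses the JSON
/// in one step, regardless of what the Content-Type header says.
Object? decodeJsonBytes(List<int> bodyBytes) =>
    utf8.decoder.fuse(json.decoder).convert(bodyBytes);

void main() {
  // Stand-in for response.bodyBytes.
  final bodyBytes = utf8.encode('{"name": "SARA LUCIA OSSA PEÑA"}');
  final data = decodeJsonBytes(bodyBytes) as Map<String, dynamic>;
  print(data['name']);
}
```

Fusing the two decoders is equivalent to jsonDecode(utf8.decode(bodyBytes)) and skips the intermediate variable.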

natebosch commented 2 years ago

Changing the default for all responses, or even for non-text responses, is breaking. At least one internal usage is impacted.

Changing the default only when the content type is application/json is more narrow and may be safer.

fabiocarneiro commented 2 years ago

As stated before in https://github.com/dart-lang/http/issues/186, this behavior is wrong and should be corrected. It doesn't matter if it breaks bc or not. Release a new major if that is necessary.

In 2018 we were talking about this with a lot of effort on explaining HTTP and it was just ignored. If it was taken into consideration at the time, everything would have been adopted today. How many years more do we need to wait?

crimsonvspurple commented 2 years ago

I believe both FF and Chrome have, for quite a while, treated application/json as utf-8 by default (e.g., https://bugzilla.mozilla.org/show_bug.cgi?id=741776 ).

Some systems, such as Spring Boot, have even deprecated application/json; charset=utf-8 ( https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/http/MediaType.html#APPLICATION_JSON_UTF8_VALUE ).

Processing JSON as non-UTF8 by default makes no sense. Please make utf-8 the default. Thank you.

0xNF commented 2 years ago

I'd like to raise this issue again -- like @renatoathaydes noted, RFC7231 (circa 2014) supersedes 2616 (circa 1999) to make interpretation of application/json as anything except utf-8 an incorrect implementation of the specification.

I understand the suggested way to access json data from a response is to use the jsonDecode(utf8.decode(response.bodyBytes)) pattern, but this part of the implementation not only bites dart beginners who don't know that particular piece of lore, but is also flatly wrong from the perspective of modern RFC compliance.

miDeb commented 10 months ago

Hi, is there any status update on making utf8 the default for decoding json responses? It's not fun to discover that jsonDecode(response.body) is not standards compliant and should always have been jsonDecode(utf8.decode(response.bodyBytes)) everywhere in our application. Maybe the addition of a .json getter on http.Response that does the right thing could also be a possible improvement.

0xNF commented 10 months ago

You should be using response.bodyBytes instead of response.body, because the latter will try to decode into a string, which may cause exceptions that you aren't expecting.

miDeb commented 10 months ago

Thanks @0xNF for the correction, I mistyped (it wouldn't have made sense to call utf8.decode(response.body), as that wouldn't even compile).

daenney commented 5 months ago

RFC 8259, the current RFC reference for application/json in the IANA Media Type Registry, obsoletes 7159 and states in section 8.1, Character Encoding:

JSON text exchanged between systems that are not part of a closed
ecosystem MUST be encoded using UTF-8 [[RFC3629](https://www.rfc-editor.org/rfc/rfc3629)].

Previous specifications of JSON have not required the use of UTF-8
when transmitting JSON text. However, the vast majority of JSON-
based software implementations have chosen to use the UTF-8 encoding,
to the extent that it is the only encoding that achieves
interoperability.

renatoathaydes commented 5 months ago

@daenney I mentioned this almost 4 years ago: https://github.com/dart-lang/http/issues/175#issuecomment-619593544

I suspect Google would have too much work to do if this was changed, hence it will probably stay as it is even when it's clearly failing to follow the specs.