Misinterpretation of rfc2616 in response.dart

dart-lang / http

A composable API for making HTTP requests in Dart.

https://pub.dev/packages/http

BSD 3-Clause "New" or "Revised" License

1.01k stars 352 forks source link

Misinterpretation of rfc2616 in response.dart #186

Closed fabiocarneiro closed 6 years ago

fabiocarneiro commented 6 years ago

Dart http is annoyingly defaulting to ISO-8859-1 when charset parameter is not present in Content-Typeheader (Content-Type: application/json) The explanation is provided in response.dart and I quote here:

/// The body of the response as a string. This is converted from [bodyBytes] /// using the charset parameter of the Content-Type header field, if /// available. If it's unavailable or if the encoding name is unknown, /// [LATIN1] is used by default, as per [RFC 2616][]. /// /// [RFC 2616]: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html

The section 3.7.1 Canonicalization and Text Defaults talks specifically about subtypes of TEXT, I quote here:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

Since application/json is NOT subtype of text, the above rule does NOT apply.

The section 3.7 Media Types says the following:

...implementations SHOULD only use media type parameters when they are required by that type/subtype definition.

that means that each type must define if a parameter like charset would be required, and to know about application/json we need to check the IANA definition, which can be found here.

As there is no other rule in rfc2616 that deals particularly with application subtypes, and the media type definition does not even mention the charset in the optional parameters (only in the note in the end), I think relying on this being present is a really bad idea. I think most of the http libraries actually default to UTF-8, and most of the communication that happens using this media type uses UTF-8, so a saner option could potentially to default also to UTF-8.

ghost commented 6 years ago

As you say, there is some interpretation involved when reading RFCs so the way I see it ambiguity is the problem here.

I've just posted a write-up on #175 for a very similar issue, but let me elaborate specifically about the RFCs you quote.

I agree that the first part of the quote only concerns text/* media type:

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Source: RFC 2616

However the next sentence says:

Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems. Source: RFC 2616

Which I interpret to say that media types that are not ISO-8859-1 text must set charset - which is to say ISO-8859-1 is the default if no charset is set.

The later RFC 7231 goes on to amend RFC 2616 with:

The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says. Source: RFC 7231

Which I frankly don't think helps clarifying much since the IANA entries for e.g. text/html don't specify any default encoding, AFAICT.

As also discussed in #175 JSON was also later on amended to use UTF-8 by default in RFC 8259, but whether that retroactively affects the IANA media type is unclear to me.

When all is said and done, as you mention, behaviour when handling anything other than text/* is simply not defined in the HTTP/1.1 RFCs. HTTP is largely concerned with transport of data, not how to process it - especially the thousands application specific media types, which application/json is one of.

As also discussed further on #175 Dart http implements two interfaces on the response, body and bodyBytes. The behaviour of body is only really defined in HTTP for text/* media types, hence not for application/json. So for JSON, etc. bodyBytes is really the only meaningful interface. If you need a starting point you can check out #175 for an example of how to implement JSON handling in your application.

I hope that clarifies why we're doing what we're doing.

fabiocarneiro commented 6 years ago

However the next sentence says:

The title of the section is "Canonicalization and Text Defaults". This section is talking particularly about media subtypes of the "text". Nothing in this paragraph applies to application/json, which is the particular media type I believe the http module is incorrectly parsing.

If you open the IANA definition for application/json, there are two crutial things to observe. The media type was defined by rfc8259 and the latest observation in the document which I quote here:

Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients.

The rfc8259 talks about character encoding in section 8.1:

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

To use application/json you MUST obey rfc8259, and consequently you MUST use UTF-8 in any API that is public. The usage of application/json alone already implies the UTF-8 encoding and therefore there is no need for a charset parameter. Any complient client that receives the charset parameter, even when different from UTF-8 should simply ignore it and deal with the data as it was UTF-8. ...

but whether that retroactively affects the IANA media type is unclear to me.

I interpret to say that media types that are not ISO-8859-1 text must set charset

In my opinion, you can't interpret that the charset needs to be present when its clear in the media type definition its not even an optional parameter. It is also not cool to make the automatic charset parsing to something that never was defined to be the defaults for this particular media type, probably inconvenient to the majority of the users since most of the libraries in other languages I've tried so far behave differently (and there is even a concrete example with Postman on #175) .

The behaviour of body is only really defined in HTTP for text/* media types

This was unclear for me. Do you mean the current implementation only applies to text subtypes? It doesn't seem to be the case as the application/json is also being encoded as ISO-8859-1 (and therefore we have to convert). If this is what you mean, could you point me to the code where this exception about only handling text subtypes is defined?

I was able to solve the issue in application level, but I don't believe its the right place to do it. My goal here is more than solving the issue I had at hands.

I also don't think the discussion is over and therefore the issue should be reopened. As you saw from the other issue this will be very common because there is a large number of APIs that don't provide the charset parameter and are not required to. I'm not fully aware of the impact, but It might be nice to consider that since this package didn't reach stable version (1.0) yet, this might be the perfect time to make such a change before even more applications rely on this library.

One thing that I'm very sure is about is that at least the comment in response.dart is incomplete or misleading. It implies that the specification says something it doesn't.

ghost commented 6 years ago

As said, "there is some interpretation involved", and I absolutely do not mean to say my interpretation is necessarily the one and only interpretation.

However, to me the central point here is that http implements HTTP as per the likes of RFC 2616. It does not implement RFC 8259. JSON is, from the perspective of HTTP, an application specific format - it just happens to be frequently used over HTTP and just happens to be text based. HTTP is a transport protocol, and it only defines some processing of some subsets of media types (namely text/*). Put another way, HTTP doesn't use JSON, applications use HTTP to transport JSON. As you point out above it's then the responsibility of the application to handle the JSON bytes correctly.

As such http implements the HTTP standard in bodyBytes by giving users the bytes transported. Additionally http also provides a convenience function, body to handle the special case of text/* media types as per RFC 2616.

Ultimately, the HTTP standard does not define how client implementations should handle anything but text/*. It does however say quite clearly how to handle media types with the charset parameter:

HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. Source: RFC 2616

For that reason I do believe the documentation on body is correct.

I hope that makes sense.

fabiocarneiro commented 6 years ago

As you point out above it's then the responsibility of the application to handle the JSON bytes correctly.

I agree with that, but I wouldn't have to anything in my application layer if the body was being provided as is, instead of automatically processed to latin1. Since http is doing something that is not necessary, I also have to do something that is not necessary. The library is messing up with my data.

If it's unavailable or if the encoding name is unknown, latin1 is used by default, as per RFC 2616.

There is nothing in RFC 2616 that supports the text above from body. RFC 2616 refers only to the case when the sender provides the charset. Most of the senders using application/json will be using RFC 8259, and if they are compliant they will NOT provice a charset.

I'm guessing another reason for resisting this change would be: https://github.com/dart-lang/http/blob/5f0d5579d27501aa23e04a0f02d7f6d27b1098d0/lib/src/response.dart#L27

The fact the body is stored represented in Uint8List forces a later charset decode. My knowledge is limited here, and my question is if this is really neccessary and there isn't a way to represent in a format that doesn't require later processing.

The fuctions are just trying to figure out which charset should be used. Instead of doing that, we would be able to apply charset decode ONLY in two cases:

There was a charset provided
No charset was provided by the sender AND the media type is a subtype of text.

ghost commented 6 years ago

The response is internally represented as Uint8List, i.e. a sequence of bytes, since that is what HTTP transports (though the standard calls them "octets"). What comes over the wire when using HTTP is always bytes. Treating response bodies as a blob of bytes is always correct as per RFC 2616. And this is why http offers Response.bodyBytes in its API.

The contentious point then is that RFC 2616 goes on to define special handling for text/* media types, which says you can decode the byte blob as text encoded with either the explicitly given or the default character set. This is what http's Response.body implements.

The behaviour of Response.body for text/* media types is correct and the comment on the method in the context of text/* is also correct. The confusion, as I see it, is that many users assume that Response.body is also sound for application/json, which it is not. By design.

I understand that to many users having good support for JSON would be really convenient. But the thing I'm pushing back against is modifying the API to be non-compliant with HTTP for the sake of convenience.

That said I do concede there are things we could do to try make http more helpful for users wanting to handle JSON:

To clear up the common misunderstanding we could add to the comment on Response.body to explicitly say that it is not sound for anything other than text/* media types.
Add usage examples for JSON to clearly demonstrate correct handling of this common case.
Maybe add something like Response.bodyJSON to the API to special case that media type for the sake of convenience.
Maybe even go as far as add an exception to Response.body if called on any media type other than text/*.

fabiocarneiro commented 6 years ago

But the thing I'm pushing back against is modifying the API to be non-compliant with HTTP for the sake of convenience.

I'm not proposing that and I also don't think its a good idea. I like to have the things where they belong and you are right that RFC 8259 might not belong there. That means I also don't believe its a good idea to have something like bodyJSON. If no better solution can be given however, I'd rather extract the logic of handling the encode from the response class and allow different logic to be applied depending on the media type.

The contentious point then is that RFC 2616 goes on to define special handling for text/* media types,

I see that, but the specification never says this should be applied as a default to ANY media type. The tricky thing here is the Uint8List representation, which you necessarily need to encode to be able to have the body as String. How to create a strategy that works for other media types without being a hassle to the developers?

Again in all the experience I had so far with other languages and libraries, I've never had the case I had to decode latin1 to utf8 in the application layer when using JSON. I would highlight particularly PHP, and although you might have strong opinions on the language, the way HTTP is handled in PHP was actually a community-driven effort and it lives outside of the language. I believe the interfaces designed there might be a useful resource for inspiration to this library (for example, I think response should not have request as property).

I understood that the representation in bytes is due to the nature of what http transports, but its clear this library is handling the charset detection and body encoding in a different way than the rest. I'm trying to do aditional research to be able to come up with a suggestion.

PS: Did you happen to come across the observation on top of this w3c document? Maybe those could be used as reference instead?

ghost commented 6 years ago

Ah, yes, you should never have to convert from bytes to latin1 to UTF-8. Especially because that is potentially lossy. Application developers are always able to extract handling of the body byte blob via something like:

import 'package:http/http.dart' as http;
import 'dart:convert';

var client = new http.Client();
client.get(myUrl)
    .then((response) => utf8.decode(response.bodyBytes)).then((bodyStr) {
      print(bodyStr);
    });

Developers wanting to use yet other media types will likewise have to implement their own application specific handling outside of http.

cdvv7788 commented 6 years ago

I agree with several points in the conversation:

application/json should not be treated as text/*
There needs to be more information about application/json, because that is widely used in the web

I checked how python requests works for this and they expose several options, like the fields raw and text, and functions like json(). json() is just a wrapper and may fail (raise an exception) but it is very convenient. Unlike requests, urllib is very barebones and leaves the decoding completely to the users. Probably that's the reason why requests is preferred in many cases.

I think that it all comes down to giving the user the tools to work effectively without going against the stablished standards. Compliant json servers do not have the charset and compliant clients should not decode utf-8 by default (no charset is present) as that is not written anywhere. Clients, however, could provide the tools to make this painless and avoid unnecesary verbosity. What you proposed about examples on usage could be a good start point. About adding a bodyJSON field, I think that a method wrapping conversion could be better. With the method we would avoid adding complexity on the client (logical branches to decide if that field should be populated or not) and just leave the decision to the user.

fabiocarneiro commented 6 years ago

Can the issue be reopened?

I suggest two paths here:

Body is removed and everyone has to rely on bodyBytes. Its explicit to everyone that there MUST be a byte encoding process at the application level, no matter which media type is being used.
The code that decides which encode to apply to bodyBytes and fill body is extracted from the response class and separate logic is provided for the most common media type (text subtypes, json..). It would be highly extensible with new media types being added accordingly to the contributor needs. This approach would be good for the future media types and be the most convenient to the user.

I tend to prefer the second one for the sake of convenience and single api, but I'd be satisfied with the fairness of the first.

fabiocarneiro commented 6 years ago

@cskau-g ?

ghost commented 6 years ago

I encourage you to open an issue for the improvements to the docs if you'd like to follow up on that.

This issue however will most likely remain closed as the original premise remains contended.

To address your two suggestions above:

Removing Reponse.body is not an option as that would be a breaking change - it'd break the code of all the users who're correctly using this method.
Likewise, moving the body logic elsewhere would most likely break existing users. But if we ignore that for a bit, I don't personally see the big difference between factoring body + a potential bodyJSON out, as opposed to simply having application layers implement their handling separate anyway. But as this is open source you're more than welcome to create a concrete proposal if you'd like to start a discussion around that.

Thanks

fabiocarneiro commented 6 years ago

Removing Reponse.body is not an option as that would be a breaking change - it'd break the code of all the users who're correctly using this method.

It could be deprecated and removed in the next major and the library didn't reach 1.0 yet. I think this library is following Semantic versioning, and as per its rules it is ok to break backwards compatibility at any time before 1.0. People should not be relying on the library on 0.x.

Likewise, moving the body logic elsewhere would most likely break existing users. But if we ignore that for a bit, I don't personally see the big difference between factoring body + a potential bodyJSON out, as opposed to simply having application layers implement their handling separate anyway. But as this is open source you're more than welcome to create a concrete proposal if you'd like to start a discussion around that.

This is again my recommended and favorite approach. If you are concerned about technical efforts it is similar, but there is a big conceptual difference between extracting the behavior related to each media type to a separate class and adding thousands of bodyX methods to this file.

Anyway, I propose to make the changes necessary myself. I'm just asking before so I don't waste my time, so it would be nice to get an agreement from you.