OAI / OpenAPI-Specification

The OpenAPI Specification Repository
https://openapis.org
Apache License 2.0
28.91k stars 9.07k forks source link

Consider using RFC3986 encode form-urlencoded #1778

Closed chanjarster closed 3 months ago

chanjarster commented 5 years ago

3.0.2 spec section Support for x-www-form-urlencoded Request Bodies:

To submit content using form url encoding via RFC1866, ...

And, RFC1866 section 8.2.1 The form-urlencoded Media Type:

The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped as per [URL]

Indicates that using encoding specified in RFC1738 and a special treatment on space character

While according to HTML 5.2 section 4.10.21.6 URL-encoded form data:

See the WHATWG URL specification for details on application/x-www-form-urlencoded. [URL] ...

Which says:

Note: URLs can be used in numerous different manners, in many differing contexts. For the purpose of producing strict URLs one may wish to consider [RFC3986] [RFC3987] ...

So, please consider using RFC3986 instead of RFC1738, RFC1738 is too old(December 1994).

handrews commented 2 years ago

I believe the only difference is that /, ?, and ';' are no longer reserved within the query string in RFC 3986, correct? (And I agree, it should reference RFC 3986).

chanjarster commented 2 years ago

I believe the only difference is that /, ?, and ';' are no longer reserved within the query string in RFC 3986, correct? (And I agree, it should reference RFC 3986).

Yes

vinniefalco commented 2 years ago

Does anyone have a reference to the BNF/grammar for the "key=value" pairs? I wrote this:

query-params    = query-param *( "&" query-param )
query-param     = key [ "=" value ]
key             = *qpchar
value           = *( qpchar / "=" )

but is there an official IETF document (or addendum) that states this?

handrews commented 8 months ago

@vinniefalco there is no IETF document. There was a draft, but it was abandoned. WHATWG addresses it in their URL "living standard", although I assume like all of their specs it's defined in terms of browser implementation state machines as opposed to ABNF or anything like that.

vinniefalco commented 8 months ago

Thanks! (I find the WHATWG "spec" exceptionally frustrating)

handrews commented 8 months ago

It's not clear to me whether we can change an RFC reference in a patch release, so for now I'm putting this in 3.2.

karenetheridge commented 5 months ago

FWIW, RFC3986 is what JSON Schema's "format": "uri-reference" uses (which appears in the openapi schema a few times).

handrews commented 5 months ago

@karenetheridge yeah I think the chances that anyone else even realizes the difference, much less implements it, is pretty slim. Arguably, since RFC3986 is formally listed as updating RFC 1738 (the URL encoding that RFC1866 references), it should be automatically understood that 3986 supersedes the outdated requirements.

I'm not sure whether we should do anything here- it's really getting into the weeds, and I think we can rely on the IETF's Update/Obsolete linkage. Maybe. I hope?

handrews commented 5 months ago

I noticed (with the discussion around urlencoding { and }) that our examples do not urlencode ~, which would be required by RFC 1738 but is not required by RFC 3986. Of course, we weren't urlencoding that example correctly anyway, but... I don't think anyone at any time meant we should urlencode ~ just because RFC 1738 is very indirectly referenced. We reference 3986 normatively elsewhere.

handrews commented 5 months ago

@OAI/tsc review request: Is it worth doing anything here? To summarize:

Do we need to specifically note that our reference to RFC 1886 should incorporate RFC 3986 rules rather than RFC 1738 rules, or is that to be expected automatically because RFC 3986 formally obsoletes RFC 1738 already?

In terms of practical impact, I need to know for OASComply whether to validate using the RFC 1738 rules or the RFC 3986 rules.

ralfhandl commented 5 months ago

The WhatWG URL spec requires ~ to be percent-encoded:

[!NOTE] The application/x-www-form-urlencoded percent-encode set contains all code points, except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.), and U+005F (_).

mikekistler commented 5 months ago

I would be in favor of updating the RFC reference in the 3.2. My meager understanding of this topic is that the RFC switch is not simply a "clarification" so I don't think it could be done in a patch version on earlier versions.

handrews commented 5 months ago

@ralfhandl Ugh. The previous line in the WHATWG spec is probably the better one to quote here:

The application/x-www-form-urlencoded percent-encode set is the component percent-encode set and U+0021 (!), U+0027 (') to U+0029 RIGHT PARENTHESIS, inclusive, and U+007E (~).

However, that is only relevant when running the WHATWG application/x-www-form-urlencoded serializing algorithm. The WHATWG application/x-www-form-urlencoded parsing algorithm will, AFAICT, happily accept a literal ~, because their percent-decode algorithm copies bytes verbatim unless it sees a %.

My feeling is that if people want to use WHATWG's serialization rules, that's on them. It's not our responsibility to keep up with their "living standard" or sort out the grammatical (in the ABNF sense) implications of their pseudocode. They have, historically, been aggressively hostile to accommodating anyone else's use cases or concerns, for example refusing to support relative URI-references because there aren't enough "browser-related use cases".

As we are an API specification, I feel comfortable ignoring people who have made it clear that they have no interest in supporting APIs as used by non-browsers.

handrews commented 5 months ago

@mikekistler if it's really a change, I agree with you. My question is, "is this even a change?" I can't find a clear explanation from the IETF as to whether an RFC that formally obsoletes another RFC automatically takes effect, or whether it requires an updated citation.

We don't even cite RFC 1738 directly, so there's no citation to replace. We do normatively cite RFC 3986, but the way we get to 1738 is:

All the OAS citations are many years after 3986, so it was already clear that those rules were doubly-obsolete when we cited 1866.

For that matter, it's not entirely clear to me that RFC 1866 really requires percent-encoding those extra characters. §8.2.1 The form-urlencoded Media Type states:

  1. The form field names and values are escaped: space characters are replaced by '+', and then reserved characters are escaped as per [URL]; that is, non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character.

Oddly enough, this mentions percent-encoding reserved characters, rather than unsafe characters. There's no errata for this RFC, so this is the text in question.

§2.2 URL Character Encoding Issues includes this text in a paragraph with the sub-heading "Unsafe":

Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".

In that same section, under the sub-heading "Reserved":

Many URL schemes reserve certain characters for a special meaning: their appearance in the scheme-specific part of the URL has a designated semantics. If the character corresponding to an octet is reserved in a scheme, the octet must be encoded. The characters ";", "/", "?", ":", "@", "=" and "&" are the characters which may be reserved for special meaning within a scheme. No other characters may be reserved within a scheme.

The BNF in §5 also does not place the characters in question in the reserved production:

safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
national       = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation    = "<" | ">" | "#" | "%" | <">

reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex            = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                 "a" | "b" | "c" | "d" | "e" | "f"
escape         = "%" hex hex

unreserved     = alpha | digit | safe | extra
uchar          = unreserved | escape
xchar          = unreserved | reserved | escape

The term "national" is not used anywhere else in the RFC, nor do any other BNF rules depend on it. Which means the national characters are not part of uchar or xchar (the relevant productions for what characters are legal).

But RFC 1866 only mentions "reserved" characters. Not "unsafe" characters, or "characters outside of the legal set."


So... my feeling, after spending way too much time reading specs for this, is that anyone who is going to be legalistic enough to complain about RFC 1738 being cited by RFC 1866 can be countered by the fact that the most direct reading of RFC 1866 doesn't require encoding the national set of characters in the first place.

handrews commented 5 months ago

I'm now working on a PR describing the URL-encoding processes for query strings and request bodies in detail, and one way or another this will get addressed. So I'm taking the review label off.

handrews commented 3 months ago

PR merged for 3.0.4 and ported to 3.1.1 via PR #3921! This is now addressed by the new Appendix E. I'm not sure how happy it will make anyone, as the topic of percent-encoding is a mess with conflicting requirements depending on whose spec you read, but we've at least laid all of that out and called attention to the possible interoperability issues.