URL parsing of an invalid host has a defect

AttentionZ / google-url

Automatically exported from code.google.com/p/google-url

URL parsing of an invalid host has a defect #32

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

http://™%00/ mangles the input. The input is considered invalid, but instead of 
being echoed literally it is first encoded as UTF-8 and then converted back to 
code points using byte inflation rather than UTF-8 decoding.

Original issue reported on code.google.com by annevankesteren on 4 Nov 2012 at 2:20
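
The difference between "byte inflation" and a real UTF-8 decode can be shown with a small sketch. This is only an illustration of the mangling described in the report, not google-url's code; the helper names `utf8_then_inflate` and `utf8_round_trip` are made up for this example.

```python
def utf8_then_inflate(text: str) -> str:
    """Encode to UTF-8, then widen each byte to its own code point.

    This skips the UTF-8 decode step, so one non-ASCII character
    turns into several unrelated code points.
    """
    return "".join(chr(byte) for byte in text.encode("utf-8"))


def utf8_round_trip(text: str) -> str:
    """Encode to UTF-8 and decode it again, which gives the input back."""
    return text.encode("utf-8").decode("utf-8")


original = "\u2122"  # the trademark sign used in the report
mangled = utf8_then_inflate(original)
restored = utf8_round_trip(original)

# The trademark sign is three bytes in UTF-8, so inflation yields three code points.
print([hex(ord(ch)) for ch in mangled])
print(restored == original)
```

Echoing the input literally would correspond to the round trip, where the host text comes back unchanged.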

GoogleCodeExporter commented 9 years ago

abarth told me to file this here (I tested this in Chrome).

Original comment by annevankesteren on 4 Nov 2012 at 2:21

GoogleCodeExporter commented 9 years ago

Your description sounds right to me. What do you expect it to be and why?

Original comment by brettw@chromium.org on 4 Nov 2012 at 3:42

GoogleCodeExporter commented 9 years ago

The result is not consistent with e.g. http://™%80/ and http://™%20/ or any 
other browser. (Still working on defining what I expect it to be in 
http://url.spec.whatwg.org/ but I thought I'd report this anyway.)

Original comment by annevankesteren on 4 Nov 2012 at 10:51

GoogleCodeExporter commented 9 years ago

The behavior is very well-defined.

We first try to make the URL valid. This includes unescaping, etc. So first we 
would convert the %80 to 'P', for example. We then convert to punycode. We 
also handle the case where the host is in escaped UTF-8. For example, 
"%ef%bc%85%ef%bc%94%ef%bc%91.com" becomes "a.com" because that escape sequence 
is UTF-8 encoded full-width "%41", which unescapes to 'A' and is then lowercased.

You can see the host canonicalization unit test examples here:
http://code.google.com/p/google-url/source/browse/trunk/src/url_canon_unittest.cc
Search for "TEST(URLCanonTest, Host)"

We special-case a few characters, such as space, for IE compatibility. Space is 
not an allowed hostname character, so we canonicalize it to an escaped %20.

™ is not a valid host character so I would expect it to be rejected. For 
invalid characters we encode them as UTF-8 and escape them to make an ASCII 
string, and mark the result invalid.

Original comment by brettw@chromium.org on 4 Nov 2012 at 5:24
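
The steps described in this comment can be approximated in a few lines. This is a rough sketch under stated assumptions, not the google-url implementation: NFKC normalization from Python's `unicodedata` stands in for whatever mapping ICU applies, and the function names `sketch_canonicalize_host` and `sketch_escape_invalid` are hypothetical. The bytes in the quoted example decode to the full-width characters for "%41", which is why a second unescape pass is needed.

```python
import unicodedata
from urllib.parse import quote, unquote


def sketch_canonicalize_host(host: str) -> str:
    # Step 1: unescape %XX sequences, reading the resulting bytes as UTF-8.
    unescaped = unquote(host, encoding="utf-8", errors="strict")
    # Step 2: fold compatibility characters (e.g. full-width forms) to ASCII.
    # Assumption: NFKC approximates the mapping the real code gets from ICU.
    normalized = unicodedata.normalize("NFKC", unescaped)
    # Step 3: normalization may have produced a new ASCII escape, so unescape
    # once more, then lowercase the host.
    return unquote(normalized).lower()


def sketch_escape_invalid(text: str) -> str:
    # Encode as UTF-8 and percent-escape every byte that is not unreserved,
    # giving an ASCII-only string as described for invalid characters.
    return quote(text, safe="")


canonical = sketch_canonicalize_host("%ef%bc%85%ef%bc%94%ef%bc%91.com")
escaped_tm = sketch_escape_invalid("\u2122")
escaped_space = sketch_escape_invalid(" ")
print(canonical, escaped_tm, escaped_space)
```

The sketch leaves out the validity flag and the punycode step. It only shows how an escaped UTF-8 host can collapse to plain ASCII and what the escaped ASCII form of an invalid character looks like.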

GoogleCodeExporter commented 9 years ago

How can you convert %80 to 'P'? That makes no sense whatsoever. %80 is outside 
the ASCII range and not a valid utf-8 byte at that location.

™ gets turned into "tm" per IDNA2003 and into a Punycode string per IDNA2008. 
I'm not sure why you'd say it's invalid.

Original comment by annevankesteren on 5 Nov 2012 at 8:37
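
Both points in this comment can be checked with Python's standard library, as a quick sketch rather than a statement about what Chrome or ICU does. Python's built-in `idna` codec follows IDNA2003 (RFC 3490), so it is used here as a stand-in, and the variable names are illustrative only.

```python
# IDNA2003 mapping: nameprep applies compatibility normalization and case
# folding before the label is checked, so the trademark sign maps to ASCII.
label = "\u2122"
ascii_label = label.encode("idna")
print(ascii_label)

# A lone 0x80 byte is a UTF-8 continuation byte, so it cannot start a
# character and a strict decoder rejects it.
try:
    b"\x80".decode("utf-8")
    lone_80_is_valid = True
except UnicodeDecodeError:
    lone_80_is_valid = False
print(lone_80_is_valid)
```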

GoogleCodeExporter commented 9 years ago

Re: %80: I looked it up in the ASCII chart and got hex and dec confused. You're 
probably right about TM; hopefully ICU does that transform.

Original comment by brettw@chromium.org on 5 Nov 2012 at 6:32
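
The hex/decimal mix-up is easy to reproduce: 'P' is code point 80 in decimal, which is 0x50, while the escape %80 names the byte 0x80 (128 in decimal). A short check, using only standard-library calls:

```python
from urllib.parse import unquote_to_bytes

# Decimal 80 is 'P', and its hexadecimal form is 0x50.
letter_for_decimal_80 = chr(80)
hex_of_decimal_80 = hex(80)

# Percent-escapes are hexadecimal: %50 is the byte for 'P', %80 is byte 128.
byte_for_percent_50 = unquote_to_bytes("%50")
byte_for_percent_80 = unquote_to_bytes("%80")

print(letter_for_decimal_80, hex_of_decimal_80)
print(byte_for_percent_50, byte_for_percent_80)
```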