Open GoogleCodeExporter opened 9 years ago
abarth told me to file this here (I tested this in Chrome).
Original comment by annevankesteren
on 4 Nov 2012 at 2:21
Your description sounds right to me. What do you expect it to be and why?
Original comment by brettw@chromium.org
on 4 Nov 2012 at 3:42
The result is not consistent with e.g. http://™%80/ and http://™%20/ or any
other browser. (Still working on defining what I expect it to be in
http://url.spec.whatwg.org/ but I thought I'd report this anyway.)
Original comment by annevankesteren
on 4 Nov 2012 at 10:51
The behavior is very well-defined.
We first try to make the URL valid. This includes unescaping, etc. So first we
would convert the %80 to 'P' for example. We then convert to punycode. We
handle the case where the host is in escaped UTF-8. So e.g.
"%ef%bc%85%ef%bc%94%ef%bc%91.com" becomes "a.com" because that escape sequence
is UTF-8 encoded full-width "%41", which in turn unescapes to 'A' and is
lowercased to 'a'.
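As a rough illustration of the unescape-then-normalize order described above, here is a minimal Python sketch. It is not the google-url implementation: the function name `sketch_canonicalize_host` is made up, NFKC normalization stands in for whatever mapping the real code uses, and the Punycode step is left out entirely.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


def sketch_canonicalize_host(host: str) -> str:
    """Hypothetical helper: unescape, normalize, unescape again, lowercase."""
    # Step 1: percent-decode to raw bytes and interpret them as UTF-8.
    decoded = unquote_to_bytes(host).decode("utf-8")
    # Step 2: compatibility normalization maps full-width forms to ASCII.
    normalized = unicodedata.normalize("NFKC", decoded)
    # Step 3: normalization can expose new ASCII escapes, so unescape again.
    unescaped = unquote_to_bytes(normalized).decode("utf-8")
    # Step 4: host names are case-insensitive, so lowercase the result.
    return unescaped.lower()


result = sketch_canonicalize_host("%ef%bc%85%ef%bc%94%ef%bc%91.com")
print(result)
```

With the host from the example above, the bytes decode to three full-width characters that normalize to an ASCII escape, which the second unescape pass resolves.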
You can see the host canonicalization unit test examples here:
http://code.google.com/p/google-url/source/browse/trunk/src/url_canon_unittest.cc
Search for "TEST(URLCanonTest, Host)"
We special-case a few characters, such as space, for IE compatibility. Space is
not an allowed hostname character, so we canonicalize it to an escaped %20.
™ is not a valid host character so I would expect it to be rejected. For
invalid characters we encode them as UTF-8 and escape them to make an ASCII
string, and mark the result invalid.
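The "encode as UTF-8, escape, and mark invalid" step could be sketched in Python roughly as follows. The `ALLOWED` set and the function name `escape_invalid_host` are assumptions for illustration only. The real set of permitted host characters lives in the canonicalizer, and this sketch assumes the input has already been lowercased.

```python
from typing import Tuple

# Hypothetical set of allowed host characters, for illustration only.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-.")


def escape_invalid_host(host: str) -> Tuple[str, bool]:
    """Return (ASCII-only host, validity flag)."""
    valid = True
    pieces = []
    for ch in host:
        if ch in ALLOWED:
            pieces.append(ch)
        else:
            # Encode the character as UTF-8 and percent-escape every byte.
            pieces.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
            valid = False
    return "".join(pieces), valid


escaped, ok = escape_invalid_host("\u2122")  # U+2122 is the trade mark sign
print(escaped, ok)
```

The output is always ASCII, and the flag records that at least one character had to be escaped.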
Original comment by brettw@chromium.org
on 4 Nov 2012 at 5:24
How can you convert %80 to 'P'? That makes no sense whatsoever. %80 is outside
the ASCII range and not a valid utf-8 byte at that location.
™ gets turned into "tm" per IDNA2003 and into a Punycode string per IDNA2008.
I'm not sure why you'd say it's invalid.
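Both points can be checked with a short Python sketch. It uses NFKC normalization plus lowercasing as a stand-in for the IDNA2003 mapping step, which is an approximation and not a full Nameprep implementation. The helper name `is_valid_utf8` is made up for this example.

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical helper: report whether the bytes decode as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0x80 is a continuation byte, so it cannot stand alone or follow
# an already complete character such as the trade mark sign.
lone_byte_valid = is_valid_utf8(b"\x80")
after_tm_valid = is_valid_utf8("\u2122".encode("utf-8") + b"\x80")

# Compatibility normalization decomposes U+2122 into "TM";
# lowercasing then gives the IDNA2003-style result.
mapped = unicodedata.normalize("NFKC", "\u2122").lower()
print(lone_byte_valid, after_tm_valid, mapped)
```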
Original comment by annevankesteren
on 5 Nov 2012 at 8:37
Re: %80: I looked it up in the ASCII chart and got hex and decimal confused.
You're probably right about TM; hopefully ICU does that transform.
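For reference, the hex/decimal mix-up is easy to reproduce in a few lines of Python (the variable names are illustrative only):

```python
# Decimal 80 and hexadecimal 0x50 are the same code point: the letter 'P'.
decimal_80 = chr(80)
hex_50 = chr(0x50)

# The digits in a percent-escape are hexadecimal, so %80 means 0x80.
escape_value = int("80", 16)

# ASCII ends at 127, so the escaped byte falls outside the ASCII range.
is_ascii = escape_value < 128
print(decimal_80, hex_50, escape_value, is_ascii)
```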
Original comment by brettw@chromium.org
on 5 Nov 2012 at 6:32
Original issue reported on code.google.com by
annevankesteren
on 4 Nov 2012 at 2:20