ericmckean / google-url

Automatically exported from code.google.com/p/google-url
Other
0 stars 0 forks source link

IPv6 string literals need canonicalization #2

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Call CanonicalizeIPAddress() with "[2001:db8::1]"
2.
3.

What is the expected output? What do you see instead?
Expected: "[2001:db8::1]"
Actual: "[2001%3Adb8%3A%3A1]"

Please use labels and text to provide additional information.
Attached is a draft patch that only _half_ addresses the problem, at best.

[src/url_canon_ip.cc]
    * CanonicalizeIPAddress() calls new DoCanonicalizeIPv6Address() if
DoCanonicalizeIPv4Address() returns false.

    * New DoCanonicalizeIPv6Address() adds only basic IPv6 address validity
checks, then copies the input host component to the output.  The OS will
have to cope with the rest after that.  This is non-ideal (no
canonicalization is performed!) but it does allow basic IPv6 string literal
handling to proceed to the best of the calling applications ability.

[src/url_canon_unittest.cc]
    * New TEST(URLCanonTest, IPv6) test method.  Basic checks only.  Some
invalid strings will succeed.  A few of these cases are noted in the
comments within the test.

Note that callers need to be aware that .host() can return a string like
"[2001:db8::1]" (and not just the IPv6 address inside, i.e. never just
"2001:db8::1").  I'm not sure how to best warn users or protect them...

Original issue reported on code.google.com by ekl...@gmail.com on 12 Sep 2008 at 10:21

Attachments:

GoogleCodeExporter commented 9 years ago
Will is be too expensive to call inet_ntop(inet_pton(...)) for canonicalization?

I imagine it could be done as only the very last step when it seems most likely 
to
succeed.

Or is it not reasonable to assume client platforms will have a library with
AF_INET6-capable inet_XtoY()s?

Original comment by ekl...@gmail.com on 12 Sep 2008 at 10:53

GoogleCodeExporter commented 9 years ago
I am working on the canonicalization.
Here is what I've got so far: <http://codereview.chromium.org/99183>

Original comment by eroman@chromium.org on 29 Apr 2009 at 11:33

GoogleCodeExporter commented 9 years ago
I have an idea.  Let's say we have a GURL for "http://[::1]/".

- If someone asks for .host(), we should always return the *bare* IPv6 address, 
with
no brackets: "::1"
- However, .spec() should contain the brackets: "http://[::1]/"

I believe this difference is very important, because the brackets are *not* 
part of
the address, they are merely the method of encoding the address into a URL.

Specifically, Chromium should be able to take .host() and pass it directly to
getaddrinfo() without having to manually strip the brackets.

Original comment by pma...@google.com on 7 May 2009 at 6:29

GoogleCodeExporter commented 9 years ago
It seems confusing to me that the host wouldn't include everything necessary to 
re-
create the URL by concatenating it with the other parts of the URL. This would 
be a 
departure form the way everything works and there are some cases in the code 
that we 
would now have to modify to be IPv6 aware that don't right now.

We already have GURL.PathForRequest. How about adding HostForRequest?

Original comment by bre...@gmail.com on 7 May 2009 at 9:30

GoogleCodeExporter commented 9 years ago
> How about adding HostForRequest?

Hm... that sounds good to me.  It will also be necessary to audit the Chromium 
code
and figure out which host function to call in each place.

Original comment by pma...@google.com on 7 May 2009 at 5:10

GoogleCodeExporter commented 9 years ago
> We already have GURL.PathForRequest. How about adding HostForRequest?

As an aside, note that chromium at least does not use this. Instead it is using 
HttpUtil::SpecForRequest().

> This would be a departure form the way everything works and there are some 
cases in the code that we 
> would now have to modify to be IPv6 aware that don't right now.

Yeah, there are definitely callers that aren't expecting that...
Wan-Teh also felt host() should return without brackets.

Original comment by eroman@chromium.org on 7 May 2009 at 7:16

GoogleCodeExporter commented 9 years ago
I agree with pmark's comment 3, but I also realize that changing host()
may break some existing code that builds a new URL or a host:port string
manually using the components of a URL.

Having two host functions would be a good solution.  Alternatively we
can audit code that manually constructs host:port strings to escape
host with [] if host contains a colon.

Original comment by wtc@chromium.org on 7 May 2009 at 8:31

GoogleCodeExporter commented 9 years ago
I've discovered another subtlety:

When a browser makes an HTTP request, it includes a "Host:" parameter, which 
may be
followed by a port number:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.23

Thus, a client will need to:
- Use 2001:db8::1 when doing an address lookup.
- Use [2001:db8::1] when generating the "Host" field for an HTTP request.

So, HostForRequest() would be a bad name.  My current suggestion is:
- leave host() as-is.
- Add a PlainHost() which strips brackets if they exist.

Original comment by pma...@google.com on 10 May 2009 at 12:21

GoogleCodeExporter commented 9 years ago
Another option is:
- Change host() to strip brackets if they exist
- Add a method for concatenating a host and a port,
  for use in the HTTP "Host" header, proxy list, etc.

What do you think

Original comment by wtc@chromium.org on 11 May 2009 at 6:38

GoogleCodeExporter commented 9 years ago
> What do you think

pmarks has proposed the method HostNoBrackets() 
<http://codereview.chromium.org/113187>

Original comment by eroman@chromium.org on 11 May 2009 at 6:44

GoogleCodeExporter commented 9 years ago
If we did do a host+port function, we'd have to consider when to include ":80". 
 I
believe that Firefox will include a :80 iff it was explicitly typed as part of 
the URL.

Original comment by pma...@google.com on 11 May 2009 at 6:59

GoogleCodeExporter commented 9 years ago
What is the host + port function needed for?

Original comment by bre...@gmail.com on 11 May 2009 at 7:26

GoogleCodeExporter commented 9 years ago
> What is the host + port function needed for?

wtc suggested it in comment 9.

host+port (e.g. "[2001:db8::1]:8080") will be needed for the HTTP "Host:" 
header.

However, the current plan (with host(), port(), and HostNoBrackets()) sounds 
good to
me, because it gives you the components to assemble whatever you need:

- For getaddrinfo(), use HostNoBrackets()
- For the "Host:" header, concatenate host() + port()
- For hostname canonicalization, use host()
- To reassemble the original URL, concatenate ... + host() + port() + ...

Original comment by sparkm...@gmail.com on 11 May 2009 at 7:50

GoogleCodeExporter commented 9 years ago
[sorry, I'm mixing up my accounts.  pmarks@google == sparkmaul@gmail]

Original comment by pma...@google.com on 11 May 2009 at 7:54

GoogleCodeExporter commented 9 years ago
Sounds good. I don't want to add too many crazy functions to GURL or it will 
become
too difficult to understand.

Original comment by bre...@gmail.com on 11 May 2009 at 8:09

GoogleCodeExporter commented 9 years ago
The issue with HostNoBrackets() is that users need to learn when
to use host() and when to use HostNoBrackets().

The issue with a host+port function is that users need to learn
they should not concatenate host and port manually.

I proposed a host+port function because it seems easier to learn
the latter, and variants of a host+port function already exist:
- GetHostAndPort: in "net/base/net_util.h".  These are parsing
  functions.
- ProxyServer::host_and_port: in "net/proxy/proxy_server.h"

It seems that host() should only be used if the host is to be
concatenated with a port manually.  Is there any other context
in which an IPv6 address literal also needs to be quoted with
square brackets?

Original comment by wtc@chromium.org on 11 May 2009 at 8:37

GoogleCodeExporter commented 9 years ago
I personally don't understand the interactions well enough to know what the 
impact
would be of changing the meaning of host().

For example, someone will have to figure out if the change breaks WebKit:
http://trac.webkit.org/browser/trunk/WebCore/platform/KURL.h

Original comment by pma...@google.com on 11 May 2009 at 8:49

GoogleCodeExporter commented 9 years ago
I guess the real issue here is deciding whether the canonical representation of 
an
IPv6 address in a Web context should be :: or [::].  Should all dialog boxes,
cookies, etc. include the brackets?  I think it's easier for everyone if the 
brackets
are included, rather than forcing everyone to mentally box and unbox the 
addresses
when moving between URLs and non-URLs.

A lot of code out there implicitly assumes that host:port can be concatenated, 
and
changing that would result in a lot of bugs.  IP address literals are rare 
enough in
practice that it's not worth all the re-education, both of programmers and 
users.

For the handful of cases where the host needs to be resolved into a 
machine-readable
address, we can use HostNoBrackets().  Everyone who only cares about the host 
as an
opaque token can just keep using host().

Original comment by pma...@google.com on 12 May 2009 at 7:57

GoogleCodeExporter commented 9 years ago
- IPV6 helper HostNoBrackets() committed in r102
- IPv6 canonicalizer committed in r104.

Original comment by eroman@chromium.org on 19 May 2009 at 10:33