dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[Uri] Support WHATWG URL standard #29839

Open hyspace opened 5 years ago

hyspace commented 5 years ago

Hi all,

We noticed that System.Private.Uri does not follow the WHATWG URL Living Standard, which results in parsing behavior that differs from major browsers.

For example:

Uri.TryCreate("http:////example.com///", UriKind.Absolute, out var uri)

will return false, but the standard accepts http:////example.com/// as input and normalizes it to http://example.com///. If you try this example URL in any major browser, you will get the same result the standard defines.
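For illustration, here is a minimal, self-contained sketch of the call above; the commented results reflect the behavior reported in this issue, not a guarantee about every runtime version.

using System;

class WhatwgUriExample
{
    static void Main()
    {
        // System.Uri rejects the empty host implied by the extra slashes.
        bool ok = Uri.TryCreate("http:////example.com///", UriKind.Absolute, out var uri);

        Console.WriteLine(ok);          // False on current .NET, per this issue
        Console.WriteLine(uri is null); // True

        // A WHATWG-conforming parser (what browsers use) would instead
        // normalize the input to http://example.com///.
    }
}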

This kind of difference makes it difficult to use .NET and C# in web-browsing-related scenarios such as crawlers or HTML parsers.

The AngleSharp project is one of the most commonly used C# HTML parsers. Today, they have to implement their own Url class to be able to parse URLs the way the standard defines. I think it would be much better if the C# core library could handle this correctly.

Along with the standard, there is a set of URL test cases for browsers and web developers to verify their implementations against it. Today System.Private.Uri fails many of those tests, including the example above.

Is there any plan for System.Private.Uri to follow the WHATWG URL Standard?

davidsh commented 5 years ago

In general, the System.Uri class in .NET Framework and .NET Core aligns with IETF RFC 3986 and RFC 3987. The WHATWG URL Standard is not something we have considered at this point.

We can consider it in the future as that standard becomes more mainstream. I also don't understand why WHATWG is not incorporating these interesting updates into the official IETF RFC standards.

davidsh commented 5 years ago

@karelz @wtgodbe

hyspace commented 5 years ago

> I also don't understand why WHATWG is not incorporating these interesting updates into the official IETF RFC standards.

In short, the RFC process is too slow compared to how fast web technology evolves today [1].

By the time a new URL standard becomes an RFC, all major browsers may have already implemented the WHATWG Standard for years.

The WHATWG Standard has already become the de facto standard today. Lagging behind it may result in security vulnerabilities for C# applications. For example, consider a program that uses System.Uri to detect malware links in web pages. Today, intentionally malformed links are considered "invalid" by System.Uri, yet browsers open them just fine. If such a program wants to parse those links correctly, it has to stop using System.Uri.

Following the WHATWG URL Living Standard would be very different from implementing an RFC. An RFC is fixed until a new one comes out; a living standard, on the other hand, changes over time. Implementing a living standard means publishing updates frequently to keep up with it. From that perspective, I'm not sure whether .NET Framework or .NET Core should really take this approach. Letting the community implement the WHATWG Standard on their own is not unacceptable, but it would be a burden for them.

There should not be any compatibility concern in following the WHATWG Standard. One of its major principles is backward compatibility; usually only new features are added when the standard is updated.

[1] History

scalablecory commented 5 years ago

Are there any specific use cases you're targeting that current Uri does not support?

hyspace commented 5 years ago

@scalablecory Generally, there are two use cases we are targeting that the current Uri does not support:

1. We want to validate whether a URL can be opened in major browsers.

I would like to use the same example as in the issue:

http:////example.com///

Browsers are able to open it, but Uri says it is invalid.

2. We want to know whether two URLs actually point to the same destination.

Example

http://example.com/path
http://example.com/\tpath
http:////example.com/path
http://examp\nle.com/path

Are those URLs the same or not? Uri says the first two are valid but different, and the last two are not valid. In practice, however, browsers open the same page for all four URLs. (\t and \n denote tab and newline characters here.)

How a URL class should parse those URLs is clearly defined in the WHATWG URL Living Standard today.
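A small sketch of this comparison with System.Uri; the test strings come from the list above, and the commented results restate what is described in this comment rather than guaranteed output on every runtime version.

using System;

class UrlEquivalenceCheck
{
    static void Main()
    {
        string[] urls =
        {
            "http://example.com/path",
            "http://example.com/\tpath",     // embedded tab
            "http:////example.com/path",     // extra slashes before the host
            "http://examp\nle.com/path",     // embedded newline
        };

        foreach (string raw in urls)
        {
            if (Uri.TryCreate(raw, UriKind.Absolute, out var uri))
                Console.WriteLine($"valid:   {uri.AbsoluteUri}");
            else
                Console.WriteLine($"invalid: {raw.Replace("\t", "\\t").Replace("\n", "\\n")}");
        }

        // Per this comment: the first two parse but compare as different,
        // the last two are rejected, yet browsers (following the WHATWG URL
        // parser) open the same page for all four.
    }
}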

By the way, we are a team under Experiences and Devices; I have shared detailed use cases internally.

GSPP commented 5 years ago

I, too, have noticed that Uri is not good enough for certain web workloads where you deal with dirty data.

To me, it seems that the dirtiness of the real web cannot be captured reasonably in Uri. This should stay out of scope.

It is correct and good that AngleSharp implements its own Uri class. Their idea of what constitutes a Uri is quite different from what the framework should implement.

karelz commented 5 years ago

Triage: Should be part of overall Uri modernization effort. Sounds like reasonable direction to follow.

hyspace commented 5 years ago

@karelz Could you provide more information about the plan for "Uri modernization"?

karelz commented 5 years ago

@hyspace we don't have more information or plans at the moment -- we just know we need to modernize the space, fix bugs, look at new standards, etc.

SergioBenitez commented 3 years ago

Note that http:////example.com/// is a valid absolute URI according to RFC 3986, so this is in fact a bug, not a feature request. Here is the production:

scheme
    ALPHA 'h'
    *ALPHA 't' 't' 'p'
':'
hier-part
    "//"
    authority
        host
            reg-name (empty)
    path-abempty
        "/"
        "/"
        "/"
        "/" segment
            *pchar
                "e" "x" "a" "m" "p" "l" "e" "." "c" "o" "m"
        "/"
        "/"
        "/"

Here's the consolidated and relevant grammar from RFC 3986, for reference.

absolute-URI  = scheme ":" hier-part [ "?" query ]

scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

hier-part     = "//" authority path-abempty
             / path-absolute
             / path-rootless
             / path-empty

query         = *( pchar / "/" / "?" )

authority     = [ userinfo "@" ] host [ ":" port ]
userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
host          = IP-literal / IPv4address / reg-name
port          = *DIGIT

reg-name      = *( unreserved / pct-encoded / sub-delims )

path-abempty  = *( "/" segment )

segment       = *pchar

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
             / "*" / "+" / "," / ";" / "="

KevinCathcart commented 3 years ago

Specific URI schemes are allowed to place further restrictions on the generic grammar of RFC 3986. The http(s) schemes are registered in RFC 7230, which states:

> A sender MUST NOT generate an "http" URI with an empty host identifier. A recipient that processes such a URI reference MUST reject it as invalid.

Indeed, RFC 3986 explicitly mentions that the http scheme considers an empty host value invalid.

Therefore, per the RFCs, System.Uri is correct to reject such a URI for the http(s) schemes. If you change the scheme in that URL to something like "bogus", then System.Uri will not perform that additional validation and will parse it as you show.
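A brief sketch of that scheme-dependent behavior; "bogus" is just an arbitrary unregistered scheme, and the commented results restate what this comment describes rather than guaranteed output.

using System;

class SchemeSpecificValidation
{
    static void Main()
    {
        // http(s) adds scheme-specific validation: an empty host is rejected.
        bool httpOk = Uri.TryCreate("http:////example.com///", UriKind.Absolute, out _);

        // An unregistered scheme gets only the generic RFC 3986 parsing,
        // so the empty authority is tolerated and "example.com" stays in the path.
        bool bogusOk = Uri.TryCreate("bogus:////example.com///", UriKind.Absolute, out var bogus);

        Console.WriteLine(httpOk);               // False, per this thread
        Console.WriteLine(bogusOk);              // True, per this comment
        Console.WriteLine(bogus?.AbsolutePath);  // expected: //example.com///
    }
}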

Even the WHATWG URL standard does not consider "http:////example.com///" to be a valid URL string, but it does specify an algorithm for how to parse such invalid URLs. That said, the WHATWG standard is a bit weird in that it admits the existence of some URLs that have no valid equivalent (since it considers username and password to never be allowed in a valid URL string).