hyspace opened this issue 5 years ago · Status: Open
In general, the System.Uri class in .NET Framework and .NET Core aligns with IETF RFC 3986 and RFC 3987. The WHATWG URL Standard is not something we have considered at this point.
We can consider it in the future as that standard becomes more mainstream. I also don't understand why WHATWG is not incorporating these interesting updates into the official IETF RFC standards.
@karelz @wtgodbe
> I also don't understand why WHATWG is not incorporating these interesting updates into the official IETF RFC standards.
In short, the RFC release process is too slow compared with how fast web technology evolves today. [1] Before a new URL standard becomes an RFC, all major browsers may have already implemented the WHATWG Standard for years.

The WHATWG Standard has already become the de facto standard today. Falling behind it may result in security vulnerabilities for C# applications. For example, consider a program that uses System.Uri to detect malware links in web pages. Today, intentionally malformed links will be considered "invalid" by System.Uri, yet browsers can open them correctly. If this program wants to parse those links correctly, it needs to stop using System.Uri.
Following the WHATWG URL Living Standard would be very different from implementing an RFC. An RFC is fixed until a new one comes out; a living standard, on the other hand, changes over time. To implement a living standard, we would need to publish updates frequently to keep up with it. From this perspective, I'm not sure whether .NET Framework or .NET Core should really take this approach. Letting the community implement the WHATWG Standard on their own is not unacceptable, but it would become a burden for them.
There should not be any compatibility concern in following the WHATWG Standard. One of the major principles of the WHATWG Standard is backward compatibility; usually only new features are added when the standard is updated.
[1] History
Are there any specific use cases you're targeting that the current Uri does not support?
@scalablecory
Generally there are two use cases we are targeting that the current Uri does not support.

I would like to use the same example from the issue:

http:////example.com///

A browser is able to open it, but Uri says it is invalid.
Another example:

http://example.com/path
http://example.com/\tpath
http:////example.com/path
http://examp\nle.com/path

Are those URLs the same or not? Uri says the first two are valid but different, and the last two are not valid. But in fact, a browser will open the same page for all four URLs. (\t and \n denote TAB and LF here.)
How a URL class should parse those URLs is clearly defined in the WHATWG URL Living Standard today.
By the way, we are a team under Experiences and Devices, and I have shared detailed use cases internally.
I, too, have noticed that Uri is not good enough for certain web workloads where you deal with dirty data.
To me, it seems that the dirtiness of the real web cannot be captured reasonably in Uri. This should stay out of scope.
It is correct and good that AngleSharp implements its own Uri class. Their idea of what constitutes a Uri is quite different from what the framework should implement.
Triage: Should be part of the overall Uri modernization effort. Sounds like a reasonable direction to follow.
@karelz Could you provide more information about the plan for "Uri modernization"?
@hyspace we don't have more information or plans at the moment -- we just know we need to modernize the space, fix bugs, look at new standards, etc.
Note that http:////example.com/// is a valid absolute URI according to RFC 3986, so this is in fact a bug, not a feature request. Here is the production:
```
absolute-URI
  scheme          = "http"
    ALPHA           'h'
    *( ALPHA )      't' 't' 'p'
  ":"
  hier-part       = "////example.com///"
    "//"
    authority     = ""
      host        = ""
        reg-name    (empty: *( ... ) matches zero characters)
    path-abempty  = "//example.com///"
      "/" segment   *pchar = "" (empty)
      "/" segment   *pchar = "e" "x" "a" "m" "p" "l" "e" "." "c" "o" "m"
      "/" segment   *pchar = ""
      "/" segment   *pchar = ""
      "/" segment   *pchar = ""
```
Here's the consolidated, relevant grammar from RFC 3986, for reference.
```
absolute-URI  = scheme ":" hier-part [ "?" query ]
scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
hier-part     = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty
query         = *( pchar / "/" / "?" )
authority     = [ userinfo "@" ] host [ ":" port ]
userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
host          = IP-literal / IPv4address / reg-name
port          = *DIGIT
reg-name      = *( unreserved / pct-encoded / sub-delims )
path-abempty  = *( "/" segment )
segment       = *pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="
```
Specific URI schemes are allowed to place further restrictions on the generic grammar of RFC 3986. The http(s) schemes are registered in RFC 7230, which states:
> A sender MUST NOT generate an "http" URI with an empty host identifier. A recipient that processes such a URI reference MUST reject it as invalid.
Indeed, RFC 3986 explicitly mentions that the http scheme considers an empty host value invalid.
Therefore, per the RFCs, System.Uri is correct to reject such URIs for the http(s) schemes. If you change the scheme in that URL to something like "bogus", then System.Uri will not do that additional validation, and will parse it as you show.
Even the WHATWG URL standard does not consider "http:////example.com///" to be a valid URL string, but it does specify an algorithm for how to parse such invalid URLs. That said, the WHATWG standard is a bit weird in that it admits the existence of some URLs that have no valid equivalent (since it considers username and password to never be allowed in a valid URL string).
Hi all,

We noticed that System.Private.Uri is not following the WHATWG URL Living Standard, which results in parsing results that differ from major browsers. For example, validating

http:////example.com///

will return false, but the Standard considers http:////example.com/// valid input and will correct it to http://example.com///. If you try this example URL in any major browser, you will get the same result as the Standard defines. This kind of difference makes it difficult to use .NET and C# in web-browsing-related scenarios, like crawlers or HTML parsers.

The AngleSharp project is one of the most commonly used C# HTML parsers. Today, they have to implement their own Url class to be able to parse URLs the way the Standard defines. I think it would be much better if the C# core library could handle this correctly.

Along with the Standard, there is a set of URL test cases for browsers and web developers to verify their implementations of the Standard. Today System.Private.Uri fails many of those tests, including the example above.

Is there any plan to make System.Private.Uri follow the WHATWG URL Standard?