dotnet / runtime

[Uri] System.Uri does not compact empty path segments #31300

Open FailedShack opened 4 years ago

FailedShack commented 4 years ago

I was looking to use System.Uri to remove empty segments from URLs as its documentation claims.

As part of canonicalization in the constructor for some schemes, dot-segments and empty segments (/./, /../, and //) are compacted (in other words, they are removed). The schemes for which URI will compact these sequences include http, https, tcp, net.pipe, and net.tcp.

However, it appears that empty segments introduced by two consecutive slashes are not removed.

[Image: SystemUri]

A URL like http://example.com//oops//// should become http://example.com/oops. This is reproducible on the latest version of .NET Core (3.0.100) as well as .NET Framework 4.5.
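A minimal sketch of the behavior being reported, written as C# top-level statements. The first URL is an illustrative one that does not appear in the documentation; the second is the one from this report.

```csharp
using System;

// Dot-segments are compacted during canonicalization...
var dotted = new Uri("http://example.com/a/./b/../c");
Console.WriteLine(dotted.AbsolutePath);   // /a/c

// ...but empty segments introduced by consecutive slashes are kept.
var doubled = new Uri("http://example.com//oops////");
Console.WriteLine(doubled.AbsolutePath);  // //oops////
```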

antiufo commented 4 years ago

It seems like browsers also leave empty segments (//) intact, while resolving/simplifying . and .. segments. System.Uri behaves in the same way.

FailedShack commented 4 years ago

@antiufo I believe it actually does make sense for browsers to leave them in, as this kind of URI normalization does technically change semantics and it's up to the server to handle it in some way or another. I would however hope no-one relies on that kind of behavior.

I think it's desirable to be able to perform this normalization in some way, given that the most common choice is to treat these URIs as equivalent; Apache2 does so by default, for example. The truth is that while it is valid to have empty path segments in a URL, they are usually the result of bugs in client-side code. They also go against the recommendations of RFC 1630, which states, and I quote:

The path is interpreted in a manner dependent on the protocol being used. However, when it contains slashes, these must imply a hierarchical structure.

/// does not imply a hierarchical structure: you would have empty folder names.

In any case, the documentation is incorrect in stating that this normalization is applied.

antiufo commented 4 years ago

On the other hand, normalizing empty segments would mean that .NET would be unable to represent URLs that browsers are able to represent.

While the RFC might recommend against empty segments, they are sometimes used, and not just by erroneous client software.

https://en.wikipedia.org/wiki/// (disambiguation page for double slash character on Wikipedia) is for example a valid URL that contains two empty segments.

FailedShack commented 4 years ago

I agree the documentation should probably just be amended here. However, it would be nice if there was a utility method available to perform this normalization if desired. Otherwise, you have to rely on a workaround like this:

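// Assumes `uri` is a System.Uri; needs System.Text.RegularExpressions.
// Uri.PathAndQuery is read-only, so the result is resolved into a new Uri.
var parts = uri.PathAndQuery.Split(new char[] { '?' }, 2);
parts[0] = Regex.Replace(parts[0], "/+", "/");   // collapse slashes in the path only
uri = new Uri(uri, string.Join("?", parts));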

It's not terrible but it feels a little roundabout and it's not entirely obvious.
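For illustration, the kind of utility method being asked for could look roughly like the sketch below. `CollapseEmptySegments` is a hypothetical name, not an existing System.Uri API; the sketch assumes an absolute URI and relies on UriBuilder exposing the path separately from the query and fragment.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of System.Uri): collapses runs of '/' in the path only.
static Uri CollapseEmptySegments(Uri uri)
{
    // UriBuilder keeps Path separate, so the query and fragment are left untouched.
    var builder = new UriBuilder(uri)
    {
        Path = Regex.Replace(uri.AbsolutePath, "/{2,}", "/")
    };
    return builder.Uri;
}

var cleaned = CollapseEmptySegments(new Uri("http://example.com//oops////?next=a//b"));
Console.WriteLine(cleaned.AbsolutePath);  // /oops/
Console.WriteLine(cleaned.Query);         // ?next=a//b
```

Note that this only collapses consecutive slashes; a single trailing slash is kept, so stripping it would be a separate decision.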

karelz commented 4 years ago

Triage: We should decide whether to update the docs or fix it ...

antiufo commented 4 years ago

@FailedShack Out of curiosity, what is the context where you see yourself needing to remove empty segments? Do you have external data that erroneously contains double slashes? Are you composing URLs by concatenating path components?

It might be counter-intuitive to some, but not any more than typing "https://www.reddit.com/r//programming" in your browser's address bar and ending up with a 404 page...

FailedShack commented 4 years ago

It's a mix of both: I'm handling requests from an external application that improperly concatenates path components, so you end up with instances of double slashes. My application routes requests in a way similar to Flask or Spring Boot, so when requests contain double slashes, patterns fail to match.

You can see an example of routing here. Routing patterns are glob-like. Here's where we currently fitted our normalization code cited in my previous comment.
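To illustrate why an empty segment breaks this style of routing, here is a rough sketch of a segment-by-segment matcher. It is a hypothetical stand-in, not the actual routing code referred to above; `Matches` and the `*` wildcard semantics are assumptions made for the example.

```csharp
using System;
using System.Linq;

// Hypothetical matcher: "*" matches any single non-empty segment,
// every other pattern segment must match literally.
static bool Matches(string pattern, string path)
{
    var want = pattern.Split('/');
    var have = path.Split('/');
    return want.Length == have.Length
        && want.Zip(have, (w, h) => w == "*" ? h.Length > 0 : w == h).All(ok => ok);
}

Console.WriteLine(Matches("/users/*/profile", "/users/42/profile"));   // True
Console.WriteLine(Matches("/users/*/profile", "/users//42/profile"));  // False
```

The extra slash adds an empty segment, so the segment counts no longer line up and the pattern is rejected.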

julealgon commented 3 weeks ago

@FailedShack

...I'm handling requests from an external application that improperly concatenates path components...

Would it be possible to fix the improper concatenation at the source then?