Open FailedShack opened 4 years ago
It seems like browsers also leave empty segments (//) intact, while resolving/simplifying . and .. segments. System.Uri behaves in the same way.
@antiufo I believe it actually does make sense for browsers to leave them in, as this kind of URI normalization does technically change semantics and it's up to the server to handle it in some way or another. I would however hope no-one relies on that kind of behavior.
I think it's desirable to be able to perform this normalization in some way, given that the most common choice is to handle these URIs the same way; Apache2 does it by default, for example. The truth is that while it is valid to have empty path segments in a URL, they are usually the result of bugs in client-side code. They are also against the recommendations of RFC 1630, which states, and I quote:
The path is interpreted in a manner dependent on the protocol being used. However, when it contains slashes, these must imply a hierarchical structure.
/// does not imply a hierarchical structure; you would have empty folder names.
In any case, the documentation is incorrect in stating that this normalization is applied.
On the other hand, normalizing empty segments would mean that .NET would be unable to represent URLs that browsers are able to represent.
While the RFC might recommend against empty segments, they are sometimes used, and not just by erroneous client software.
https://en.wikipedia.org/wiki/// (disambiguation page for double slash character on Wikipedia) is for example a valid URL that contains two empty segments.
I agree the documentation should probably just be amended here. However, it would be nice if there was a utility method available to perform this normalization if desired. Otherwise, you have to rely on a workaround like this:
var parts = uri.PathAndQuery.Split(new char[] { '?' }, 2);
parts[0] = Regex.Replace(parts[0], "/+", "/");
uri.PathAndQuery = string.Join("?", parts);
It's not terrible but it feels a little roundabout and it's not entirely obvious.
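For what it's worth, a utility along these lines could look roughly like the sketch below. This is only an illustration, not an existing System.Uri API: the class and method names (UriNormalization, CollapseEmptySegments) are made up, and it goes through UriBuilder because System.Uri.PathAndQuery is read-only. Only the path is rewritten, so slashes in the query string are left alone.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; nothing like this ships with System.Uri.
public static class UriNormalization
{
    // Returns a copy of the URI whose path has runs of '/' collapsed
    // into a single '/'. Query and fragment are not modified.
    public static Uri CollapseEmptySegments(Uri uri)
    {
        var builder = new UriBuilder(uri);

        // builder.Path holds only the path, so the regex cannot touch the query.
        builder.Path = Regex.Replace(builder.Path, "/+", "/");

        return builder.Uri;
    }
}
```

With this, http://example.com//oops//// comes out as http://example.com/oops/. The trailing slash is kept, so trimming it would have to be a separate step if that is wanted.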
Triage: We should decide whether to update the docs or fix it ...
@FailedShack Out of curiosity, what is the context where you see yourself needing to remove empty segments? Do you have external data that erroneously contains double slashes? Are you composing URLs by concatenating path components?
It might be counter-intuitive to some, but not any more than typing "https://www.reddit.com/r//programming" in your browser's address bar and ending up with a 404 page...
It's a mix of both. I'm handling requests from an external application that improperly concatenates path components, so you end up with instances of double slashes. My application routes requests in a way similar to Flask or Spring Boot, so when a request contains double slashes, the route patterns fail to match.
You can see an example of routing here. Routing patterns are glob-like. Here's where we currently fitted our normalization code cited in my previous comment.
@FailedShack
...I'm handling requests from an external application that improperly concatenates path components...
Would it be possible to fix the improper concatenation at the source then?
I was looking to use System.Uri to remove empty segments from URLs as its documentation claims. However, it appears that empty segments introduced by two consecutive slashes are not removed.
A URL like http://example.com//oops//// should become http://example.com/oops. This is reproducible on the latest version of .NET Core (3.0.100) as well as .NET Framework 4.5.