dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.62k stars 4.56k forks source link

Uri.Query should not start with a question mark #84955

Open MattMcL4475 opened 1 year ago

MattMcL4475 commented 1 year ago

Description

Currently, the Query property of a Uri starts with a question mark ?:

var uri = new Uri("foo://example.com:8042/over/there?name=ferret#nose");
Console.WriteLine(uri.Query);
// Result: ?name=ferret

This is incorrect according to RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax. The question mark is a delimiter, and query does not include the question mark. .NET is the outlier for how it handles Uri.Query (see below), and it's been 18 years since the RFC 3986 standard. This inconsistent usage of the question mark is likely a causative factor for the common inconsistent usage of the question mark in Azure SAS tokens. Uri.Fragment has this same issue too (it also incorrectly starts with a delimiter #).

HTTP/1.1 [RFC 3986]

1.2.3. Hierarchical Identifiers The generic syntax uses the slash ("/"), question mark ("?"), and number sign ("#") characters to delimit components that are significant to the generic parser's hierarchical interpretation of an identifier.

3. Syntax Components

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
   |   _____________________|__
  / \ /         \
  urn:example:animal:ferret:nose

Notice that in Java, Go, Python, and PHP, the canonical URI implementations do NOT include the ? in the query:

Java

import java.net.URI;
import java.net.URISyntaxException;

public class Main {
    public static void main(String[] args) {
        try {
            URI uri = new URI("foo://example.com:8042/over/there?name=ferret#nose");
            System.out.println(uri.getQuery());
        } catch (URISyntaxException e) {
            System.out.println("Error: " + e.getMessage());
        }
    }
}

Result: name=ferret

Go

package main

import (
    "fmt"
    "net/url"
)

func main() {
    uriString := "foo://example.com:8042/over/there?name=ferret#nose"
    uri, err := url.Parse(uriString)

    if err != nil {
        fmt.Println("Error parsing URI:", err)
        return
    }

    fmt.Println(uri.Query().Encode())
}

Result: name=ferret

Python

from urllib.parse import urlparse
url = urlparse("foo://example.com:8042/over/there?name=ferret#nose")
print(url.query)

Result: name=ferret

PHP

<?php
$url = "foo://example.com:8042/over/there?name=ferret#nose";
$query = parse_url($url, PHP_URL_QUERY);
print_r($query);
?>

Result: name=ferret

Reproduction Steps

var uri = new Uri("foo://example.com:8042/over/there?name=ferret#nose");

if (uri.Query.StartsWith('?'))
{
    throw new Exception("According to RFC 3986, a URI's query segment shall not include the question mark");
}

Expected behavior

It's expected that the question mark ? is not included in the string returned by uri.Query, so we should expect:

var uri = new Uri("foo://example.com:8042/over/there?name=ferret#nose");
Console.WriteLine(uri.Query);
// name=ferret

However, a breaking change is not the correct solution to this (please see the Proposed Solution section below).

Actual behavior

Currently, the question mark is included in uri.Query:

var uri = new Uri("foo://example.com:8042/over/there?name=ferret#nose");
Console.WriteLine(uri.Query);
// ?name=ferret

Regression?

No, Uri.Query in .NET Framework 1.1 (February 2002) pre-dates RFC 3986 (January 2005). Although it's been a considerable amount of time, .NET is the outlier for how it handles Uri.Query, and it's been 18 years since the RFC 3986 standard. This inconsistent usage of the question mark is likely a causative factor for the common inconsistent usage of the question mark in Azure SAS tokens.

Known Workarounds

uri.Query.TrimStart('?');

Configuration

No response

Other information

The root cause is here: https://github.com/dotnet/runtime/blob/9aefa9daa141bb7d9ba3f2b373d4b050c9b243fe/src/libraries/System.Private.Uri/src/System/Uri.cs#L1091

| UriComponents.KeepDelimiter causes the delimiter to be included in both Uri.Query and Uri.Fragment.

Proposed Solution

  1. Add Uri.QueryRfc3986 and Uri.FragmentRfc3986 that do not start with delimiters
  2. Annotate Uri.Query with the attribute: [Obsolete(message: "This starts with the question mark delimiter ('?'), which is non-compliant with RFC 3986. Use QueryRfc3986 instead.", error: false)]
  3. Annotate Uri.Fragment with the attribute: [Obsolete(message: "This starts with the hash delimiter ('#'), which is non-compliant with RFC 3986. Use FragmentRfc3986 instead.", error: false)]

The result would then look like this:

var uri = new Uri("foo://example.com:8042/over/there?name=ferret#nose");

#pragma warning disable CS0612 // Type or member is obsolete
Console.WriteLine(uri.Query);
// Result: ?name=ferret
#pragma warning restore CS0612

Console.WriteLine(uri.QueryRfc3986);
// Result: name=ferret

#pragma warning disable CS0612 // Type or member is obsolete
Console.WriteLine(uri.Fragment);
// Result: #nose
#pragma warning restore CS0612

Console.WriteLine(uri.FragmentRfc3986);
// Result: nose
MihaZupan commented 1 year ago

cc: @dotnet/ncl

Yes, according to RFC3986, query is the value following the ? separator. But I would argue that a lot of people (myself included) think of the query as including the leading ?.

The exact same quirk applies to the Fragment property. The # character is considered as only the delimiter by the spec. By extension, APIs on UriBuilder follow a similar pattern. Note that UriBuilder.Query/Fragment accept values without the leading ?/# and will add one for you if not already present. ASP.NET also exposes a QueryString type that always includes the leading separator. .NET behavior should continue to match between all of these.

As you point out, Uri.Query has behaved this way for a very long time, and yet I don't see other issues in this repo questioning the behavior. A rough GitHub code search shows tens of thousands of uses of the property, where ~10% of those are followed by some combination of .Substring(1), .TrimStart('?'), .Replace("?", ""). A change in behavior is unacceptable in this case, as it would break tons of existing uses, including those currently using workarounds such as Substring(1). I also do not believe that possible confusion due to the extra ? warrants obsoleting this API. Misuse in the minority case should be easy to catch by running the application/test. Augmenting the API documentation (especially IntelliSense) to point out this behavior would be a useful addition though:

-Gets any query information included in the specified URI.
+Gets any query information included in the specified URI, including the leading '?' character if not empty.
-Gets the escaped URI fragment.
+Gets the escaped URI fragment, including the leading '#' character if not empty.

While comparing the behavior of different language standard libraries, JavaScript does include the leading ?/#:

new URL("http://example.com:8042/over/there?name=ferret#nose").search
> '?name=ferret'
new URL("http://example.com:8042/over/there?name=ferret#nose").hash
> '#nose'

How impactful are the inconsistencies in Azure Storage's documentation in practice? My (of course biased) view is that developers working with REST APIs are/should be aware of what a ? in a url is.

With updates to the documentation to point out that leading delimiters are included by these properties, do you still believe a new API would lead to less confusion? An additional API represents another thing a developer has to consider, where a workaround such as .TrimStart('?') is both short and very clear in its intent.

MattMcL4475 commented 1 year ago

Thanks for your thoughtful reply @MihaZupan

Great point about Uri.Fragment. This is also incorrect according to RFC 3986, and I think it should also be marked [Obsolete] in favor of Uri.FragmentRfc3986. I've updated the issue above. I do think that this new API would lead to less confusion, since it informs the developer to carefully consider the implications of the .NET Uri implementation. Visual Studio will also add a green underline if they select the legacy Query or Fragment. This will make it less likely that a developer will make a Type II error (false negative) that results in a defect later in the SDLC. Later might be after a unit test runs, or it could be in production. It's cheapest to catch these defects in the IDE the first time the line of code is written.

I think .NET should strive to follow industry standards such as RFC 3986, especially in this case where all other major languages are compliant with it. I don't find an appeal to tradition persuasive in this case. In fact, I think that .NET's evolution away from its distinct, proprietary, closed-source traditions to its current fully open-source, cross-platform nature, supporting the latest standards, should be celebrated and continued. I also think that .NET should aim to minimize .NET idiosyncrasies.

I agree that a breaking change would be inappropriate. However, I think .NET should be actively discouraging developers from using Uri.Query and Uri.Fragment, and thus mark them [Obsolete]. You said that a lot of people, including yourself, think of the query as including the leading ?. I think this actually reinforces the underlying point, that the .NET community may have a mental model of query that is not aligned with either the standard nor the rest of the major programming languages, due to a non-compliant API that many of us have never questioned, and mistakenly assumed is correct. I don't think JavaScript is a valid comparison, since search and hash are different terms with different meanings. The main term I am taking issue with is query, which has a specific meaning as defined in RFC 3986.

I agree with updating the documentation and summary tag as well.

It's not just the Azure documentation that is inconsistent. For example, here is an Azure API that returns a SAS token with no question mark (correct). But, if you use the Azure Portal to generate a SAS token, they do include a question mark (incorrect). I think the Azure example is important and is in Microsoft's interest to resolve - but it's not central to this issue - more a symptom of the inconsistencies this issue may be causing.

MattMcL4475 commented 1 year ago

@MihaZupan can I edit this issue to make it an API proposal, or should I create a new issue for it and reference this discussion? Even if the team considers it low priority, I'd still like it to be in the backlog, for if/when the decision is made to make .NET Uri compliant with RFC 3986. Thank you!

MihaZupan commented 1 year ago

Obsoletion is a big hammer that should be reserved for cases where the API is irreparably broken, or where it's believed that the API is virtually unused and a better alternative exists. The difference of an extra ? does not meet that bar in my opinion. The majority of existing uses are also perfectly fine with, or rely on ? being present.

I would consider the proposed alternative to be even more error-prone. A nice benefit of existing behavior is that in most cases you don't have to deal with separators manually. If there is no query, you'll get back an empty string. Same for fragment.

Consider an existing use like

$"{uri.AbsolutePath}{uri.Query}{uri.Fragment}"

If the user is hit with the obsoletion warning and updates the code to use the "good" properties, they may end up with

$"{uri.AbsolutePath}{uri.QueryRfc3986}{uri.FragmentRfc3986}"
// or
$"{uri.AbsolutePath}?{uri.QueryRfc3986}#{uri.FragmentRfc3986}"

Both of which are functionally wrong.

The "correct" alternative for those users is substantially worse.

$"{uri.AbsolutePath}/{(uri.QueryRfc3986 == "" ? "" : $"?{uri.QueryRfc3986}")}{(uri.FragmentRfc3986 == "" ? "" : $"#{uri.FragmentRfc3986}")}"

I don't see aligning with some other languages to be of sufficient value to force those users into rewriting their code, ending up with a non-trivial amount of extra, or more error-prone code.

Potential misuse of the existing API in practice should be caught quickly on first use/test.

Given how trivial workarounds are, I see little value in adding an additional API here. We could consider it if it turns out that this is a common thing for people to run into though.


can I edit this issue to make it an API proposal

Feel free to update the original post. We can keep the issue open to see if more people are running into this.

Including Rfc3986 in the name is problematic IMO as it does not convey how it's different from the existing property. Both use the same encoding, following the same standard, the only difference is the leading delimiter for non-empty values.

What would happen with APIs on UriBuilder? It also exposes Query and Fragment properties with setters. What about ASP.NET Core's QueryString type and all the APIs that work with it? Making any of these inconsistent with each other is a net negative for the platform.


I don't find an appeal to tradition persuasive in this case.

If you're referring to this part of my comment As you point out, Uri.Query has behaved this way for a very long time, and yet I don't see other issues in this repo questioning the behavior, the point I was trying to make is that this doesn't appear to be such a widespread issue that is very common for people to run into.

MattMcL4475 commented 1 year ago

Agreed that making changes to this would have a ripple effect across many .NET classes and libraries and require examining all of them in detail.

Regarding this: $"{uri.AbsolutePath}{uri.Query}{uri.Fragment}"

uri.PathAndQueryAndFragment could be added to the API, since uri.PathAndQuery already exists.

Or, perhaps ToAbsolutePathQueryFragment() would be better.

Either one could encapsulate the concatenation logic within the Uri class itself.

Either way, if obsoletion is too big a hammer, and addition of a new API too onerous or duplicative - perhaps also updating the documentation to indicate that Query and Fragment are non-compliant with RFC 3986 would be helpful. Regardless of how frequently C# developers discover this quirk, or how trivial it may be to work around - I still think it's important to note that the widely-used C# class Uri doesn't comply with a fundamental industry standard followed by all other major web programming languages.

MihaZupan commented 1 year ago

perhaps also updating the documentation to indicate that Query and Fragment are non-compliant with RFC 3986 would be helpful

Isn't that what https://github.com/dotnet/dotnet-api-docs/pull/8977 is?

MattMcL4475 commented 1 year ago

Thanks @MihaZupan for your help on this. I just noticed in that doc, that it explicitly says the query information is escaped according to RFC 3986. It mentions "RFC 3986" 15 times:

curl -s https://raw.githubusercontent.com/dotnet/dotnet-api-docs/main/xml/System/Uri.xml | grep -o -i "RFC 3986" | wc -l

To me, this seems even more paradoxical that query and fragment are non-compliant with RFC 3986. Perhaps it would make more sense with a better understanding of the full commit history of this file.

Also, it omits explicitly stating that Query or Fragment themselves are non-compliant with RFC 3986 (seems like an error by omission to me). What do you think about making this explicit in the summary tag? https://github.com/dotnet/dotnet-api-docs/pull/8985/files

MihaZupan commented 1 year ago

Also, it omits explicitly stating that Query or Fragment themselves are non-compliant with RFC 3986 (seems like an error by omission to me). What do you think about making this explicit in the summary tag?

They do work according to the RFC, with the exception of also including the leading separator, which the message now calls out.

What is a user supposed to do with the information that 'Query' is non-compliant with RFC 3986.?

I think it just makes it a lot more confusing. An average user shouldn't have to be aware of what RFC 3986 is, and even if they are, what does "non-compliant" mean? In this case, it's that it includes the separator, which the sentence right before already says.

MattMcL4475 commented 1 year ago

I think it just makes it a lot more confusing. An average user shouldn't have to be aware of what RFC 3986 is, and even if they are, what does "non-compliant" mean? In this case, it's that it includes the separator, which the sentence right before already says.

We can agree to disagree :) I don't feel strongly about what the language is, or whether the user is told this information via an obsoletion message or Quick Info message, but the point would be to make the user aware what the true definition of query or fragment is, so they don't mistakenly misuse it and cause a bug, or documentation bug, or perpetuate the inaccurate definition of these terms.

MihaZupan commented 1 year ago

If you feel strongly about this, feel free to add a note along the lines of

Note that the property includes the leading delimiter (`?`), whereas the Uri specification (RFC 3986) recognizes the query as the portion of a Uri without the delimiter.

under the Remarks section in the documentation.