dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.15k stars 4.71k forks source link

[Uri] System.Uri rejects otherwise valid strings with length >= 65520 #1857

Closed airbreather closed 4 years ago

airbreather commented 4 years ago

The System.Uri class seems to reject URI strings greater than a strangely short length (experimentally, 65520 or higher, so probably this constant). I'm not able to find any documentation of this fact at https://docs.microsoft.com/en-us/dotnet/api/system.uri?view=netcore-3.1, nor does RFC 3986 discuss length restrictions (though I went through those two sources very quickly, so this could be my error).

Most URIs I've seen are short enough, but data URIs can easily be imagined to exceed this limit.

CA1056:UriPropertiesShouldNotBeStrings and related rules have been guiding me towards using this class to represent URI strings for a while now; their documentation does not list "you want to support URI strings that exceed 65519 characters in length" as good reasons to suppress warnings.

In my perfect world, I would appreciate it if the System.Uri class could represent all URI values that are legal in RFC 3986 (or at least the subset that can be represented by System.String).

I totally understand why an artificial restriction might appear as a result of engineering trade-offs, so perhaps a decent first step might be to document it better?

Repro program

The following C# program writes out the longest URI I could manage.

Incrementing the commented line causes it to throw an exception:

Unhandled exception. System.UriFormatException: Invalid URI: The Uri string is too long.
   at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
   at System.Uri..ctor(String uriString)
   at Program.Main() in <<AIRBREATHER_REDACTED>>\Program.cs:line 10

Program.cs

using System;
using static System.Console;

static class Program
{
    static void Main()
    {
        const int TextLength = 65489; // any higher will throw an exception
        string txt = new string('a', TextLength);
        WriteLine(new Uri($"data:text/plain;charset=utf-8,{txt}"));
    }
}

ConsoleApp0.csproj

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp3.1</TargetFramework>
  </PropertyGroup>

</Project>

System Information

>dotnet --info
.NET Core SDK (reflecting any global.json):
 Version:   3.1.101
 Commit:    b377529961

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.18363
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\3.1.101\

Host (useful for support):
  Version: 3.1.1
  Commit:  a1388f194c

.NET Core SDKs installed:
  2.2.402 [C:\Program Files\dotnet\sdk]
  3.1.100 [C:\Program Files\dotnet\sdk]
  3.1.101 [C:\Program Files\dotnet\sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.14 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.2.8 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.14 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.2.8 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.1 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.14 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.2.8 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.1.0 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 3.1.1 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

To install additional .NET Core runtimes or SDKs:
  https://aka.ms/dotnet-download
Wraith2 commented 4 years ago

The System.Uri class seems to reject URI strings greater than a strangely short length (experimentally, 65520 or higher

I'm really not sure you can consider that short.

airbreather commented 4 years ago

The System.Uri class seems to reject URI strings greater than a strangely short length (experimentally, 65520 or higher

I'm really not sure you can consider that short.

Relative to the majority of use cases, I suppose not. "Short" was relative to hard limits of the system, given that this is the recommended type to use for data URIs and generated query strings; i.e., I understand unstable behavior brought about by strings approaching 2 GiB in length, but this is an artificial (edit: undocumented (edit-edit: actually just poorly-documented)) limitation that triggers well before we get to that range.

It's really data URIs that I'm concerned about here, since base64-encoded data streams hit this limit before the payload reaches 48 KiB.

scalablecory commented 4 years ago

I believe we have an issue for this already, but I can't find it.

CC @MihaZupan

MihaZupan commented 4 years ago

This is a documented length limit for Uri

See https://docs.microsoft.com/en-us/dotnet/api/system.uri.-ctor?view=netcore-3.1#System_Uri__ctor_System_String_

One of the conditions listed for throwing UriFormatException:

The length of uriString exceeds 65519 characters.

I agree that we can add a comment about this limit to the main Uri page as rn it is only documented on constructors.

MihaZupan commented 4 years ago

Changing the length limit on Uri is not a trivial change.

If you require more info about specific sections of really long data Uris, I would recommend writing a small parser specific to data uris, as the format seems trivial to parse.

airbreather commented 4 years ago

If you require more info about specific sections of really long data Uris, I would recommend writing a small parser specific to data uris, as the format seems trivial to parse.

FWIW, the issue came up when a user of my library was trying to use a "dumb" record type that happens to contain a URI. I'm not directly using the services that System.Uri adds on top of System.String, other than validating that the string is a valid URI.

My data model uses System.Uri primarily because the following FxCop rules document that it is best-practice to strongly favor System.Uri over System.String for all API members that represent URIs:

  1. CA1054:UriParametersShouldNotBeStrings
  2. CA1055:UriReturnValuesShouldNotBeStrings
  3. CA1056:UriPropertiesShouldNotBeStrings

Those pages all include explanations along the lines of:

A string representation of a URI is prone to parsing and encoding errors, and can lead to security vulnerabilities. The System.Uri class provides these services in a safe and secure manner.

I can, of course, work around this issue by exposing an alternative way to create an instance of the GpxWebLink class that supports pass-through for an overlong URI, but that means exposing the user to all of the bad things that the quoted text above warns about. I really like to incorporate concepts of "fail-fast" and "pit of success" into my API designs, and the workaround is the antithesis of both.

GrabYourPitchforks commented 4 years ago

Is it the intent of your data model that your System.Uri field be able to handle arbitrary file uploads? I realize URLs theoretically can be of arbitrary length, but in practice I imagine the majority of Uri's consumers thinking of it as "what goes into the address bar in the web browser?"

airbreather commented 4 years ago

Is it the intent of your data model that your System.Uri field be able to handle arbitrary file uploads?

Not really, no.

The library defines a .NET object model that matches the XML schema for GPX files. The type in question is the equivalent of linkType in the XML schema.

The idea is that the library can be used for either:

  1. Parsing an existing file into the equivalent representation in the .NET object model, or
  2. Writing the .NET representation of a GPX file to a new file

It's part of the NetTopologySuite project, so this object model representation isn't necessarily the beginning or the end of the chain (there are converters to go between this direct object model translation and the standard NetTopologySuite geometry types), but some people find enough value in just being able to interact with GPX objects more directly.

From what I gather from IsraelHikingMap/Site#1128 and NetTopologySuite/NetTopologySuite.IO.GPX#39:

I realize URLs theoretically can be of arbitrary length, but in practice I imagine the majority of Uri's consumers thinking of it as "what goes into the address bar in the web browser?"

Of course -- I'm sure that's why I can't seem to actually find other widespread complaints about this from the fundamental System.Uri side of things, despite this type being around for so long (though I have found evidence of people running into issues that flow from this same constant, e.g., #14674).

HarelM commented 4 years ago

As explained by @airbreather the site uses NetTopologySuite.IO.GPX to serialize files. We want to store the content of an image inside this GPX file by using a base64 string to represent the image. Creating multiple files is not part of the GPX sceme and opening this file offline should present the image. The solution we decided was to use the URI field to store this base64 data, which as far as I understand is a valid value. Bottom line: URL="data:image/jpg;base64, dksljfhalskjdfh... is our use case.

karelz commented 4 years ago

Duplicate of #14674

karelz commented 4 years ago

Fixed in 5.0 in PR https://github.com/dotnet/corefx/pull/41686 Please try 5.0 and let us know if you still face the problem

MihaZupan commented 4 years ago

@karelz The length limit on the Uri constructor remains the same, that change only adresses the static (Un)Escape* methods since those do not allocate a Uri.

Removing the length limit would be possible, tho a breaking change if someone relies on the ctor for length validation. It would also be a performance regression (larger allocation for the Uri object), as we currently store indexes as ushorts in a packed struct (hence the ~2^16 length limit). Seeing as there isn't much demand for Uris that long I don't think it is worth doing it.

airbreather commented 4 years ago

Duplicate of #14674 Fixed in 5.0 in PR dotnet/corefx#41686 Please try 5.0 and let us know if you still face the problem

@karelz my comment is probably unnecessary after @MihaZupan's, but just to confirm... the problem is still there. I did a shallow clone of d9c448e, then ran build.cmd, then re-ran the repro with the obvious tweaks (netcoreapp5.0, incremented TextLength in Program.cs).

Here's the console output from that run (irrelevant path details collapsed to just (...)):

C:\(...)\dotnet5>dotnet --info
C:\(...)\dotnet5\.dotnet
.NET Core SDK (reflecting any global.json):
 Version:   5.0.100-preview.1.20112.7
 Commit:    db3a77a3f6

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.18363
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\(...)\dotnet5\.dotnet\sdk\5.0.100-preview.1.20112.7\

Host (useful for support):
  Version: 5.0.0-alpha.1.20112.1
  Commit:  673635654c

.NET Core SDKs installed:
  5.0.100-preview.1.20112.7 [C:\(...)\dotnet5\.dotnet\sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.App 5.0.0-preview.1.20111.6 [C:\(...)\dotnet5\.dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 5.0.0-alpha.1.20112.1 [C:\(...)\dotnet5\.dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 5.0.0-preview.1.20112.4 [C:\(...)\dotnet5\.dotnet\shared\Microsoft.WindowsDesktop.App]

To install additional .NET Core runtimes or SDKs:
  https://aka.ms/dotnet-download

C:\(...)\dotnet5>dotnet run --project ..\ConsoleApp0\ConsoleApp0.csproj
C:\(...)\dotnet5\.dotnet
Unhandled exception. System.UriFormatException: Invalid URI: The Uri string is too long.
   at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
   at System.Uri..ctor(String uriString)
   at Program.Main() in C:\(...)\ConsoleApp0\Program.cs:line 10
airbreather commented 4 years ago

Removing the length limit [...] would also be a performance regression (larger allocation for the Uri object), as we currently store indexes as ushorts in a packed struct (hence the ~2^16 length limit).

Sorry in advance for what's probably a naïve analysis, I only took a brief look at the data layout to get a feel for what's behind this, and I had a thought. Also, I'm a sucker for performance-related puzzles.

MoreInfo looks like it's implementing a pattern that I've used before when storing property values that aren't needed in performance-sensitive cases; is that right?

If so, perhaps we could continue to store Offset values directly on the UriInfo for instances that fit within the current limits, but support longer strings by adding some uint properties onto MoreInfo, adding yet another flag to Flags to indicate when this happens?

A sketch of the idea:

[Flags]
private enum Flags : ulong
{
    Zero = 0x00000000,

    /* existing unrelated things ... */
    VeryLongPath = 0x200000000000,
}

private Flags _flags;
private UriInfo _info = null!; // initialized during ctor via helper

private class UriInfo
{
    /* existing unrelated things ... */
    public Offset Offset;
    /* existing unrelated things ... */
    public MoreInfo? MoreInfo;     // Multi-threading: This field must be always accessed through a _local_
                                   // stack copy of m_Info.
};

[StructLayout(LayoutKind.Sequential, Pack = 1)]
private struct Offset
{
    public ushort Scheme;
    public ushort User;
    public ushort Host;
    public ushort PortValue;
    public ushort Path;
    public ushort Query;
    public ushort Fragment;
    public ushort End;
};

private class MoreInfo
{
    /* existing unrelated things ... */
    public uint Offset_Scheme;
    public uint Offset_User;
    public uint Offset_Host;
    public uint Offset_PortValue;
    public uint Offset_Path;
    public uint Offset_Query;
    public uint Offset_Fragment;
    public uint Offset_End;
};

private bool IsVeryLongPath
{
    get { return (_flags & Flags.VeryLongPath) != 0; }
}

private uint Offset_Scheme
{
    // assumption: we allocate MoreInfo as soon as we set the VeryLongPath flag
    get { return IsVeryLongPath ? _info.MoreInfo!.Offset_Scheme : _info.Offset.Scheme; }
    set
    {
        if (IsVeryLongPath)
        {
            _info.MoreInfo!.Offset_Scheme = value;
        }
        else
        {
            _info.Offset.Scheme = checked((ushort)value);
        }
    }
}

private uint Offset_User
{
    // assumption: we allocate MoreInfo as soon as we set the VeryLongPath flag
    get { return IsVeryLongPath ? _info.MoreInfo!.Offset_User : _info.Offset.User; }
    set
    {
        if (IsVeryLongPath)
        {
            _info.MoreInfo!.Offset_User = value;
        }
        else
        {
            _info.Offset.User = checked((ushort)value);
        }
    }
}

/* etc. */
/* and then, of course, change callers that use _info.Offset */

It could even be taken a little bit further, changing Offset to be LayoutKind.Explicit and storing half of the double-wide offsets directly, only going to MoreInfo for the other half of the offset values.

If I'm reading it right, the performance cost becomes:

My intuition is that these penalties are not too significant, given how much memory an instance of System.Uri already takes up.

In case my intuition is wrong, another possible way of doing this without storing extra data per instance would be to continue to store offsets in ushort values the way we do when the values are short enough, but for offsets >65519, store a sentinel value of 65535 indicating that the "real" value lies somewhere beyond. When we need to access such an offset, we would need to re-scan the original string to figure it out.

GrabYourPitchforks commented 4 years ago

Reopened per https://github.com/dotnet/runtime/issues/1857#issuecomment-590012273 since it was incorrectly resolved as a dupe. The NCL team may still want to resolve it as "won't fix" at their discretion, but it's not a dupe.

karelz commented 4 years ago

Sorry, I assumed the static methods are used in constructor as well. I was wrong. Thanks for reopening @GrabYourPitchforks.

Removing the length limit would be possible, tho a breaking change if someone relies on the ctor for length validation. It would also be a performance regression (larger allocation for the Uri object), as we currently store indexes as ushorts in a packed struct (hence the ~2^16 length limit). Seeing as there isn't much demand for Uris that long I don't think it is worth doing it.

I don't mind the breaking change, but the perf impact is undesired given rare demand. I recommend to close it as "Won't fix" -- recommending workaround to use the static methods without the limit. @MihaZupan if you agree (which it seems you are), please resolve it as such.

MihaZupan commented 4 years ago

MoreInfo looks like it's implementing a pattern that I've used before when storing property values that aren't needed in performance-sensitive cases; is that right?

Correct

In case my intuition is wrong, another possible way of doing this without storing extra data per instance would be to continue to store offsets in ushort values the way we do when the values are short enough, but for offsets >65519, store a sentinel value of 65535 indicating that the "real" value lies somewhere beyond. When we need to access such an offset, we would need to re-scan the original string to figure it out.

65535 is also a valid length that could be stored if the input is shorter than 65520 and has characters that would expand it to 65535.


If we were to implement it, it would most likely be done by changing the minimum amount of fields to int (we could likely keep Scheme, User, Host, PortValue and Path as ushort). That would result in a 4-8 byte size increase of Uri, which might be acceptable considering other inefficiencies in Uri.



Considering that data Uris are the only realistic scenario for Uris longer than 65k, I don't think it is worth making such a change.

I understand that in your case you may not be interacting with the created Uri other than for validation / satisfying the object model. But otherwise the utility of Uri for data uris is practically equivalent to

bool IsValid(string uriString) =>
    uriString.StartsWith("data:", StringComparison.OrdinalIgnoreCase);

It provides no utility for accesing the media type/data - those have to be manually parsed out. It also provides no validation for it except for the scheme name.

There are other downsides to removing the length restriction.

So if for some reason we decided to remove the length restriction in the future, a lot of work would have to be done to improve Uri overall first.

Given that this issue is the only ask for relaxing the length restriction that I know of and you already implemented a workaround for your use case, I don't see a real benefit for moving forward here.

Consider as won't-fix until then.

airbreather commented 4 years ago

In my original comment, I wrote this:

I totally understand why an artificial restriction might appear as a result of engineering trade-offs, so perhaps a decent first step might be to document it better?

Should open an issue on dotnet/dotnet-api-docs for that "decent first step" part of it?

As of right now, the "Filing Issues" section of this repo doesn't link over there (I only found out about it when scrolling through and noticing the cross-reference from dotnet/dotnet-api-docs#3906).

MihaZupan commented 4 years ago

I suppose we could extend the scope of dotnet/dotnet-api-docs#3906 to cover this as well

davidsh commented 4 years ago

I suppose we could extend the scope of dotnet/dotnet-api-docs#3906 to cover this as well

Things like this can go into the 'Remarks' section of the relevant API methods/properties of Uri.