dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.18k stars 4.72k forks source link

Uri seems to incorrectly escape local paths containing certain Unicode characters #89538

Open tmat opened 1 year ago

tmat commented 1 year ago

Description

There is a difference in the value of LocalPath property depending on whether the Uri instance is created with an absolute local path vs. URL with file scheme. For example, character U+E25B is escaped in the former and not in the latter case.

Reproduction Steps

Console.WriteLine(new Uri("c:\\\ue25b").LocalPath);
Console.WriteLine(new Uri("file://c:/\ue25b").LocalPath);

Expected behavior

c:\
c:\

Actual behavior

c:\%EE%89%9B
c:\

Regression?

No

Known Workarounds

 public static Uri CreateAbsoluteUri(string absolutePath)
 {
        return new Uri(IsAscii(absolutePath) ? absolutePath : GetAbsoluteUriString(), UriKind.Absolute);

        static string GetAbsoluteUriString()
        {
            var parts = absolutePath.Split([Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar]);

            if (Path.DirectorySeparatorChar == '/')
            {
                // Unix path: first part is empty, all parts should be escaped
                return "file://" + string.Join("/", parts.Select(EscapeUriPart));
            }

            if (parts is ["", "", var serverName, ..])
            {
                // UNC path: first non-empty part is server name and shouldn't be escaped
                return "file://" + serverName + "/" + string.Join("/", parts.Skip(3).Select(EscapeUriPart));
            }

            // Drive-rooted path: first part is "C:" and shouldn't be escaped
            return "file:///" + parts[0] + "/" + string.Join("/", parts.Skip(1).Select(EscapeUriPart));

            static string EscapeUriPart(string stringToEscape)
                => Uri.EscapeUriString(stringToEscape).Replace("#", "%23");
        }
}

Configuration

8.0.100-preview.6.23330.14 but also on .NET Framework.

Other information

No response

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

Issue Details
### Description There is a difference in the value of `LocalPath` property depending on whether the `Uri` instance is created with an absolute local path vs. URL with `file` scheme. For example, character U+E25B is escaped in the former and not in the latter case. ### Reproduction Steps ``` Console.WriteLine(new Uri("c:\\\ue25b").LocalPath); Console.WriteLine(new Uri("file://c:/\ue25b").LocalPath); ``` ### Expected behavior ``` c:\ c:\ ``` ### Actual behavior ``` c:\%EE%89%9B c:\ ``` ### Regression? No ### Known Workarounds _No response_ ### Configuration 8.0.100-preview.6.23330.14 but also on .NET Framework. ### Other information _No response_
Author: tmat
Assignees: -
Labels: `area-System.Net`, `untriaged`
Milestone: -
poizan42 commented 1 year ago

FWIW this seems to be the opposite of the problem reported at #23738

karelz commented 1 year ago

@poizan42 this is specifically about characters in PUA ranges in Unicode. The bug you referred was about encoding the character % (see https://github.com/dotnet/runtime/issues/23738#issuecomment-334546994), so not quite related.

karelz commented 1 year ago

@tmat can you please share details how you ran into it? (we have seen 2 similar internal reports recently)

karelz commented 1 year ago

Triage: The behavior here is inconsistent between implicit path and "file://" path. We should unify it. Let's try to do it opportunistically in 8.0 as it might impact GB18030 compliance (although we are still investigating if it is must have or not for being compliant).

tmat commented 1 year ago

This issue in .NET Framework is now blocking VS GB18030 compliance.

karelz commented 1 year ago

@MihaZupan I updated known workaround in top post based on offline discussion with @tmat. Can you please check it and comment on it?

tmat commented 1 year ago

I have updated the workaround to make it work for UNC paths as well.

tmat commented 1 year ago

Note that the workaround needs to use Uri.EscapeUriString, which is obsolete :(. It does not work correctly when using Uri.EscapeDataString. Although, Uri.EscapeUriString does not entirely work either. it does not escape # so URIs created using the workaround will have different AbsoluteUri:

> new Uri("file://C:/" + Uri.EscapeDataString("!$&'()+,-;=@[]_~#")).AbsoluteUri
"file:///C:/%21%24%26%27%28%29%2B%2C-%3B%3D%40%5B%5D_~%23"
> new Uri("file://C:/" + Uri.EscapeUriString("!$&'()+,-;=@[]_~#")).AbsoluteUri
"file:///C:/!$&'()+,-;=@[]_~#"
> new Uri("C:\\!$&'()+,-;=@[]_~#").AbsoluteUri
"file:///C:/!$&'()+,-;=@[]_~%23"

There is also System.Net.WebUtility.UrlEncode that escapes yet another set of characters.

> new Uri("file://C:/" + System.Net.WebUtility.UrlEncode("!$&'()+,-;=@[]_~#")).AbsoluteUri
"file:///C:/!%24%26%27()%2B%2C-%3B%3D%40%5B%5D_~%23"

Perhaps the best way would be to implment a new Uri API that follows the standard and does not have back-compat quirks. We need a .NET implementation that matches TypeScript implementation: https://github.com/Microsoft/vscode-uri, i.e. URI created in one can be parsed an intepreted in the other and vice versa.

MihaZupan commented 1 year ago

A few more interesting examples:

Console.WriteLine(new Uri("C:\\A%20C").LocalPath);   // C:\A%20C -- note how %20 remained escaped
Console.WriteLine(new Uri("C:\\A%20%43").LocalPath); // C:\A C   -- note how %20 was unescaped
Console.WriteLine(new Uri("C:\\A%42C").LocalPath);   // C:\ABC

I'm afraid that the underlying issue here is similar to #1487, #79198, or #72632. When encountering a non-reserved character that was already escaped (%42 or %43 above), or any non-ASCII character (even if escaped), Uri will recreate the input string internally. This can lead to inconsistencies down the road (like how %20 may or may not be escaped in the above example).

Your workaround seems to be doing a good job of fighting this internal (un)escaping and re-(un)escaping. If you just want to workaround the issue with LocalPath, then EscapeDataString would be more appropriate. But if you also care about AbsoluteUri, I don't think you'll find a perfect workaround (outside of fixing the underlying issue in Uri).

var i = (Path.DirectorySeparatorChar == '/') || !filePath.StartsWith(@"\\") ? 0 : 2;

To better match how Uri does UNC detection, this could be

filePath.Length >= 2 && filePath[0] is '/' or '\\' && filePath[1] is '/' or '\\'

(we'll match it on non-Windows as well, with any combination of / or \)

karelz commented 1 year ago

Triage: Given that it is (again and not surprisingly) more complicated than we thought, we should not risk regressing even more 8.0. Let's try to decide in next release what to do about it.

tmat commented 1 year ago

@MihaZupan I ended up writing custom escaping logic to work around issues with EscapeDataString, EscapeUriString.

The examples with unescaping %20 above are not an issue for us. The input is either an absolute local path (% is an ordinary character in local path, not an escape), or a well-formed URL string (all characters that should be escaped are already escaped).

We preprocess the input string only if it is an absolute local path. The preprocessing makes sure the input to Uri constructor is a well-formed URL. If the input is URL string already, we pass it to Uri constructor as is.

Re UNC: yes, I updated the logic to check for both slashes.

See https://github.com/dotnet/roslyn/blob/25c959b72b9ef9f432b952c4579015aad283cd5a/src/Features/LanguageServer/Protocol/Extensions/ProtocolConversions.cs#L169 for full implementation of the workaround.

I tested this impl against VS Code and it works well.

tmat commented 10 months ago

GetComponents(UriComponents.Path, UriFormat.SafeUnescaped) is also misbehaving for PUA characters, imo:

> new Uri("https://a.com/𫚭").GetComponents(UriComponents.Path, UriFormat.SafeUnescaped)
"𫚭%EE%89%9B"

I'd expect the result to be 𫚭.

Eilon commented 4 months ago

@tmat / @karelz / @MihaZupan - FYI I ran into this as well during a recent Chinese GB18030 testing pass. A tester was using a string containing certain characters (that happen to become part of a CSPROJ file name). Then XDocument.Load() and related APIs are used to read that CSPROJ, and that's when I noticed this issue.