Open tmat opened 1 year ago
Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
Author: | tmat |
---|---|
Assignees: | - |
Labels: | `area-System.Net`, `untriaged` |
Milestone: | - |
FWIW this seems to be the opposite of the problem reported at #23738
@poizan42 this is specifically about characters in PUA ranges in Unicode.
The bug you referred was about encoding the character %
(see https://github.com/dotnet/runtime/issues/23738#issuecomment-334546994), so not quite related.
@tmat can you please share details how you ran into it? (we have seen 2 similar internal reports recently)
Triage: The behavior here is inconsistent between implicit path and "file://" path. We should unify it. Let's try to do it opportunistically in 8.0 as it might impact GB18030 compliance (although we are still investigating if it is must have or not for being compliant).
This issue in .NET Framework is now blocking VS GB18030 compliance.
@MihaZupan I updated known workaround in top post based on offline discussion with @tmat. Can you please check it and comment on it?
I have updated the workaround to make it work for UNC paths as well.
Note that the workaround needs to use Uri.EscapeUriString
, which is obsolete :(. It does not work correctly when using Uri.EscapeDataString
. Although, Uri.EscapeUriString
does not entirely work either. it does not escape #
so URIs created using the workaround will have different AbsoluteUri
:
> new Uri("file://C:/" + Uri.EscapeDataString("!$&'()+,-;=@[]_~#")).AbsoluteUri
"file:///C:/%21%24%26%27%28%29%2B%2C-%3B%3D%40%5B%5D_~%23"
> new Uri("file://C:/" + Uri.EscapeUriString("!$&'()+,-;=@[]_~#")).AbsoluteUri
"file:///C:/!$&'()+,-;=@[]_~#"
> new Uri("C:\\!$&'()+,-;=@[]_~#").AbsoluteUri
"file:///C:/!$&'()+,-;=@[]_~%23"
There is also System.Net.WebUtility.UrlEncode
that escapes yet another set of characters.
> new Uri("file://C:/" + System.Net.WebUtility.UrlEncode("!$&'()+,-;=@[]_~#")).AbsoluteUri
"file:///C:/!%24%26%27()%2B%2C-%3B%3D%40%5B%5D_~%23"
Perhaps the best way would be to implment a new Uri
API that follows the standard and does not have back-compat quirks.
We need a .NET implementation that matches TypeScript implementation: https://github.com/Microsoft/vscode-uri, i.e. URI created in one can be parsed an intepreted in the other and vice versa.
A few more interesting examples:
Console.WriteLine(new Uri("C:\\A%20C").LocalPath); // C:\A%20C -- note how %20 remained escaped
Console.WriteLine(new Uri("C:\\A%20%43").LocalPath); // C:\A C -- note how %20 was unescaped
Console.WriteLine(new Uri("C:\\A%42C").LocalPath); // C:\ABC
I'm afraid that the underlying issue here is similar to #1487, #79198, or #72632.
When encountering a non-reserved character that was already escaped (%42
or %43
above), or any non-ASCII character (even if escaped), Uri will recreate the input string internally.
This can lead to inconsistencies down the road (like how %20
may or may not be escaped in the above example).
Your workaround seems to be doing a good job of fighting this internal (un)escaping and re-(un)escaping.
If you just want to workaround the issue with LocalPath
, then EscapeDataString
would be more appropriate.
But if you also care about AbsoluteUri
, I don't think you'll find a perfect workaround (outside of fixing the underlying issue in Uri).
var i = (Path.DirectorySeparatorChar == '/') || !filePath.StartsWith(@"\\") ? 0 : 2;
To better match how Uri does UNC detection, this could be
filePath.Length >= 2 && filePath[0] is '/' or '\\' && filePath[1] is '/' or '\\'
(we'll match it on non-Windows as well, with any combination of /
or \
)
Triage: Given that it is (again and not surprisingly) more complicated than we thought, we should not risk regressing even more 8.0. Let's try to decide in next release what to do about it.
@MihaZupan I ended up writing custom escaping logic to work around issues with EscapeDataString
, EscapeUriString
.
The examples with unescaping %20
above are not an issue for us. The input is either an absolute local path (%
is an ordinary character in local path, not an escape), or a well-formed URL string (all characters that should be escaped are already escaped).
We preprocess the input string only if it is an absolute local path. The preprocessing makes sure the input to Uri
constructor is a well-formed URL. If the input is URL string already, we pass it to Uri
constructor as is.
Re UNC: yes, I updated the logic to check for both slashes.
See https://github.com/dotnet/roslyn/blob/25c959b72b9ef9f432b952c4579015aad283cd5a/src/Features/LanguageServer/Protocol/Extensions/ProtocolConversions.cs#L169 for full implementation of the workaround.
I tested this impl against VS Code and it works well.
GetComponents(UriComponents.Path, UriFormat.SafeUnescaped)
is also misbehaving for PUA characters, imo:
> new Uri("https://a.com/𫚭").GetComponents(UriComponents.Path, UriFormat.SafeUnescaped)
"𫚭%EE%89%9B"
I'd expect the result to be 𫚭
.
@tmat / @karelz / @MihaZupan - FYI I ran into this as well during a recent Chinese GB18030 testing pass. A tester was using a string containing certain characters (that happen to become part of a CSPROJ file name). Then XDocument.Load()
and related APIs are used to read that CSPROJ, and that's when I noticed this issue.
Description
There is a difference in the value of
LocalPath
property depending on whether theUri
instance is created with an absolute local path vs. URL withfile
scheme. For example, character U+E25B is escaped in the former and not in the latter case.Reproduction Steps
Expected behavior
Actual behavior
Regression?
No
Known Workarounds
Configuration
8.0.100-preview.6.23330.14 but also on .NET Framework.
Other information
No response