dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.95k stars 4.65k forks source link

Xslt document function fails when path Contains Private Area Unicode Characters #59353

Open benvillalobos opened 2 years ago

benvillalobos commented 2 years ago

Description

Original issue: https://github.com/dotnet/msbuild/issues/6847 MSBuild's PR: https://github.com/dotnet/msbuild/pull/6863

Initializing an XmlReader using XmlReader.Create(<path-to-file>) where the path contains characters from the unicode private use area causes issues at file load. Part of the path that's the culprit: Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵U1[]U2[]U3[]

The workaround is to call XmlReader.Create using a StreamReader. However, when MSBuild is calling its XslTransform task on a file in this type of path, and in the xsl file it calls the Document function on a file that also exists in a path like this, it will break. See details here: https://github.com/dotnet/msbuild/pull/6863#issuecomment-922089397

This comment explains the issue. You hit an exception when calling the document function in an xslt file when that xslt file is under a path that contains private area unicode characters.

Xslt file example:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet version="1.0"

xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">

<xsl:value-of select="document('b.xml')"/>
</xsl:template>

</xsl:stylesheet>

Place that next to a b.xml in a path that contains Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵U1[]U2[]U3[].

Configuration

Full framework & net core 6.0.100-preview.7.21379.14

Regression?

No

Other information

See original issue / MSBuild PR for in-depth notes.

ghost commented 2 years ago

Tagging subscribers to this area: @buyaa-n, @krwq See info in area-owners.md if you want to be subscribed.

Issue Details
### Description Original issue: https://github.com/dotnet/msbuild/issues/6847 MSBuild's PR: https://github.com/dotnet/msbuild/pull/6863 Initializing an XmlReader using `XmlReader.Create()` where the path contains characters from the unicode private use area causes issues at file load. Part of the path that's the culprit: `Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵U1[]U2[]U3[]` The workaround is to call XmlReader.Create using a StreamReader. However, when MSBuild is calling its `XslTransform` task on a file in this type of path, and in the xsl file it calls the `Document` function on a file that also exists in a path like this, it will break. See details here: https://github.com/dotnet/msbuild/pull/6863#issuecomment-922089397 ### Configuration Full framework & net core 6.0.100-preview.7.21379.14 ### Regression? No ### Other information See original issue / MSBuild PR for in-depth notes.
Author: BenVillalobos
Assignees: -
Labels: `area-System.Xml`, `untriaged`
Milestone: -
krwq commented 2 years ago

@BenVillalobos can you provide a minimal repro? Ideally something which creates a file on the fly and uses \uXYZV notation for non-english characters

string path = @"Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵.xml";
string content = @"<root></root>";
File.WriteAllText(path, content);
using XmlReader reader = XmlReader.Create(path);

and getting no exception.

What's the exception you're getting?

benvillalobos commented 2 years ago

You'll need to include U1[]U2[]U3[] as part of the path. Adding that and using your code in an empty console proj caused the repro for me.

krwq commented 2 years ago

@BenVillalobos what is the exception you're seeing?

krwq commented 2 years ago

from the e-mail conversation this might be blocking msbuild team so changing milestone

benvillalobos commented 2 years ago

OH, apologies. The title for this issue was incorrect. See this comment for details: https://github.com/dotnet/msbuild/pull/6863#issuecomment-922089397

Exception gets thrown on this line https://github.com/dotnet/msbuild/blob/main/src/Tasks/XslTransformation.cs#L176

In summary, when transforming an xslt doc who's path contains the unicode mess (Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵U1[]U2[]U3[]), you'll hit an XslTransformException if that xslt calls the document function to find another xml file next to it.

benvillalobos commented 2 years ago

Also this isn't a blocking issue. We worked around one aspect of the issue which was loading the xslt doc in the first place. But we also see the error when trying to load another document while transforming the first.

krwq commented 2 years ago

@BenVillalobos any chance you could create a project with a tiny repro?

benvillalobos commented 2 years ago

proj1.zip

Steps to reproduce: Download zip, create a folder next to proj1.csproj called Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵U1[]U2[]U3[]. Drag transform.xslt and b.xml into it.

Note: If you have the latest and greatest VS, it probably already has this change. So msbuild <pathtoproj1>.csproj on a dev command prompt should suffice. If not, do the steps below:

git clone https://github.com/dotnet/msbuild
cd msbuild
build.cmd /p:CreateBootstrap=true
cd artifacts\bin\bootstrap\net472\MSBuild\Current\Bin
msbuild.exe <pathtoproj1.csproj>

You should get the error error MSB3703: Unable to execute transformation. An error occurred while loading document 'b.xml'. See InnerException for a complete description of the error.. After you've confirmed the repro, debug into it to see the inner exception.

Open up MSBuild.Dev.slnf in VS.

On same command line: set MSBUILDDEBUGONSTART=1 msbuild.exe

Place a breakpoint in the line I mentioned the exception gets thrown (2 comments above)

bgrainger commented 2 years ago

IMO this seems to be caused by unexpected behaviour in System.Uri: specifically, a PUA character is not roundtripped through a file:/// URI for new Uri(x).LocalPath.

This test program tests the Uri constructor with various characters that have different UTF-8 encoded lengths; the last is a PUA character:

foreach (var name in new[] { "a", "\u00A1", "\u0800", "\U00010000", "\uE000" })
{
    var filePath = $@"C:\Temp\{name}.txt";
    var uri = new Uri(filePath);
    Console.WriteLine("{0}\t{1}\t{2}\t{3}", uri.LocalPath == filePath, filePath, uri.AbsoluteUri, uri.LocalPath);
}

The output shows that the file path does round-trip for everything except the PUA character, which is double-encoded in the URI.

True   C:\Temp\a.txt  file:///C:/Temp/a.txt  C:\Temp\a.txt
True   C:\Temp\¡.txt  file:///C:/Temp/%C2%A1.txt  C:\Temp\¡.txt
True   C:\Temp\ࠀ.txt  file:///C:/Temp/%E0%A0%80.txt  C:\Temp\ࠀ.txt
True   C:\Temp\𐀀.txt  file:///C:/Temp/%F0%90%80%80.txt  C:\Temp\𐀀.txt
False  C:\Temp\.txt  file:///C:/Temp/%25EE%2580%2580.txt  C:\Temp\%EE%80%80.txt

I haven't consulted the URI RFC to determine whether this would be considered correct. You don't get the same failure to round-trip for $@"https://example.com/{name}.txt" so I suspect not.

If System.Uri can't be changed in this instance (for back-compat concerns), then XmlUriResolver and/or XmlDownloadManager might need to work around the double-encoding?

bgrainger commented 2 years ago

I wonder if System.Uri parsing for file paths goes through the IRI code path, which explicitly excludes PUA characters (as per https://datatracker.ietf.org/doc/html/rfc3987#section-2.2): https://github.com/dotnet/runtime/blob/c00b06826ca4d333ef69d51e523ef7bd309b8631/src/libraries/System.Private.Uri/src/System/IriHelper.cs#L16-L22

A fix would be to change this line of code as follows: https://github.com/dotnet/runtime/blob/c00b06826ca4d333ef69d51e523ef7bd309b8631/src/libraries/System.Private.Xml/src/System/Xml/XmlResolver.cs#L43

                        uri = new Uri(Path.GetFullPath(relativeUri!), new UriCreationOptions { DangerousDisablePathAndQueryCanonicalization = true });

However, I don't know the security implications of this, given the property name.

krwq commented 2 years ago

I guess, since you can pass your own XmlResolver possibly you can just derive and apply workaround in the new type. We should still figure out if this is XML or Uri issue. The Dangerous* setting doesn't sound like a good idea but I'm not sure what are the exact implications of that or why it was created.

@karelz do we have any Uri experts who could help with the questions above?