RichTextBox Rtf property wrong converting non ascii chars

dotnet / wpf

WPF is a .NET Core UI framework for building Windows desktop applications.

MIT License

7.02k stars 1.16k forks source link

RichTextBox Rtf property wrong converting non ascii chars #4011

Open MickaelBillet opened 3 years ago

MickaelBillet commented 3 years ago

.NET Core Version : 3.1.405
Windows version: Windows 10
Does the bug reproduce also in WPF for .NET Framework 4.8?: No (4.6.1)
Is this bug related specifically to tooling in Visual Studio (e.g. XAML Designer, Code editing, etc...)? No

Problem description: The RichTextBox cannot convert non ascii chars from my rtf string.

[string rtf ="{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2{\fonttbl{\f0\fcharset0 Times New Roman;}{\f2\fcharset0 Arial;}} {\colortbl\red0\green0\blue0;\red255\green255\blue255;}\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang32\fs30\f2\cf0 \cf0\qj\sl15\slmult0{\f2 {\ltrch entête}\li0\ri0\sa0\sb0\fi0\qj\sl15\slmult0\par}}}"

if (rtf.Length > 2) { FlowDocument flowDocument = new FlowDocument { LineHeight = 1, Language = XmlLanguage.GetLanguage(Thread.CurrentThread.CurrentUICulture.Name),
};

 using (MemoryStream stream = new MemoryStream(Encoding.UTF8.GetBytes(rtf)))
 {
TextRange text = new TextRange(flowDocument.ContentStart, flowDocument.ContentEnd);

if (stream.Length != 0)
{
      text.Load(stream, DataFormats.Rtf);
}

    text.ClearAllProperties();
 }

return flowDocument; }](url)

My RichTextbox display "EntÃªte ". Problem with the conversion of "ê" (non ASCII chars)

Actual behavior: My RichTextbox display "EntÃªte ". Problem with the conversion of "ê" (non ASCII chars)

Expected behavior: My RichTextbox display "Entête ". Problem with the conversion of "ê"

Minimal repro: Always

elyoh commented 3 years ago

As far as I understand, a valid rtf sequence must only consist of ASCII characters. The inclusion of a literal ê in the rtf string is therefore invalid. The ê should be escaped. I tried escaping as \'EA and \u234e which both give the correct visual representation on both .NET Framework 4.8 and .NET 5.0 (did not test .NET Core 3.1).

The invalid literal ê in the rtf string is handled differently on .NET 5.0 compared with .NET Framework, being mapped to Ãª rather than ê.

daisyTian0517 commented 3 years ago

I tested with changingentête to Ent\'eate for your rtf in .NET Core 3.1 app, it works well with showing Entête.

MickaelBillet commented 3 years ago

If you use this rtf string rtf ="{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2{\fonttbl{\f0\fcharset0 Times New Roman;}{\f2\fcharset0 Arial;}} {\colortbl\red0\green0\blue0;\red255\green255\blue255;}\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang32\fs30\f2\cf0 \cf0\qj\sl15\slmult0{\f2 {\ltrch entête}\li0\ri0\sa0\sb0\fi0\qj\sl15\slmult0\par}}}" with the framework .NET 4.6.1 it works, but not with .Net Core 3.1.

I know if you change entête to Ent\'eate it works, but I don't know that. In my rtf I have entête

elyoh commented 3 years ago

The stream supplied to TextRange.Load is converted back to a string: https://github.com/dotnet/wpf/blob/d49f8ddb889b5717437d03caa04d7c56819c16aa/src/Microsoft.DotNet.Wpf/src/PresentationFramework/System/Windows/Documents/TextRangeBase.cs#L1608

Then back to a byte array: https://github.com/dotnet/wpf/blob/d49f8ddb889b5717437d03caa04d7c56819c16aa/src/Microsoft.DotNet.Wpf/src/PresentationFramework/System/Windows/Documents/RtfToXamlReader.cs#L7823

However, the behaviour of Encoding.Default is different (link) on .NET Core (always returns UTF-8) compared to .NET Framework (which always returns the system's active code page). Therefore _rtfBytes is different on .NET Core.

I suspect this is probably the reason for the difference in behaviour.

MickaelBillet commented 3 years ago

Many thanks I suppose it's the reason of my bug

miloush commented 3 years ago

Have you tried whether registering the code page provider resolves the issue for you?

https://docs.microsoft.com/en-us/dotnet/api/system.text.codepagesencodingprovider?view=net-5.0

elyoh commented 3 years ago

@miloush I did try that to begin with but it did not have an effect. The identified issue arises from a call to Encoding.Default. This is not affected by registering the code page provider and always returns UTF-8 on .NET Core.

Root cause The issue that is due to the following paths which are eventually called from TextRange.Load.

First, the stream provided to TextRange.Load is converted to bytes here: https://github.com/dotnet/wpf/blob/d49f8ddb889b5717437d03caa04d7c56819c16aa/src/Microsoft.DotNet.Wpf/src/PresentationFramework/System/Windows/Documents/RtfToXamlReader.cs#L7823

On .NET Core this will always use UTF-8 whilst on .NET Framework this will be the default system code page. _rtfBytes will not always match on Core/Framework.

Later, CurrentCulture is used to obtain a code page: https://github.com/dotnet/wpf/blob/d49f8ddb889b5717437d03caa04d7c56819c16aa/src/Microsoft.DotNet.Wpf/src/PresentationFramework/System/Windows/Documents/RtfToXamlLexer.cs#L38-L39

This gives the same result on Framework/Core. Also no need to manually register the code page provider on Core. The Framework call to Encoding.GetCodePage link was replaced with InternalEncoding.GetEncoding which registers the code page provider on demand so as to emulate the Framework behaviour.

This code page is then used to interpret the earlier _rtfBytes. https://github.com/dotnet/wpf/blob/d49f8ddb889b5717437d03caa04d7c56819c16aa/src/Microsoft.DotNet.Wpf/src/PresentationFramework/System/Windows/Documents/RtfToXamlLexer.cs#L132

The bottom line is that Core uses a system code page to interpret UTF-8 bytes which gives the difference in behaviour seen by @MickaelBillet .

miloush commented 3 years ago

To be fair you are giving it an UTF-8 encoded stream that claims to be CP 1252. I would call "EntÃªte" to be the expected output here for .NET FW too, and the bug then is to be ignoring the \ansicpg1252 instruction in the RTF. It seems that the codepage instruction is supposed to be supported by RtfToXamlReader.HandleCodePageTokens, but converting from stream to string in TextRange.Load and back to bytes is rendering it pointless.

elyoh commented 3 years ago

you are giving it an UTF-8 encoded stream

This is actually irrelevant to the issue. This line correctly rountrips the stream back to the original string. https://github.com/dotnet/wpf/blob/d49f8ddb889b5717437d03caa04d7c56819c16aa/src/Microsoft.DotNet.Wpf/src/PresentationFramework/System/Windows/Documents/TextRangeBase.cs#L1608

If instead, the orignal user had registered the code page provider and supplied a CP1252 encoded stream as follows: new MemoryStream(Encoding.GetCodePage(1252).GetBytes(rtf)), the result would have been the same. The only time this would cause issues is if the string contained characters that could not be represented in the user's chosen code page, which does not apply in this case.

the bug then is to be ignoring the \ansicpg1252 instruction in the RTF

I don't believe this to be the issue either. RtfToXamlReader.HandleCodePageTokens updates the CodePage property in the RtfToXamlLexer which in turn sets the CurrentEncoding property. Therefore the \ansicpg1252 instruction is used to set the code page correctly. If no code page instructions were present in the file, RtfToXamlLexer would default to CultureInfo.CurrentCulture.TextInfo.ANSICodePage as specified in the constructor. https://github.com/dotnet/wpf/blob/d49f8ddb889b5717437d03caa04d7c56819c16aa/src/Microsoft.DotNet.Wpf/src/PresentationFramework/System/Windows/Documents/RtfToXamlLexer.cs#L297-L307

I would call "EntÃªte" to be the expected output here for .NET FW too

It really depends how the data was represented in the original rft file used to make the invalid rtf string in the OP. If the original rtf file were CP1252 bytes this unescaped ê would have been encoded 0xEA whilst if it were UTF-8 bytes, it would have been encoded 0xC3 0xAA. Framework will map both cases to ê on a most systems using with code page 1252 as default. Core will maps both cases to Ãª.

I doubt that a real Unicode aware rtf writer would output ê as 0xC3 0xAA. For a start, it would be a double byte UTF-8 codepoint in a file which is meant to only contain 7-bit ASCII. That would be a pretty serious bug.

However it is not uncommon to see unescaped ANSI code points in the range 0x80-0xFF in output from older rtf writers.

Whether by accident or design, Framework gives the correct visual representation for the more common case, despite the invalid input. It seems intentional. Otherwise the rtf lexer could simply reject bytes which are illegal in an rtf stream in the first place.

miloush commented 3 years ago

I was trying to suggest that what you see as "correct" is a either a lucky coincidence or lack of clarity on the input. It would help to either attach the rtf file, or a repro project, because currently it looks like you are reading it from a hard-coded string in a C# file and the compiler turns that into a Unicode string.

So if your code has string rtf = "ê"; and it is saved as CP1252 (i.e. 0xEA) then at runtime, the rtf string will be a proper Unicode "ê". Your Encoding.UTF8.GetBytes(rtf) turns it into 0xC3 0xAA. That already ruined the RTF data.

Then, the TextRangeBase.Load uses StreamReader to read the array, which detects the encoding as UTF-8. and converts the supplied stream into a string. Even if you passed 0xEA, the StreamReader would use the default codepage and turn it into the same Unicode string, even though these two should clearly be different RTFs.

Anyway, then the string is converted to bytes again as you pointed out, using Encoding.Default. This works for you on .NET Framework because coincidentally, the codepage declared in RTF (\ansicpg1252) matches Encoding.Default for you. It does not on .NET Core. But it is also broken in .NET Framework for any RTF that declares different codepage.

Now the RtfToXamlReader.HandleCodePageTokens wants to set the encoding to be used when interpreting the bytes. But that is no longer valid, the \\ansicpg1252 instruction or similar now does not reflect the reality, because we already used Encoding.Default above to get the bytes.

I think the fix here should be allowing you to pass a byte array directly to the RtfToXamlLexer (or StreamReader that would be read as bytes, not string). If you are allowed to pass a RTF as a string, then any encoding instructions in the RTF should be ignored with you being responsible for producing a correct string in the first place, and even then it can only work if a single encoding is used throughout the whole RTF.

Or am I missing something?

elyoh commented 3 years ago

Simple repo attached with RTF files. RtfTextNet5 [05-02-2021 18-53].zip RtfTest [05-02-2021 18-36].zip

Consider this example on Framework:

var flowDocument = new FlowDocument();
var text = new TextRange(flowDocument.ContentStart, flowDocument.ContentEnd);

// Step 1 - Read the original rtf file using the default encoding
var rtfString = File.ReadAllText("invalid.rtf", Encoding.Default);

// Step 2 - Convert to UTF-8 bytes
var rtfBytes = Encoding.UTF8.GetBytes(rtfString);

using (var stream = new MemoryStream(rtfBytes))
{
     text.Load(stream, DataFormats.Rtf);
}

Step 2 ensures that TextRange.Load use of StreamReader reverts the users call to Encoding.UTF8.GetBytes to give the same Unicode string generated in Step 1. Step 1 ensures that RtfToXamlReader use of Encoding.Default.GetBytes gives back the exact same bytes read from the original RTF file.

The result is that on Framework, RtfToXamlLexer always gets the bytes of the original RTF file when steps 1 and 2 are followed.

RtfToXamlLexer will then interpret bytes 0x80 to 0xFF according to the current codepage token applied to the section of the RTF file. In other words, it treats these bytes as though they were correctly escaped as \'xx hex codes. That is highly desirable behaviour and it is consistent with they way both Word and Wordpad handle these invalid files.

I think the fix here should be allowing you to pass a byte array directly to the RtfToXamlLexer

Its constructor already takes a byte array, but in the case of malformed RTF, the bytes that end up being passed here do not always match the original file. In Framework, it is possible to make that happen, but this is no longer possible in Core.

If you are allowed to pass a RTF as a string, then any encoding instructions in the RTF should be ignored with you being responsible for producing a correct string in the first place, and even then it can only work if a single encoding is used throughout the whole RTF.

This does not seem right. All the user should be doing is ensuring that the string contains only a valid subset of ASCII characters. Any escaped non-ASCII characters still need to be interpreted according to the codepage tokens applied to the section containing them.

In the following well formed RTF string, \'ea has different meanings. For the first occurence, it represents ê in CP1252 and for the second, к in CP1251. Repo includes this example.

{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}{\f1\fnil\fcharset204 Calibri;}}
{\*\generator Riched20 10.0.19041}\viewkind4\uc1 
\pard\sa200\sl276\slmult1\f0\fs22\lang9\'ea in codepage 1252\par
\f1\lang1049\'ea\f0\lang2057  in codepage 1251\lang9\par
}