dotnet / winforms

Windows Forms is a .NET UI framework for building Windows desktop applications.
MIT License

Should always use Unicode in RichTextBox #3101

Open JeremyKuhne opened 4 years ago

JeremyKuhne commented 4 years ago

RichTextBox uses the latest RichEdit and as such should be able to remove the code paths where we're grabbing Encoding and converting back and forth from Unicode to ASCII. If we're using Unicode everywhere we can be more efficient and avoid conversion headaches.

See #3032 for an example of where this happens.

weltkante commented 4 years ago

oops missed this, copying my comment from the PR:

Ultimately we should remove all non-Unicode code paths in RichTextBox. This should be possible as we're force loading the latest RichEdit.

I have been wondering if that's possible, the docs look like it isn't (they say SF_UNICODE should only be combined with SF_TEXT, but loading RTF requires using SF_RTF). I'm not entirely sure whether RTF is a binary format or a text format. If it's only a binary format that happens to support roundtrip via text, this means the encoding is part of the header (WordPad certainly saves an encoding in the RTF header but I don't know what it is used for). Since the RTF is user provided you can't change the header without touching the rest of the data, and I don't think WinForms wants to parse and re-encode the RTF stream. I'd be happy to be mistaken though, in case RTF is not only a binary format but also a text format (i.e. its code points are not restricted to byte streams and it's just badly documented).

JeremyKuhne commented 4 years ago

I have been wondering if that's possible

I'm not positive, the RichEdit code isn't particularly easy to follow. :/ I think you can use UTF-8 on EM_STREAMIN like you do for EM_STREAMOUT using SF_USECODEPAGE, but I haven't tested to validate that. The RTF instructions themselves look like they have to be 1 byte ASCII.

JeremyKuhne commented 4 years ago

@weltkante I gave it a shot and it appears to work with UTF-8. Changing EM_STREAMIN and EM_STREAMOUT to use SF_USECODEPAGE with CP_UTF8 (shifted per the docs) allowed me to round trip with expected results. I set the control's Text to どうもありがとうミスターロボット, got the following Rtf out, and put it back in with no problems:

{\urtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset128 Yu Gothic;}{\f1\fnil Segoe UI;}}
{\*\generator Riched20 10.0.18362}\viewkind4\uc1 
\pard\f0\fs18 どうもありがとうミスターロボット\f1\par
}

Text showed the right results after setting the rtf back to Rtf.
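
For reference, the "shifted per the docs" flag value can be worked out by hand. The sketch below only does the arithmetic for the wParam passed to EM_STREAMIN / EM_STREAMOUT; the constant values are the ones from the Windows SDK headers, while `stream_flags` is a hypothetical helper name and not part of WinForms or RichEdit.

```python
# Values as defined in the Windows SDK headers (richedit.h / winnls.h).
SF_RTF = 0x0002          # stream RTF rather than plain text
SF_USECODEPAGE = 0x0020  # the high word of wParam carries a code page
CP_UTF8 = 65001          # code page identifier for UTF-8


def stream_flags(base_format: int, code_page: int) -> int:
    """Hypothetical helper: put the code page in the high word of wParam."""
    return (code_page << 16) | SF_USECODEPAGE | base_format


UTF8_RTF = stream_flags(SF_RTF, CP_UTF8)
print(hex(UTF8_RTF))  # 0xfde90022
```

The low word holds the format flags (SF_RTF | SF_USECODEPAGE) and the high word holds 65001, so the same value can be used for both streaming directions.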

I'd say it is an easy call to change the input to always push in as UTF-8. Output isn't so obvious as it would likely break interop scenarios (other code wouldn't be able to read the output). We might want to consider adding another getter, perhaps string Utf8Rtf { get; }.
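
As background for the interop concern: conventional RTF is 7-bit and represents non-ASCII text with `\uN` escapes (a signed 16-bit decimal UTF-16 code unit followed by a fallback character) instead of raw UTF-8 bytes. A minimal sketch of that escaping, assuming the default of one fallback character and ignoring `\'hh` code page escapes; `rtf_escape` is a hypothetical name for illustration, not something RichEdit or WinForms exposes.

```python
def rtf_escape(text: str) -> str:
    """Hypothetical helper: escape text into 7-bit RTF using \\uN escapes."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)  # RTF syntax characters must be escaped
        elif ord(ch) < 128:
            out.append(ch)         # plain ASCII passes through unchanged
        else:
            # Emit one \uN per UTF-16 code unit; N is a signed 16-bit decimal.
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536
                out.append(f"\\u{unit}?")  # '?' is the fallback character
    return "".join(out)


print(rtf_escape("どうも"))  # \u12393?\u12358?\u12418?
```

A reader that only understands this form would see the raw UTF-8 bytes in the output above as code page characters, which is the breakage being described.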

I haven't tried UTF-16, but I expect it not to work as my cursory read of the RichText code seems to indicate it depends on ASCII bytes for control (which UTF-8 gives you for code points < 128).
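
The byte-level difference behind that expectation can be checked without RichEdit at all. This is only an illustration of the two encodings, assuming a parser that scans the stream byte by byte for ASCII control words; it says nothing about what RichEdit actually does with UTF-16 input.

```python
control = "{\\rtf1\\ansi "
text = "どうもありがとうミスターロボット"

utf8 = (control + text).encode("utf-8")
utf16 = (control + text).encode("utf-16-le")

# UTF-8: the control words are byte-for-byte identical to their ASCII form.
ascii_prefix_intact = utf8.startswith(control.encode("ascii"))

# UTF-8: every byte of a non-ASCII character has the high bit set, so the
# text can never be mistaken for '\', '{' or '}'.
no_collisions = all(b >= 0x80 for b in text.encode("utf-8"))

# UTF-16: each ASCII character is followed by a zero byte, so the control
# words no longer appear as contiguous ASCII.
utf16_has_nulls = b"\x00" in utf16
utf16_prefix_intact = utf16.startswith(control.encode("ascii"))

# UTF-16: some non-ASCII characters contain ASCII-range bytes; U+305C
# encodes as 5C 30, and 0x5C is the backslash.
utf16_fake_backslash = b"\\" in "\u305c".encode("utf-16-le")

print(ascii_prefix_intact, no_collisions)
print(utf16_has_nulls, utf16_prefix_intact, utf16_fake_backslash)
```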