Open JeremyKuhne opened 4 years ago
oops missed this, copying my comment from the PR:
Ultimately we should remove all non-Unicode code paths in RichTextBox. This should be possible as we're force loading the latest RichEdit.
I have been wondering if that's possible; the docs suggest it isn't (they say SF_UNICODE should only be combined with SF_TEXT, but loading RTF requires SF_RTF). I'm not entirely sure whether RTF is a binary format or a text format. If it's only a binary format that happens to support round-tripping via text, then the encoding is part of the header (WordPad certainly saves an encoding in the RTF header, but I don't know what it is used for). Since the RTF is user-provided, you can't change the header without touching the rest of the data, and I don't think WinForms wants to parse and re-encode the RTF stream. I'd be happy to be mistaken, though, in case RTF is not only a binary format but also a text format (i.e. its code points are not restricted to byte streams and it's just badly documented).
I have been wondering if that's possible
I'm not positive; the RichEdit code isn't particularly easy to follow. :/ I think you can use UTF-8 on EM_STREAMIN, as you do for EM_STREAMOUT, using SF_USECODEPAGE, but I haven't tested that. The RTF instructions themselves look like they have to be 1-byte ASCII.
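For context on the "RTF instructions have to be ASCII" point: as far as I understand the RTF specification, non-ASCII text can also be carried in a pure 7-bit stream through \uN escapes, where N is a signed 16-bit decimal UTF-16 code unit followed by a fallback character (skipped by Unicode-aware readers when \uc1 is in effect). A minimal Python sketch of that escaping follows. The helper name rtf_escape is hypothetical, and this is not what RichEdit or WinForms does internally; it ignores line breaks and control characters.

```python
def rtf_escape(text: str) -> str:
    """Escape text into 7-bit RTF body text using \\uN escapes (sketch only).

    Assumes \\uc1 is in effect, so each \\uN is followed by exactly one
    fallback character ('?'). Hypothetical helper, not part of any library.
    """
    out = []
    for ch in text:
        if ch in "\\{}":
            # RTF's own syntax characters must be backslash-escaped.
            out.append("\\" + ch)
        elif ord(ch) < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(ch)
        else:
            # Emit one \uN per UTF-16 code unit (two for surrogate pairs).
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                unit = int.from_bytes(units[i:i + 2], "little")
                # N is written as a signed 16-bit decimal value.
                signed = unit - 0x10000 if unit >= 0x8000 else unit
                out.append(f"\\u{signed}?")
    return "".join(out)


print(rtf_escape("どうも"))
```

The escaped form is plain ASCII, which is why a header code page such as \ansicpg1252 does not by itself prevent Unicode text from round-tripping.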
@weltkante I gave it a shot and it appears to work with UTF-8. Changing EM_STREAMIN and EM_STREAMOUT to use SF_USECODEPAGE with CP_UTF8 (shifted per the docs) allowed me to round-trip with the expected results. I set the control's Text to どうもありがとうミスターロボット, got the following Rtf out, and put it back in with no problems:
{\urtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset128 Yu Gothic;}{\f1\fnil Segoe UI;}}
{\*\generator Riched20 10.0.18362}\viewkind4\uc1
\pard\f0\fs18 どうもありがとうミスターロボット\f1\par
}
Text showed the right results after setting the RTF back to Rtf.
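To make "shifted per the docs" concrete: the EM_STREAMIN/EM_STREAMOUT documentation describes putting the code page in the high word of wParam, i.e. (CP_UTF8 << 16) | SF_USECODEPAGE | SF_RTF. The Python sketch below only computes that value. The constant values are the Win32 header values as I recall them, so treat them as assumptions to check against richedit.h, and stream_flags is a hypothetical helper name.

```python
# Values as I recall them from the Win32 headers (assumption: verify locally).
SF_RTF = 0x0002          # stream format: RTF
SF_USECODEPAGE = 0x0020  # code page is supplied in the high word of wParam
CP_UTF8 = 65001          # UTF-8 code page identifier


def stream_flags(code_page: int, base_format: int = SF_RTF) -> int:
    """Build the wParam for EM_STREAMIN / EM_STREAMOUT (hypothetical helper).

    The code page is shifted into the high 16 bits; the stream-format
    flags stay in the low word.
    """
    return (code_page << 16) | SF_USECODEPAGE | base_format


utf8_rtf = stream_flags(CP_UTF8)
print(hex(utf8_rtf))  # prints 0xfde90022
```

The same value would be passed for both directions; only the message (EM_STREAMIN or EM_STREAMOUT) differs.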
I'd say it is an easy call to change the input to always push in as UTF-8. Output isn't so obvious, as it would likely break interop scenarios (other code wouldn't be able to read the output). We might want to consider adding another getter, perhaps string Utf8Rtf { get; }.
I haven't tried UTF-16, but I expect it not to work, as my cursory read of the RichEdit code seems to indicate it depends on ASCII bytes for control words (which UTF-8 gives you for code points below 128).
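The UTF-8 versus UTF-16 expectation can be illustrated at the byte level without RichEdit at all. The Python sketch below only shows the encoding property (ASCII control words survive byte-for-byte in UTF-8 but not in UTF-16); it says nothing about how RichEdit actually parses its input, and control_bytes_survive is a hypothetical helper name.

```python
def control_bytes_survive(rtf: str, encoding: str) -> bool:
    """Return True if the ASCII bytes of '\\pard' appear intact in the
    encoded stream (hypothetical helper for illustration only)."""
    return b"\\pard" in rtf.encode(encoding)


sample = "\\pard\\f0 どうも\\par"

# UTF-8: every code point below 128 is a single byte equal to its ASCII value,
# so a byte-oriented parser still sees the control words.
print(control_bytes_survive(sample, "utf-8"))

# UTF-16: every ASCII character becomes two bytes (one of them 0x00),
# so the control word is no longer a contiguous ASCII byte sequence.
print(control_bytes_survive(sample, "utf-16-le"))

# In UTF-8, non-ASCII characters use only bytes >= 0x80, so they can never
# be mistaken for '\\', '{' or '}'.
print(all(b >= 0x80 for b in "どうも".encode("utf-8")))
```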
RichTextBox uses the latest RichEdit, and as such we should be able to remove the code paths where we're grabbing Encoding and converting back and forth from Unicode to ASCII. If we're using Unicode everywhere, we can be more efficient and avoid conversion headaches. See #3032 for an example of where this happens.