bbottema / rtf-to-html

RTF to HTML conversion done right

RTF2HTMLConverterRFCCompliant doesn't handle mixed charsets #9

Closed: Yspadadden closed this issue 3 years ago

Yspadadden commented 3 years ago

When an .rtf file contains text in several languages that use different charset code pages, only the document's main code page is used to decode the characters.

This is how it should look:

Screenshot 2021-09-19 at 10 39 00

This is how it actually looks:

Screenshot 2021-09-19 at 10 40 52

In this particular case, we have the following font definition in the font table: {\f6\fswiss\fcharset128 "MS Mincho";}

Charset 128 in RTF parlance is (according to Wikipedia) Windows-932: Japanese.

The text itself then contains {\f6\fs24\htmlrtf0 \'8d\'9f\'92\'76\'8c\'68\'97\'e7\htmlrtf\f5}, which references \f6 in the font table and tells us to interpret the bytes in code page Windows-932. However, the code only uses the main code page of the document, which in this case is \ansicpg1252, or "Latin Alphabet, Western Europe / Americas".

The solution would be to read the font table definitions and map each font's \fcharset value to a Charset. The converter would then decode text with the Charset of the active font, falling back to the main Charset of the document if we don't have a specific one for that font. I will attempt to fix this in a pull request.
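
To make the idea concrete, here is a minimal sketch of the lookup and fallback described above. It is not the converter's actual code: the class name FontCharsetSketch, its methods, and the small \fcharset table are hypothetical and only cover a few values. It also assumes the JDK in use ships the extended charsets (windows-31j is Java's name for code page 932); if it does not, the font is simply skipped and the document default applies.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class FontCharsetSketch {

    // Partial \fcharsetN -> Java charset name table (assumption: only a few entries shown).
    private static final Map<Integer, String> FCHARSET_NAMES = new HashMap<>();
    static {
        FCHARSET_NAMES.put(0, "windows-1252");   // ANSI / Western
        FCHARSET_NAMES.put(128, "windows-31j");  // Japanese, code page 932
        FCHARSET_NAMES.put(161, "windows-1253"); // Greek
        FCHARSET_NAMES.put(204, "windows-1251"); // Cyrillic
    }

    // Matches e.g. \f6\fswiss\fcharset128 inside a single font table entry.
    private static final Pattern FONT_ENTRY =
            Pattern.compile("\\\\f(\\d+)[^;{}]*?\\\\fcharset(\\d+)");

    // Matches one hex escape such as \'8d.
    private static final Pattern HEX_ESCAPE =
            Pattern.compile("\\\\'([0-9a-fA-F]{2})");

    // Builds font number -> Charset from the raw font table text.
    static Map<Integer, Charset> readFontCharsets(String fontTable) {
        Map<Integer, Charset> result = new HashMap<>();
        Matcher m = FONT_ENTRY.matcher(fontTable);
        while (m.find()) {
            String name = FCHARSET_NAMES.get(Integer.parseInt(m.group(2)));
            // Unknown or unsupported charsets are skipped so the fallback applies.
            if (name != null && Charset.isSupported(name)) {
                result.put(Integer.parseInt(m.group(1)), Charset.forName(name));
            }
        }
        return result;
    }

    // Font-specific Charset if known, otherwise the document's main Charset.
    static Charset charsetFor(int fontNumber, Map<Integer, Charset> fonts,
                              Charset documentDefault) {
        return fonts.getOrDefault(fontNumber, documentDefault);
    }

    // Collects all \'xx escapes in the run and decodes them as one byte sequence.
    static String decodeHexRun(String rtfRun, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Matcher m = HEX_ESCAPE.matcher(rtfRun);
        while (m.find()) {
            bytes.write(Integer.parseInt(m.group(1), 16));
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        Charset documentDefault = Charset.forName("windows-1252"); // from \ansicpg1252
        Map<Integer, Charset> fonts =
                readFontCharsets("{\\fonttbl{\\f6\\fswiss\\fcharset128 \"MS Mincho\";}}");
        String run = "\\'8d\\'9f\\'92\\'76\\'8c\\'68\\'97\\'e7";

        // Decoded with the charset of font 6 versus the document default.
        System.out.println(decodeHexRun(run, charsetFor(6, fonts, documentDefault)));
        System.out.println(decodeHexRun(run, documentDefault));
    }
}
```

Decoding the whole run of escapes at once, rather than byte by byte, matters here because code page 932 is a double-byte encoding: the eight escapes in the example form four characters, not eight.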

bbottema commented 3 years ago

Great bug report! I see that you're doing excellent work in your branch already.