When an .rtf file contains text in several languages that use different code pages, only the document's main code page is used to decode the characters.
This is how it should look:
This is how it actually looks:
In this particular case, we have the following font definition in the fonttable: {\f6\fswiss\fcharset128 "MS Mincho";}
Charset 128 in RTF parlance is (according to Wikipedia) Windows-932: Japanese.
The text itself then contains {\f6\fs24\htmlrtf0 \'8d\'9f\'92\'76\'8c\'68\'97\'e7\htmlrtf\f5} which references \f6 in the font table and tells us to interpret the bytes in codepage Windows-932. However, the code only uses the main Codepage of the document, which in this case is \ansicpg1252 or "Latin Alphabet, Western Europe / Americas".
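To illustrate the difference, here is a minimal Python sketch (not the library's code) that decodes the escaped bytes from the run above twice: once with the font's code page and once with the document's. The codec names `cp932` and `cp1252` are Python's names for these code pages; everything else is taken from the snippet above.

```python
import re

# The escaped bytes from the run above, exactly as they appear in the RTF.
run = r"\'8d\'9f\'92\'76\'8c\'68\'97\'e7"

# Collect the \'hh escapes into raw bytes.
raw = bytes(int(h, 16) for h in re.findall(r"\\'([0-9a-fA-F]{2})", run))

# Decode with the font's code page (\fcharset128 -> code page 932).
# Each pair of bytes becomes one Japanese character.
as_japanese = raw.decode("cp932")

# Decode with the document code page (\ansicpg1252). errors="replace" is
# needed because Python's cp1252 codec leaves byte 0x8d unmapped.
as_western = raw.decode("cp1252", errors="replace")

print(len(raw), len(as_japanese), len(as_western))
print(as_japanese)
print(as_western)
```

The same eight bytes give four characters under code page 932 but eight unrelated characters under 1252, which is the garbling shown in the second screenshot.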
The solution would be to read the font table definitions and map each one to a Charset. Text in a given font would then be decoded with that font's Charset, falling back to the main Charset of the document if we don't have a specific one for the font. I will attempt to fix this in a pull request.
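A rough sketch of that idea in Python, purely for illustration and not the actual fix: the names `FCHARSET_TO_CODEC`, `font_codecs` and `decode_run` are made up here, the mapping covers only a handful of `\fcharset` values, and the `\f5` entry in the sample font table is invented for the example. A real RTF parser would also have to track the current font while walking groups, which this sketch leaves out.

```python
import re

# Partial \fcharsetN -> Python codec mapping (an assumed subset, extend as needed).
# \fcharset0 (ANSI) is left out on purpose so it falls back to the document code page.
FCHARSET_TO_CODEC = {
    128: "cp932",   # Japanese (Shift JIS)
    129: "cp949",   # Korean
    134: "gbk",     # Simplified Chinese (code page 936)
    136: "cp950",   # Traditional Chinese (Big5)
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    204: "cp1251",  # Cyrillic
    238: "cp1250",  # Central European
}

# Matches "\f6 ... \fcharset128" within a single font table entry.
FONT_DEF = re.compile(r"\\f(\d+)[^;{}]*?\\fcharset(\d+)")


def font_codecs(font_table: str) -> dict:
    """Map font numbers to codecs for every font with a known \\fcharset."""
    codecs = {}
    for font, charset in FONT_DEF.findall(font_table):
        codec = FCHARSET_TO_CODEC.get(int(charset))
        if codec is not None:
            codecs[int(font)] = codec
    return codecs


def decode_run(raw: bytes, font: int, codecs: dict, default_codec: str) -> str:
    """Decode with the font's codec, falling back to the document's codec."""
    return raw.decode(codecs.get(font, default_codec), errors="replace")


# Sample font table: the \f6 entry is from this issue, the \f5 entry is invented.
table = r'{\fonttbl{\f5\fnil\fcharset0 Arial;}{\f6\fswiss\fcharset128 "MS Mincho";}}'
codecs = font_codecs(table)
raw = bytes.fromhex("8d9f92768c6897e7")

print(codecs)
print(decode_run(raw, 6, codecs, "cp1252"))  # font-specific codec
print(decode_run(raw, 5, codecs, "cp1252"))  # fallback to document codec
```

Leaving unknown charsets out of the mapping keeps the fallback explicit: any font without an entry is decoded with the document's `\ansicpg` code page, which matches the current behaviour.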