brendonh / pyth

Python text markup and conversion
MIT License

CJK characters support for RTF parse #15

Closed darkranger-red closed 9 years ago

darkranger-red commented 11 years ago

Hello guys,

CJK means Chinese, Japanese, and Korean. Many old RTF writers don't store these characters as Unicode, so using pyth to read CJK characters from such legacy RTF documents raises a "UnicodeDecodeError": the CJK codecs actually use four hex digits (two consecutive \'xx escapes), not two.
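To illustrate the failure mode, a small Python 3 sketch (the character 中, U+4E2D, is the two-byte cp936/GBK sequence d6 d0, which an old RTF writer emits as the escapes \'d6\'d0):

```python
from binascii import unhexlify

# A double-byte cp936 (GBK) character arrives as two separate \'xx escapes.
# Decoding each byte on its own fails, because 0xd6 is only a lead byte:
try:
    unhexlify("d6").decode("cp936")
except UnicodeDecodeError:
    print("lead byte alone: UnicodeDecodeError")

# Decoding both bytes together succeeds:
print(unhexlify("d6d0").decode("cp936"))  # the character U+4E2D
```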

I modified plugins/rtf15/reader.py to cover my own needs, but I still hope someone can write better code to deal with this issue.

1) Add this import first:

from binascii import unhexlify

2) Add codepage number 936 to the table:

# All the ones named by number in my 2.6 encodings dir
_CODEPAGES_BY_NUMBER = dict(
    (x, "cp%s" % x) for x in (37, 1006, 1026, 1140, 1250, 1251, 1252, 1253, 1254, 1255,
                              1256, 1257, 1258, 424, 437, 500, 737, 775, 850, 852, 855,
                              856, 857, 860, 861, 862, 863, 864, 865, 866, 869, 874,
                              875, 932, 936, 949, 950))

3) Change the default `errors` argument to 'ignore':

def read(self, source, errors='ignore'):
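Why 'ignore' matters here (Python 3 sketch): with the default strict handling, a stray CJK lead byte aborts the whole parse, while 'ignore' silently drops the undecodable byte and lets parsing continue:

```python
# A lone cp936 lead byte raises under strict decoding, but with
# errors='ignore' the undecodable byte is simply dropped.
assert b"\xd6".decode("cp936", "ignore") == ""
assert b"abc\xd6".decode("cp936", "ignore") == "abc"
print("ok")
```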

4) In the escape reader, consume two more hex digits when the charset is a double-byte CJK codepage:

            if next == "'":
                # ANSI escape, takes two hex digits
                chars.extend("ansi_escape")
                digits.extend(self.source.read(2))

                # Double-byte (CJK) charsets encode one character as two
                # consecutive \'xx escapes, so read two more digits.
                # cp932 = Japanese, cp936 = Simplified Chinese,
                # cp949 = Korean, cp950 = Traditional Chinese.
                if self.charset in ("cp932", "cp936", "cp949", "cp950"):
                    if self.source.read(2) == "\\'":
                        digits.extend(self.source.read(2))

                break
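The reading logic above can be sketched standalone (a hypothetical minimal reader over `io.StringIO`; the real reader.py has its own source object):

```python
import io

def read_ansi_escape_digits(source, charset):
    """Read the hex digits of one \\'xx escape; for double-byte CJK
    charsets, also consume a directly following \\'yy escape."""
    digits = list(source.read(2))          # first two hex digits
    if charset in ("cp932", "cp936", "cp949", "cp950"):
        if source.read(2) == "\\'":        # a second escape follows
            digits.extend(source.read(2))  # two more hex digits
    return "".join(digits)

# Stream contents after the first \' has already been consumed:
source = io.StringIO("d6\\'d0 rest")
print(read_ansi_escape_digits(source, "cp936"))  # d6d0
```

Note that, like the patch above, this consumes the two peeked characters even when they are not an escape introducer; a more careful implementation would need to peek or push them back.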

5) Decode the full multi-byte sequence in handle_ansi_escape:

    def handle_ansi_escape(self, code):
        cjk = code
        code = int(code, 16)

        if isinstance(self.charset, dict):
            uni_code = self.charset.get(code)
            if uni_code is None:
                char = u'?'
            else:
                char = unichr(uni_code)
        elif code <= 255:
            # Single-byte escape: decode the one byte directly
            char = chr(code).decode(self.charset, self.reader.errors)
        else:
            # Four hex digits: convert back to two raw bytes and
            # decode them together with the CJK codec
            char = unhexlify(cjk).decode(self.charset, self.reader.errors)

        self.content.append(char)
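The decode step boils down to the following (a Python 3 sketch with a hypothetical `decode_escape` helper; the patch itself targets Python 2, hence `unichr` and `chr(code).decode`):

```python
from binascii import unhexlify

def decode_escape(hex_digits, charset, errors="ignore"):
    # Two hex digits -> one byte; four hex digits -> two bytes.
    # unhexlify turns the accumulated digits back into raw bytes,
    # which the codepage codec then decodes as a whole.
    return unhexlify(hex_digits).decode(charset, errors)

print(decode_escape("e9", "cp1252"))   # single-byte escape
print(decode_escape("d6d0", "cp936"))  # double-byte CJK escape
```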
brendonh commented 11 years ago

This is definitely something I'd like to support, but I'm not sure how (if at all!) it's covered by the RTF specs. Can you give me a couple of example RTF files to test against?

darkranger-red commented 11 years ago

OK, I will collect some files when I get back to the office on Monday.

darkranger-red commented 11 years ago

https://gist.github.com/gists/3850304/download

yairchu commented 9 years ago

Btw, maybe using an incremental decoder would be the right way?
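An incremental decoder would indeed absorb the lead byte and wait for the trail byte, so the reader wouldn't need charset-specific double-read logic. A sketch with `codecs.getincrementaldecoder`:

```python
import codecs

# The incremental decoder buffers an incomplete multi-byte sequence
# instead of raising, and emits the character once it is complete.
dec = codecs.getincrementaldecoder("cp936")()
out = dec.decode(b"\xd6")   # lead byte only: nothing emitted yet
out += dec.decode(b"\xd0")  # trail byte completes the character
print(out)
```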

brendonh commented 9 years ago

Multibyte codepages are fixed in 381a3067add074fb5cf48fbc5e56f5b7ba28d795 and your test file now works.