g0v / laweasyread-data

MIT License

Some characters are not converted to UTF-8 #9

Open kong0107 opened 11 years ago

kong0107 commented 11 years ago

For example, rawdata/utf8_lawstat/version2/01011/0101136061100.html fails with an exception. The offending 2-byte character is 0xFECA, which is not defined in Big5. See Wikipedia for the ranges of undefined characters. I guess characters in those ranges cannot be converted to UTF-8.

While parsing with Ruby's String#encode, I logged the characters that throw Encoding::UndefinedConversionError when converting from UTF-8 to Big5 in the error.log of my own project. Some of them match the bug here, so it may help.
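
To illustrate the kind of check described above, here is a minimal Ruby sketch that walks a Big5 byte string one character at a time and collects the byte sequences that String#encode refuses to map to UTF-8. The helper name `undefined_big5_sequences` and the sample bytes are hypothetical, chosen only for this example; this is not the code from the project mentioned above, and whether a given sequence such as 0xFECA is reported may depend on the conversion tables shipped with your Ruby version.

```ruby
# Hypothetical helper (not from the original project): returns the hex form of
# every Big5 character in +bytes+ that Ruby cannot convert to UTF-8.
def undefined_big5_sequences(bytes)
  failures = []
  # Reinterpret the raw bytes as Big5 without transcoding them.
  text = bytes.dup.force_encoding(Encoding::Big5)
  text.each_char do |ch|
    # Skip malformed byte sequences; this sketch only reports unmapped characters.
    next unless ch.valid_encoding?
    begin
      ch.encode(Encoding::UTF_8)
    rescue Encoding::UndefinedConversionError
      failures << ch.bytes.map { |b| format("%02X", b) }.join
    end
  end
  failures
end

# Assumed sample: 0xA4A4 is a common Big5 character, followed by the
# problematic 0xFECA sequence reported in this issue.
sample = "\xA4\xA4\xFE\xCA".b
p undefined_big5_sequences(sample)
```

Checking character by character, rather than encoding the whole file at once, means a single unmapped character does not hide the others, so the full list can be written to a log.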

kong0107 commented 11 years ago

I'll list some I know here: