buganini / bsdconv

A simple but powerful DSL for charset/encoding conversion and transformation, pure C implementation with no extra dependencies
https://bsdconv.io/bsdconv/
BSD 2-Clause "Simplified" License

GB2312's em dash #18

Open Artoria2e5 opened 7 years ago

Artoria2e5 commented 7 years ago

bsdconv's GB2312 table comes from unicode.org (where it went missing after the EASTASIA charts became obsolete) and is, to some extent, similar in quality to Unicode's Big5 table. (I will use unicode.org's hex values to refer to GB code points, so add 0x8080 to get the EUC-CN bytes.)
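To make the 0x8080 offset concrete, here is a minimal Python sketch of the arithmetic. The helper name `gb_to_euc_cn` is made up for illustration and is not part of bsdconv.

```python
def gb_to_euc_cn(code: int) -> bytes:
    """Convert a unicode.org-style GB2312 code point (e.g. 0x212A)
    to its two-byte EUC-CN form by adding 0x8080."""
    return (code + 0x8080).to_bytes(2, "big")


# The two code points discussed in this issue.
print(gb_to_euc_cn(0x212A).hex())  # a1aa
print(gb_to_euc_cn(0x2124).hex())  # a1a4
```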

In GB2312-1980, 212A is defined as 破折号 (em dash), but the Unicode mapping gives U+2015 (horizontal bar) instead of U+2014 (em dash), apparently without reading the Chinese text at all. Hence the GB2312 decoder should be changed to emit U+2014 for proper punctuation, and the encoder should accept U+2014 in addition to U+2015.

By the way, 212A is one of "Unicode" gb2312-80's incompatibilities with GBK; the other one is at 2124. You may choose to use a non-fullwidth, regular "middle dot" as GBK does and W3C CLREQ recommends typographically, but what I hope for now is just the encoder accepting U+00B7.
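The requested behaviour can be sketched as a pair of override tables layered on top of the unicode.org mapping. This is only an illustration in Python, not bsdconv's actual table format. The names `UNICODE_ORG_TABLE`, `DECODER_OVERRIDES`, `ENCODER_EXTRAS`, `decode` and `encode` are hypothetical, and only the two cells under discussion are included.

```python
# The two cells as they appear in unicode.org's GB2312 table
# (GB code point -> Unicode character).
UNICODE_ORG_TABLE = {
    0x2124: "\u30fb",  # KATAKANA MIDDLE DOT
    0x212A: "\u2015",  # HORIZONTAL BAR
}

# Hypothetical decoder override: emit an em dash for 212A.
DECODER_OVERRIDES = {0x212A: "\u2014"}

# Hypothetical extra inputs the encoder should also accept.
ENCODER_EXTRAS = {
    "\u2014": 0x212A,  # EM DASH
    "\u00b7": 0x2124,  # MIDDLE DOT, as GBK uses
}


def decode(code: int) -> str:
    """Decode one GB code point, preferring the overrides."""
    return DECODER_OVERRIDES.get(code, UNICODE_ORG_TABLE[code])


def encode(ch: str) -> int:
    """Encode one character, accepting both the original and extra mappings."""
    if ch in ENCODER_EXTRAS:
        return ENCODER_EXTRAS[ch]
    for code, mapped in UNICODE_ORG_TABLE.items():
        if mapped == ch:
            return code
    raise ValueError(f"cannot encode U+{ord(ch):04X}")
```

With these tables, 212A decodes to U+2014, while the encoder still accepts the old U+2015 and U+30FB alongside U+2014 and U+00B7, so text produced with the previous mapping keeps round-tripping to the same bytes.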

buganini commented 7 years ago

Please feel free to change anything about Simplified Chinese. I am not a native user of it, and the current state is just enough for my previous use cases.

Artoria2e5 commented 7 years ago

Sure.

Artoria2e5 commented 7 years ago

Wait... With #17 how did it even work...

buganini commented 7 years ago

You can add or rewrite encoders/decoders, and/or replace or add aliases.

Aliases are defined in https://github.com/buganini/bsdconv/blob/master/modules/from/alias and https://github.com/buganini/bsdconv/blob/master/modules/to/alias

After changing the alias files, make alias will update:

https://github.com/buganini/bsdconv/blob/master/modules/inter/ALIAS-FROM.txt
https://github.com/buganini/bsdconv/blob/master/modules/inter/ALIAS-INTER.txt
https://github.com/buganini/bsdconv/blob/master/modules/inter/ALIAS-TO.txt
https://github.com/buganini/bsdconv/blob/master/modules/inter/ALIAS-FILTER.txt

Big5 uses UAO250 as its default decoder and CP950 as its default encoder, to achieve maximum compatibility in practical use.