ISO 639-2/B code for Chinese

akorentlab / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr


ISO 639-2 assigns two codes to some languages: B (bibliographic) and T 
(terminology). For Chinese, they are:
chi (B)
zho (T)

Tesseract uses the B code for Chinese. This is inconsistent: for every other 
language that has two distinct 639-2 codes, it uses the T code.
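To make the B/T distinction concrete, here is a minimal Python sketch that maps each ISO 639-2 bibliographic code to its terminology counterpart. The table lists the 20 languages for which ISO 639-2 defines two distinct codes. The function names, and the `code_suffix` name shape handled by `normalize_lang_name`, are hypothetical and chosen for illustration only; this is not Tesseract's actual logic.

```python
# ISO 639-2 languages with distinct bibliographic (B) and terminology (T) codes.
B_TO_T = {
    "alb": "sqi", "arm": "hye", "baq": "eus", "bur": "mya", "chi": "zho",
    "cze": "ces", "dut": "nld", "fre": "fra", "geo": "kat", "ger": "deu",
    "gre": "ell", "ice": "isl", "mac": "mkd", "mao": "mri", "may": "msa",
    "per": "fas", "rum": "ron", "slo": "slk", "tib": "bod", "wel": "cym",
}


def to_terminology_code(code: str) -> str:
    """Return the T code for a 639-2 code; codes without a B/T split pass through."""
    code = code.lower()
    return B_TO_T.get(code, code)


def normalize_lang_name(name: str) -> str:
    """Normalize a hypothetical 'code' or 'code_suffix' language name to its T form."""
    base, sep, suffix = name.partition("_")
    return to_terminology_code(base) + sep + suffix
```

Under these assumptions, `to_terminology_code("chi")` yields `"zho"`, while a code such as `"eng"`, which has no separate B form, is returned unchanged.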

Disclaimer: I don't speak Chinese, and have no personal interest in running 
Tesseract over Chinese text. I just suspect that this inconsistency is not 
intentional, so I am bringing it to your attention.

Reference:
http://www.loc.gov/standards/iso639-2/php/code_list.php

Original issue reported on code.google.com by jwilk@jwilk.net on 3 Apr 2013 at 8:25

http://chinese.stackexchange.com/questions/6147/which-one-of-these-two-iso-639-2-code-refers-to-traditional-chinese-chi-or-zho

chi and zho both stand for the Chinese language, not for either of its writing forms. The codes widely used for Traditional Chinese and Simplified Chinese are zh-Hant and zh-Hans respectively, which are IETF language tags as defined in RFC 5646. The improvised codes "tc" and "sc", abbreviating "Traditional Chinese" and "Simplified Chinese", are also often seen on Chinese websites.
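As a rough illustration of the point above, the following Python sketch translates a few labels into IETF language tags. The `_sim` and `_tra` suffixes and the function name `chinese_bcp47` are assumptions made for this example; only `zh-Hans`, `zh-Hant`, and the base subtag `zh` come from the standards.

```python
# Script-specific labels mapped to IETF language tags (RFC 5646).
# Assumption: 'sim'/'tra' suffixes and the informal 'sc'/'tc' abbreviations.
SCRIPT_TAGS = {"sim": "zh-Hans", "sc": "zh-Hans", "tra": "zh-Hant", "tc": "zh-Hant"}

# Codes that name the Chinese language as a whole, with no script implied.
CHINESE_CODES = {"chi", "zho", "zh"}


def chinese_bcp47(label: str) -> str:
    """Map a label such as 'chi_sim', 'zho_tra', 'sc', or 'zho' to a language tag."""
    base, _, suffix = label.lower().partition("_")
    if not suffix and base in SCRIPT_TAGS:
        return SCRIPT_TAGS[base]      # bare 'sc', 'tc', 'sim', or 'tra'
    if base in CHINESE_CODES:
        if not suffix:
            return "zh"               # language only, script unspecified
        if suffix in SCRIPT_TAGS:
            return SCRIPT_TAGS[suffix]
    raise ValueError(f"not a recognized Chinese label: {label!r}")
```

Note that `chi` and `zho` alone map to plain `zh`, since neither code says anything about the writing form.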

ISO 639-2/B code for Chinese #886