akorentlab / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

ISO 639-2/B code for Chinese #886

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
ISO 639-2 assigns two codes to some languages: B (bibliographic) and T 
(terminology). For Chinese, they are:
chi (B)
zho (T)

Tesseract uses the B code for Chinese. This is inconsistent: for every other 
language that has two distinct 639-2 codes, it uses the T code.
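
To make the B/T distinction concrete, here is a minimal sketch of normalizing a code to its T form. The table is only an illustrative subset of the ISO 639-2 languages with two codes, and the helper names (B_TO_T, to_terminology, is_bibliographic) are hypothetical; none of this is Tesseract's own code.

```python
# Illustrative subset of ISO 639-2 languages that have distinct
# B (bibliographic) and T (terminology) codes. Not the full list.
B_TO_T = {
    "chi": "zho",  # Chinese
    "cze": "ces",  # Czech
    "dut": "nld",  # Dutch
    "fre": "fra",  # French
    "ger": "deu",  # German
    "gre": "ell",  # Modern Greek
}


def to_terminology(code: str) -> str:
    """Return the T code for a known B code; pass any other code through unchanged."""
    code = code.strip().lower()
    return B_TO_T.get(code, code)


def is_bibliographic(code: str) -> bool:
    """True if the code is one of the B codes in the table above."""
    return code.strip().lower() in B_TO_T


if __name__ == "__main__":
    for code in ("chi", "zho", "deu", "eng"):
        print(code, "->", to_terminology(code))
```

With a table like this, "chi" is the only entry that would need remapping if everything else already uses T codes.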

Disclaimer: I don't speak Chinese, and have no personal interest in running 
Tesseract over Chinese text. I just thought that this inconsistency is not 
intentional, so I brought it to your attention.

Reference:
http://www.loc.gov/standards/iso639-2/php/code_list.php

Original issue reported on code.google.com by jwilk@jwilk.net on 3 Apr 2013 at 8:25

GoogleCodeExporter commented 9 years ago

http://chinese.stackexchange.com/questions/6147/which-one-of-these-two-iso-639-2-code-refers-to-traditional-chinese-chi-or-zho

chi and zho both stand for the Chinese language, not for either of its writing forms.

The codes widely used for Traditional Chinese and Simplified Chinese are 
zh-Hant and zh-Hans respectively. These are IETF language tags, specified in RFC 5646.

The improvised codes "tc" and "sc", which abbreviate "Traditional 
Chinese" and "Simplified Chinese", are also often seen on Chinese websites.
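
As a rough sketch of how the improvised codes could be folded into the RFC 5646 tags, assuming only the two aliases mentioned above (the ALIASES table and normalize_chinese_tag are hypothetical names, and any region subtag such as "-TW" is simply dropped here):

```python
# Improvised abbreviations mapped to the RFC 5646 language-script tags.
ALIASES = {"tc": "zh-Hant", "sc": "zh-Hans"}


def normalize_chinese_tag(tag: str) -> str:
    """Map "tc"/"sc" or any zh-Hant/zh-Hans variant to the canonical tag.

    Raises ValueError when the script cannot be determined, e.g. for a bare
    "zh", "chi" or "zho", which name the language but not a writing form.
    """
    key = tag.strip().lower().replace("_", "-")
    if key in ALIASES:
        return ALIASES[key]
    parts = key.split("-")
    if parts[0] == "zh" and len(parts) >= 2 and parts[1] in ("hant", "hans"):
        # Script subtags are conventionally written in title case.
        return "zh-" + parts[1].capitalize()
    raise ValueError(f"cannot determine Chinese script from {tag!r}")


if __name__ == "__main__":
    for tag in ("tc", "SC", "zh_hant", "zh-Hans-CN"):
        print(tag, "->", normalize_chinese_tag(tag))
```

The language-only codes are rejected on purpose, since they do not identify a script.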

Original comment by jkts2...@googlemail.com on 25 Jul 2014 at 9:02