JohnWang0512 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Update doc manual pages: add all supported languages #1268

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Re: svn trunk rev. 1133

When comparing the outputs of "make install-langs" (or "tesseract 
--list-langs") and "man tesseract" I recently found differences. 

For example, Tesseract appears to support "deu-frak" and "ita_old", but these 
'languages' are not listed in "man tesseract".

For several reasons I think it would be useful to

* update the doc manual page with the full set of supported languages; and
* change the output of the --list-langs option (or add an option 
--list-languages-with-description) so that it also shows the language as 
readable text like

"deu (German)"
"deu-frak (German Fraktur)"

and so on.
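
A minimal sketch of the proposed output, written in Python rather than in Tesseract's own code. The function name `describe_langs` and the `NAMES` table are hypothetical; only the two names from the example above are filled in, and codes without a known name are printed unchanged.

```python
# Hypothetical lookup table; only the two names from the example above are filled in.
NAMES = {
    "deu": "German",
    "deu-frak": "German Fraktur",
}


def describe_langs(codes, names=NAMES):
    """Return one line per language code, with a readable name when one is known."""
    lines = []
    for code in sorted(codes):
        name = names.get(code)
        # Fall back to the bare code when no readable name is available.
        lines.append(f"{code} ({name})" if name else code)
    return lines


if __name__ == "__main__":
    for line in describe_langs(["deu-frak", "deu", "osd"]):
        print(line)
```

Keeping the names in a separate table means the list of codes can still come from whatever language files are installed, while the readable text stays optional.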

Original issue reported on code.google.com by syr...@gmail.com on 6 Aug 2014 at 7:31

GoogleCodeExporter commented 9 years ago
I compiled an overview of the language codes found in the subdirectories and in 
the tesseract doc file.
Entries where the key equals the value (e.g. afr) are available but are not 
listed in the documentation.

Let me know if you want me to (try to) supply a patch for this.

Array
(
    [afr] => afr
    [Albanian] => sqi
    [Arabic] => ara
    [Azerbaijani] => aze
    [bel] => bel
    [ben] => ben
    [Bulgarian] => bul
    [Catalan] => cat
    [Cherokee] => chr
    [Croatian] => hrv
    [Czech] => ces
    [Danish] => dan
    [Danish (Fraktur)] => dan-frak
    [deu-frak] => deu-frak
    [Dutch] => nld
    [English] => eng
    [equ] => equ
    [Esperanto] => epo
    [Estonian] => est
    [eus] => eus
    [Finnish] => fin
    [French] => fra
    [frk] => frk
    [Galician] => glg
    [German] => deu
    [grc] => grc
    [Greek] => ell
    [Hebrew] => heb
    [Hindi] => hin
    [Hungarian] => hun
    [Indonesian] => ind
    [isl] => isl
    [Italian] => ita
    [ita_old] => ita_old
    [Japanese] => jpn
    [kan] => kan
    [Korean] => kor
    [Latvian] => lav
    [Lithuanian] => lit
    [mal] => mal
    [mkd] => mkd
    [mlt] => mlt
    [msa] => msa
    [Norwegian] => nor
    [Old English] => enm
    [Old French] => frm
    [osd] => osd
    [Polish] => pol
    [Portuguese] => por
    [Romanian] => ron
    [Russian] => rus
    [Serbian] => srp
    [Simplified Chinese] => chi_sim
    [slk-frak] => slk-frak
    [Slovakian] => slk
    [Slovenian] => slv
    [Spanish] => spa
    [spa_old] => spa_old
    [swa] => swa
    [Swedish] => swe
    [Tagalog] => tgl
    [Tamil] => tam
    [Telugu] => tel
    [Thai] => tha
    [Traditional Chinese] => chi_tra
    [Turkish] => tur
    [Ukrainian] => ukr
    [Vietnamese] => vie
)
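
The comparison behind this list can be sketched as a set difference. This is a Python sketch, not the script that produced the array above; the function name `undocumented_langs` is hypothetical, and the sample inputs are a small excerpt of the codes listed.

```python
def undocumented_langs(available, documented):
    """Return the codes that are available but missing from the documentation, sorted."""
    return sorted(set(available) - set(documented))


if __name__ == "__main__":
    # Excerpt of the codes found in the checkout (assumed input).
    available = ["afr", "ara", "bel", "deu", "deu-frak", "sqi"]
    # Excerpt of the codes named in the manual page (assumed input).
    documented = ["ara", "deu", "sqi"]
    print(undocumented_langs(available, documented))
```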

Original comment by syr...@gmail.com on 7 Aug 2014 at 9:24

GoogleCodeExporter commented 9 years ago
I would also like to patch tesseract so that the command line option 
--list-languages-with-description prints a list of codes with their language 
names (as I mentioned already).

Original comment by syr...@gmail.com on 7 Aug 2014 at 9:27

GoogleCodeExporter commented 9 years ago
1. Do not mix 2 different topics in one issue.
2. Updating doc for 3.03 with releasing 3.03 language files is strange.
3. I am against "--list-languages-with-description". First of all, there are 
several intentions (e.g. removing language files from the tesseract engine 
repository, separate community training files, other distribution of language 
files...), so "--list-languages-with-description" will never provide accurate 
output.

Next: tesseract follows the ISO 639-3 standard for language filenames. If 
somebody wants to know what a code means, (s)he should use the relevant doc[1]. 
And there is a legal issue: can you implement ISO 639-3 standard information 
under the Apache 2 licence?

[1] http://www-01.sil.org/iso639-3/

Original comment by zde...@gmail.com on 8 Aug 2014 at 11:51

GoogleCodeExporter commented 9 years ago
@Z: I got your points and your view. My list was mainly meant to show you what 
differs between the manual and the checked-out version (see my list above).

What about adding a link to http://www-01.sil.org/iso639-3/codes.asp in both 
the manual and the --list-langs output?

Original comment by syr...@gmail.com on 8 Aug 2014 at 7:20

GoogleCodeExporter commented 9 years ago
That should not be a problem. But I would suggest keeping this issue open until 
the announced changes[1] take place.

[1] https://groups.google.com/forum/#!msg/tesseract-dev/kJEYuvEZuDs/uYBBwwOJE_IJ

Original comment by zde...@gmail.com on 9 Aug 2014 at 6:08