ebasaran / language-detection

Automatically exported from code.google.com/p/language-detection

How to improve detection rate? #55

Closed. GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
I have a document in multiple languages (4 or 5), organized more or less 
like this:
* 10 pages in language A,
* 10 pages in language B,
* 10 pages in language C,
and so on.

The total text length is 475,216 characters, so the text is quite long. I've set 
maxTextLength to 10 million characters, so the whole text should be analyzed.
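For reference, a minimal sketch of the setup I mean (assuming the standard com.cybozu.labs.langdetect API; the "profiles" path and the document string are placeholders):

```java
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

public class WholeDocumentDetect {
    public static void main(String[] args) throws LangDetectException {
        DetectorFactory.loadProfile("profiles");    // path to the language profiles (placeholder)

        String longDocumentText = "...";            // the 475,216-character document (placeholder)
        Detector detector = DetectorFactory.create();
        detector.setMaxTextLength(10000000);        // raise the limit from the 10,000-character default
        detector.append(longDocumentText);

        // Reports probabilities for the whole document in one pass.
        System.out.println(detector.getProbabilities());
    }
}
```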

Unfortunately, in 9 out of 10 runs (ok, maybe 4 out of 5 ;) I get only one 
language detected, with a probability of 0.9999999 or similar. I'm aware that 
multi-language documents are not supported, but I would still consider this a 
bug, since 80% of the text in the document is NOT in the detected language. Even 
when I get two languages reported as probable, their distribution is skewed 
(0.85 to 0.14).

I suspect that the text is sampled somehow and that the sampling is skewed towards 
the beginning of the text. How can this be changed to improve detection? For my use 
case it would be enough if the program output were similar to:
en: 0.13
de: 0.14
fr: 0.15
it: 0.13
etc.

Would changing Detector.ITERATION_LIMIT to a larger value help?

Original issue reported on code.google.com by MKlepacz...@gmail.com on 30 Apr 2013 at 11:47

GoogleCodeExporter commented 9 years ago
langdetect cannot handle documents in multiple languages.
You should split your document into paragraphs and detect the language of each one 
separately; then I suppose you will obtain results close to what you hope for. A sketch of 
this approach is shown below.
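A minimal sketch of that approach, assuming the standard com.cybozu.labs.langdetect API; splitting on blank lines and the "profiles" path are assumptions, not part of the library:

```java
import java.util.List;

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import com.cybozu.labs.langdetect.Language;

public class PerParagraphDetect {
    public static void main(String[] args) throws LangDetectException {
        DetectorFactory.loadProfile("profiles");            // language profiles directory (assumption)

        String document = "...";                            // the multi-language document (placeholder)
        String[] paragraphs = document.split("\\n\\s*\\n"); // split on blank lines (assumed paragraph rule)

        for (String paragraph : paragraphs) {
            if (paragraph.trim().isEmpty()) continue;

            Detector detector = DetectorFactory.create();   // a fresh detector per paragraph
            detector.append(paragraph);

            // Each paragraph now gets its own language distribution,
            // e.g. [de:0.9999981] for a German paragraph.
            List<Language> probabilities = detector.getProbabilities();
            System.out.println(probabilities);
        }
    }
}
```

Detecting per paragraph (or per page) keeps each detection run monolingual, which is the case the library is designed for, instead of averaging one distribution over the whole mixed document.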

Original comment by nakatani.shuyo on 24 Jul 2013 at 8:59