Open GoogleCodeExporter opened 8 years ago
The Polish profile has a larger total frequency than the Croatian one, so its
ratio is smaller.
Your example also contains a lot of noise (mam, to, roku.Nic, super), so the
detector may get it wrong.
But I know this is just an excuse :D
One option is to use profiles that suit your data.
I created 17 language profiles from a Twitter corpus and committed them here.
http://code.google.com/p/language-detection/source/browse/#git%2Fprofiles.sm
(I will bundle them into language-detection sooner or later.)
I ran language detection on your example with these profiles, and it returned
the correct result, 'pl'.
Original comment by nakatani.shuyo
on 17 Jan 2012 at 7:47
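To illustrate the point about total frequency: if a ratio is taken to be an n-gram's count divided by the profile's total count, the same count yields a smaller ratio in a larger profile. The sketch below assumes that reading of "ratio"; the class name and all numbers are hypothetical and are not taken from the actual Polish or Croatian profiles.

```java
final class RelativeFrequency {
    // Relative frequency of one n-gram: its count divided by the total
    // count of n-grams in the profile (an assumed reading of "ratio").
    static double ratio(long ngramCount, long totalCount) {
        if (totalCount <= 0) {
            throw new IllegalArgumentException("totalCount must be positive");
        }
        return (double) ngramCount / totalCount;
    }

    public static void main(String[] args) {
        // Hypothetical counts, chosen only for illustration.
        double inLargerProfile = ratio(500, 1_000_000);
        double inSmallerProfile = ratio(500, 200_000);
        // Same count, larger total: the ratio is smaller.
        System.out.println(inLargerProfile < inSmallerProfile); // prints "true"
    }
}
```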
I made a mistake in the previous comment.
The profiles I mentioned do not include Croatian, so it cannot output 'hr'. Sorry.
If you want to distinguish the two languages more precisely, you can generate
profiles from an appropriate corpus.
langdetect has a tool that generates profiles from a Wikipedia database or from
arbitrary text. See here:
http://code.google.com/p/language-detection/wiki/Tools
Original comment by nakatani.shuyo
on 17 Jan 2012 at 8:22
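As a rough picture of what building a profile from arbitrary text involves, the sketch below counts character n-grams in a corpus string. It is a simplified illustration only, not the library's profile generator: the class and method names are hypothetical, and the real tool (documented on the Tools wiki page above) may normalize text and store results differently.

```java
import java.util.HashMap;
import java.util.Map;

final class NgramProfileSketch {
    // Count every character n-gram of length 1..maxN found in the text.
    static Map<String, Integer> countNgrams(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A tiny stand-in for a corpus; a real profile needs far more text.
        Map<String, Integer> profile = countNgrams("nic nowego", 3);
        System.out.println(profile);
    }
}
```

A corpus that resembles the target data (tweets rather than Wikipedia articles, for instance) would produce counts that better match the text being detected.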
Thanks for the quick reply :)
I'll give the Twitter profiles a try.
Original comment by mic...@senti1.com
on 17 Jan 2012 at 9:24
Right now I need language detection to exclude non-Polish texts, so the lack of
Croatian in the Twitter corpus is not a problem :)
Returning to my example: why didn't getProbabilities() return Polish at all? If
the Polish profile has a larger total frequency, shouldn't it return something
like [hr: 99%, pl: 40%]? Is there a minimum probability threshold?
Original comment by mic...@senti1.com
on 17 Jan 2012 at 9:31
If you have only one profile (Polish), you will get ~100% probability for most
texts, so more profiles == better results (look at the implementation of
updateLangProb).
The language-detection library is not that good at excluding or recognizing the
correct language of short texts. For better Twitter detection you should do
more than detect the language of a single tweet. Language detection works well
on longer texts. Different language profiles (not based on Wikipedia) might
give better results.
Original comment by markowsk...@gmail.com
on 18 Jan 2012 at 12:48
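The single-profile effect can be seen with a small normalization sketch: when raw scores are rescaled to sum to 1, a lone candidate always ends up at 100%, however poorly it fits. This is an assumed simplification, not the library's updateLangProb code; the class name and the raw scores are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NormalizeSketch {
    // Rescale raw per-language scores so that they sum to 1.
    static Map<String, Double> normalize(Map<String, Double> rawScores) {
        double sum = 0.0;
        for (double score : rawScores.values()) {
            sum += score;
        }
        if (sum <= 0.0) {
            throw new IllegalArgumentException("scores must have a positive sum");
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : rawScores.entrySet()) {
            result.put(e.getKey(), e.getValue() / sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Only one candidate: it gets everything, even with a tiny raw score.
        Map<String, Double> onlyPolish = new LinkedHashMap<>();
        onlyPolish.put("pl", 1e-9);
        System.out.println(normalize(onlyPolish)); // prints "{pl=1.0}"

        // Two candidates: the same mass is now shared between them.
        Map<String, Double> two = new LinkedHashMap<>();
        two.put("pl", 3e-9);
        two.put("hr", 1e-9);
        System.out.println(normalize(two));
    }
}
```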
[deleted comment]
> 4
getProbabilities() returns 'probabilities', so their total is nearly 100%.
Original comment by nakatani.shuyo
on 25 Jan 2012 at 3:07
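Because the values sum to about 100%, a result like [hr: 99%, pl: 40%] cannot occur: if 'hr' takes 99%, at most 1% is left for every other language. The sketch below shows how a reporting cutoff would then drop 'pl' from the list entirely. Whether the library applies such a cutoff, and at what value, is not stated in this thread; the 0.1 used here and the class name are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReportSketch {
    // Keep languages whose probability exceeds the cutoff, best first.
    static List<String> report(Map<String, Double> probabilities, double cutoff) {
        List<Map.Entry<String, Double>> kept = new ArrayList<>();
        for (Map.Entry<String, Double> e : probabilities.entrySet()) {
            if (e.getValue() > cutoff) {
                kept.add(e);
            }
        }
        kept.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> languages = new ArrayList<>();
        for (Map.Entry<String, Double> e : kept) {
            languages.add(e.getKey());
        }
        return languages;
    }

    public static void main(String[] args) {
        // Hypothetical normalized probabilities that sum to 1.
        Map<String, Double> probs = new LinkedHashMap<>();
        probs.put("pl", 0.01);
        probs.put("hr", 0.99);
        // With an assumed cutoff of 0.1, only "hr" is reported.
        System.out.println(report(probs, 0.1)); // prints "[hr]"
    }
}
```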
> 5
Thanks.
Treating multiple tweets as a single text is effective for Twitter detection.
Original comment by nakatani.shuyo
on 25 Jan 2012 at 3:30
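A minimal sketch of the multiple-tweets idea, assuming the tweets to be combined come from the same author or timeline and are therefore likely to share a language. The class and method names are hypothetical; the joined string would be passed to the detector in place of a single tweet.

```java
import java.util.Arrays;
import java.util.List;

final class TweetBatchSketch {
    // Join several tweets into one longer text, skipping null or blank entries.
    static String joinTweets(List<String> tweets) {
        StringBuilder sb = new StringBuilder();
        for (String tweet : tweets) {
            if (tweet == null) {
                continue;
            }
            String trimmed = tweet.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(trimmed);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder tweets; real input would come from one user's timeline.
        List<String> tweets = Arrays.asList("first tweet", "  ", "second tweet");
        System.out.println(joinTweets(tweets));
    }
}
```

The longer combined text gives the detector more n-grams to work with, which is where it is reported to perform better.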
Original issue reported on code.google.com by
mic...@senti1.com
on 16 Jan 2012 at 8:12