FGRibreau / node-language-detect

🇫🇷 NodeJS language detection library using n-gram
http://blog.fgribreau.com/2011/07/week-end-project-nodejs-language.html
MIT License

Unexpected translation results #1

Closed AvianFlu closed 13 years ago

AvianFlu commented 13 years ago

I'm attempting to use your library in an IRC bot that has a Twitter feed, to help reduce the non-English tweets that make it through the filtering. The filtering was mostly successful, except for a couple of tweets like these. (The first result from your language detector is printed first, then the tweet itself. I have left out the user names attached to the tweets.)

 [ 'pidgin', 0.22029761904761902 ]
 node-db-oracle - Oracle database bindings for Node.js http://bit.ly/l7T8mS

 [ 'danish', 0.14746268656716421 ]
 Hiring Systems Engineer blog.nodejitsu.com scaling node.js ... http://t.co/IBUYTHk

I can completely understand where an algorithm would have difficulty with the strange text that most Tweets tend to consist of - but do you have any thoughts here? Even if I had to make a list of three or four languages that can count as 'probably English', I'd be okay with it, but I haven't really used this kind of library before - so I'd like your input.

Thanks again for putting this together - it's a big improvement for me already!
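
The "list of languages that count as probably English" idea could be sketched roughly as below. This is only an illustration: `countsAsEnglish` and the default accepted list are hypothetical, chosen from the two misdetections quoted above, and the only thing taken from the thread is the `[language, score]` shape of a result.

```javascript
// Hypothetical helper, not part of the library.
// `results` is assumed to be an array of [language, score] pairs sorted by
// descending score, matching the output printed above.
function countsAsEnglish(results, accepted = ['english', 'pidgin', 'danish']) {
  if (results.length === 0) return false; // nothing detected, so do not guess
  const [topLanguage] = results[0];
  return accepted.includes(topLanguage);
}

// The two top results quoted above would both be let through:
console.log(countsAsEnglish([['pidgin', 0.22029761904761902]])); // true
console.log(countsAsEnglish([['danish', 0.14746268656716421]])); // true
```

The trade-off is that a tweet that really is written in Danish would pass the filter too, so the accepted list should stay as short as possible.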

FGRibreau commented 13 years ago

Hi!

I've checked these tweets with the original library (PEAR::LanguageDetect) and here is what came out:

node-db-oracle - Oracle database bindings for Node.js http://bit.ly/l7T8mS
[pidgin] => 0.220297619048
[english] => 0.16375
[danish] => 0.160119047619
[dutch] => 0.15625

Hiring Systems Engineer blog.nodejitsu.com scaling node.js ... http://t.co/IBUYTHk
[danish] => 0.147462686567
[norwegian] => 0.147213930348
[latin] => 0.142885572139
[portuguese] => 0.134676616915
[dutch] => 0.132039800995

Same as node-language-detect, so these results are inaccurate because the input is so short. If you want to be sure about a tweet's language, the first result should have a score above roughly 0.3-0.4; otherwise, chances are the detection will be wrong.
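
A minimal sketch of that score cutoff, assuming results arrive as `[language, score]` pairs sorted by descending score as in the output above. The helper name `confidentLanguage` and the default of 0.35 (the middle of the suggested 0.3-0.4 range) are assumptions made for illustration, not part of the library.

```javascript
// Hypothetical helper: return the top language only when its score clears
// the cutoff, otherwise null to signal "not sure".
function confidentLanguage(results, minScore = 0.35) {
  if (results.length === 0) return null;
  const [language, score] = results[0];
  return score >= minScore ? language : null;
}

// The top result for the node-db-oracle tweet falls below the cutoff:
console.log(confidentLanguage([
  ['pidgin', 0.220297619048],
  ['english', 0.16375],
])); // null
```

Returning `null` rather than a best guess lets the caller decide what to do with uncertain tweets, for example letting them through unfiltered.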

AvianFlu commented 13 years ago

Thanks! I figured it would ultimately be Twitter's fault. :-P

FGRibreau commented 13 years ago

Don't get me wrong, it's just that the input data is insufficient sometimes :s

AvianFlu commented 13 years ago

Yeah, and I'm sure that the shortlink URLs some tweets contain don't help, either. I'll just put together a 'best guess' system and see what I can do.
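
One possible first step for such a best-guess system is to drop the shortlinks before running detection. The sketch below is only an assumption about how that might look: `stripUrls` is a hypothetical helper, it only removes URLs that start with `http://` or `https://`, and whether it actually improves the scores would need to be measured.

```javascript
// Hypothetical pre-processing step: remove http(s) URLs, then tidy whitespace.
function stripUrls(text) {
  return text
    .replace(/https?:\/\/\S+/g, ' ') // drop each URL up to the next whitespace
    .replace(/\s+/g, ' ')            // collapse the gaps left behind
    .trim();
}

console.log(stripUrls('node-db-oracle - Oracle database bindings for Node.js http://bit.ly/l7T8mS'));
// node-db-oracle - Oracle database bindings for Node.js
```

Bare domains without a scheme, such as blog.nodejitsu.com in the second tweet, are left untouched by this pattern.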