FGRibreau / node-language-detect

🇫🇷 NodeJS language detection library using n-gram
http://blog.fgribreau.com/2011/07/week-end-project-nodejs-language.html
MIT License
399 stars, 45 forks

language detection failing on basic English input #41

Closed: illerucis closed this issue 2 years ago

illerucis commented 2 years ago

Hey folks,

I'm testing out this library with some sample posts from our application, and I'm getting some strange results:

console.log
  post text: take me back home

console.log
  language detection results: [ [ 'pidgin', 0.3275 ], [ 'hawaiian', 0.2816666666666666 ] ]

console.log
  post text: i like your hair

console.log
  language detection results: [ [ 'hawaiian', 0.26625 ], [ 'norwegian', 0.25145833333333334 ] ]

I installed via npm install languagedetect --save

illerucis commented 2 years ago

English is not even in the top 5 results:

console.log
  post text: take me back home

console.log
  language detection results: [
    [ 'pidgin', 0.3275 ],
    [ 'hawaiian', 0.2816666666666666 ],
    [ 'hausa', 0.265625 ],
    [ 'dutch', 0.20395833333333335 ],
    [ 'slovene', 0.19854166666666662 ]
  ]

console.log
  post text: i like your hair

console.log
  language detection results: [
    [ 'hawaiian', 0.26625 ],
    [ 'norwegian', 0.25145833333333334 ],
    [ 'icelandic', 0.23479166666666662 ],
    [ 'turkish', 0.22270833333333329 ],
    [ 'welsh', 0.21479166666666671 ]
  ]

FGRibreau commented 2 years ago

Yes, language-detect is trigram-based, which is a statistical approach, so it sometimes needs more input text to produce the right result :)
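
To make the point about input length concrete, here is a rough sketch of how little signal a four-word post gives a trigram model. This is not the library's actual code: `trigramCounts` is a hypothetical helper written for illustration, and the lowercasing and space-padding are assumptions about how such a tokenizer might work.

```javascript
// Illustrative sketch only, NOT languagedetect's implementation.
// trigramCounts is a hypothetical helper: it lowercases the text,
// collapses whitespace, pads with one space on each side (an assumption),
// and counts every overlapping 3-character window.
function trigramCounts(text) {
  const padded = ` ${text.toLowerCase().replace(/\s+/g, " ").trim()} `;
  const counts = new Map();
  for (let i = 0; i + 3 <= padded.length; i++) {
    const gram = padded.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// One of the sample posts from this issue.
const sample = trigramCounts("take me back home");
console.log(`${sample.size} distinct trigrams`);
console.log([...sample.keys()]);
```

With so few trigrams to compare against each language profile, a handful of windows that happen to be common in another language can dominate the ranking. Longer input gives the comparison more windows to average over.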