andyperlitch / language-detection

Automatically exported from code.google.com/p/language-detection

Wikipedia less than optimal training database #43

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Train a language profile using Wikipedia
2. Get the detected language and score for parts of a big corpus
3. Check the wrongly detected languages
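
A minimal sketch of steps 2 and 3, in case it helps anyone reproduce this. It is not part of the library: `tally_misdetections`, `stub_detect`, and the three sample sentences are hypothetical names and toy data made up for illustration. The `detect` argument stands in for whatever wrapper you put around the real detector, and is assumed to return a `(language, score)` pair.

```python
from collections import Counter

def tally_misdetections(samples, detect):
    """Count (expected, detected) pairs where the detector got it wrong.

    samples: iterable of (text, expected_language) pairs.
    detect:  callable mapping text -> (language, score); hypothetical wrapper.
    """
    confusions = Counter()
    for text, expected in samples:
        detected, _score = detect(text)
        if detected != expected:
            confusions[(expected, detected)] += 1
    return confusions

def stub_detect(text):
    # Placeholder detector for the sketch only: says "en" whenever the
    # word "the" occurs, otherwise "nl". Replace with a real detector call.
    padded = f" {text.lower()} "
    return ("en", 0.99) if " the " in padded else ("nl", 0.60)

# Toy labelled snippets (invented for illustration, not real results).
samples = [
    ("The cat sat on the mat", "en"),
    ("De kat zat op de mat", "nl"),
    ("Het boek over the Beatles", "nl"),  # Dutch text with an English proper name
]

confusions = tally_misdetections(samples, stub_detect)
for (expected, detected), count in confusions.most_common():
    print(f"expected {expected}, detected {detected}: {count}")
```

Running the same tally with a detector trained on each candidate profile would make it easy to compare which language pairs get confused most often.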

Wikipedia is full of foreign subjects and proper names, so it is not a good
training source for everyday language.
A better profile might be generated from well-maintained corpora.
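
To make the suggestion concrete, here is a rough sketch of building a character n-gram frequency table from a plain-text corpus. It assumes a language profile is essentially a table of character n-gram counts (1 to 3 characters), which is typical for this kind of detector. It is not the project's own profile generator and does not produce its profile file format. The name `ngram_profile` and the `max_n` parameter are hypothetical.

```python
from collections import Counter

def ngram_profile(lines, max_n=3):
    """Count character n-grams (n = 1..max_n) over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Lowercase, collapse runs of whitespace, and pad with spaces so that
        # word-boundary n-grams such as " d" and "e " are counted too.
        text = " " + " ".join(line.lower().split()) + " "
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                if gram.strip():  # skip n-grams that are only whitespace
                    counts[gram] += 1
    return counts

# Usage with an in-memory toy corpus; a real run would read corpus files.
profile = ngram_profile(["De kat zat op de mat"])
print(profile.most_common(5))
```

Comparing the top n-grams of a profile built from Wikipedia with one built from a curated corpus would show how much proper names and foreign terms shift the counts.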

Original issue reported on code.google.com by i...@taaltik.nl on 16 Nov 2012 at 5:24