Mimino666 / langdetect

Port of Google's language-detection library to Python.

Minimum Text Length Threshold for Reliable Language Detection in Langdetect #110

Open Chetan-Yeola opened 10 months ago

Chetan-Yeola commented 10 months ago

What is considered a 'short text' in langdetect, and is there a specific minimum text length threshold for reliable language detection?

jeanbaptisteb commented 4 months ago

@Chetan-Yeola According to the presentation page of this other library, langdetect performs poorly on texts about as long as Twitter messages ("For very short text snippets such as Twitter messages, they do not provide adequate results."). That suggests anything under 280 characters might give poor results, assuming the page does not exaggerate the problem. However, the page is a bit vague, and the threshold (if any) might be higher than 280 characters. It probably also depends on the language: I guess some languages are much easier to detect than others (e.g. Hebrew, which uses a rare alphabet, vs. Spanish, which is very similar to other Romance languages).

But you could test this automatically with a large sample of short texts taken from the various language editions of Wikipedia, to see whether the error rate is acceptable for your requirements. The page does not mention the classification error rate they observed to support this statement, so if your own error-rate requirements are fairly liberal, it may be worth taking the time to test.
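
The kind of test described above could be sketched roughly like this. It is only a minimal sketch under assumptions: `error_rate_by_length` and `run_with_langdetect` are hypothetical helper names, not part of langdetect; you have to supply the labelled samples yourself (for example paragraphs from different Wikipedia editions); and the truncation lengths are arbitrary. The only langdetect pieces used are `detect` and `DetectorFactory.seed`.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def error_rate_by_length(
    samples: Iterable[Tuple[str, str]],
    detect_fn: Callable[[str], str],
    lengths: List[int],
) -> Dict[int, float]:
    """Hypothetical helper: fraction of misclassified samples per truncation length.

    samples   -- (expected_language_code, text) pairs
    detect_fn -- any callable mapping a text to a language code
    lengths   -- truncation lengths in characters
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for expected, text in samples:
        for n in lengths:
            if len(text) < n:
                continue  # text too short to yield a snippet of n characters
            snippet = text[:n]
            totals[n] += 1
            try:
                predicted = detect_fn(snippet)
            except Exception:
                predicted = None  # a failed detection is counted as an error
            if predicted != expected:
                errors[n] += 1
    # Lengths for which no sample was long enough are left out of the result.
    return {n: errors[n] / totals[n] for n in lengths if totals[n]}


def run_with_langdetect(
    samples: Iterable[Tuple[str, str]], lengths: List[int]
) -> Dict[int, float]:
    """Hypothetical wrapper that plugs langdetect into the helper above."""
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    return error_rate_by_length(samples, detect, lengths)
```

Calling `run_with_langdetect(samples, [20, 50, 100, 280])` with a few thousand `(language_code, text)` pairs per language should show how the error rate changes with length for the languages you care about. Because the detector is passed in as a plain callable, the same helper could be reused to compare against another library.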