Open TobiasJu opened 5 years ago
Just to give you a reference: as a test I detected the language of about 4000 docs, averaging 100 words each:
language_guess: 26s, cld2: 18s, language_id: 39s, spaCy-langdetect: 3334s
That is more than 55 minutes, which makes this package completely useless for my use case. That is quite sad, because spaCy is an awesome lib!
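For anyone who wants to run a similar comparison, here is a minimal timing sketch. It is not the script behind the numbers above; the helper name time_detector and the placeholder detector are made up for illustration, and any function that maps a string to a language code can be passed in.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_detector(
    detect_fn: Callable[[str], str], docs: Iterable[str]
) -> Tuple[List[str], float]:
    """Run detect_fn over every document and return (labels, elapsed seconds)."""
    start = time.perf_counter()
    labels = [detect_fn(doc) for doc in docs]
    elapsed = time.perf_counter() - start
    return labels, elapsed


if __name__ == "__main__":
    # Placeholder detector so the sketch runs without third-party packages;
    # swap in the detector you actually want to measure.
    def dummy_detect(text: str) -> str:
        return "de" if " ist " in text else "en"

    docs = ["Das ist ein Test.", "This is a test."] * 2000  # about 4000 docs
    labels, seconds = time_detector(dummy_detect, docs)
    print(f"{len(labels)} docs in {seconds:.3f}s")
```

Running every candidate on the same list of documents keeps the comparison fair.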
So today I dug into your lib and found detector_factory and detect, which can be imported with:
from langdetect import detector_factory
from langdetect import detect
These can then be called directly:
detected = detector_factory.detect_langs("Das ist ein Test-Text für die Spracherkennung.")
print(detected)
[de:0.9999983500527911]
New time is: 86s. I hope this helps future users of this lib.
PS: No need for spaCy at all. This works because spacy_langdetect.py does "from langdetect import detect_langs", and the same function can be imported directly as shown. Which raises the question: why bother importing spaCy and doing all the unnecessary steps for a simple language detection like this?
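To apply the direct approach to a whole list of documents, something like the following could work. This is only a sketch: the helper name detect_all is hypothetical, it returns just the top language code via langdetect's detect, and it catches a broad Exception to stay free of extra imports (langdetect raises LangDetectException for empty or feature-less input).

```python
from typing import Callable, Iterable, List, Optional


def detect_all(
    texts: Iterable[str],
    detect_fn: Optional[Callable[[str], str]] = None,
) -> List[Optional[str]]:
    """Return one language code per text, or None where detection fails."""
    if detect_fn is None:
        # Imported lazily so the helper can also be used with another detector.
        from langdetect import DetectorFactory, detect

        # langdetect is non-deterministic unless a seed is fixed.
        DetectorFactory.seed = 0
        detect_fn = detect

    results: List[Optional[str]] = []
    for text in texts:
        try:
            results.append(detect_fn(text))
        except Exception:
            # Detection failed for this text (for example, an empty string).
            results.append(None)
    return results
```

With langdetect installed, detect_all(["Das ist ein Test-Text für die Spracherkennung."]) should call langdetect directly, without building a spaCy pipeline.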
detector_factory.detect_langs
Thanks for the approach, really improves the performance
Thanks for that great answer! Since you specify the model when creating the pipeline, I am wondering which model is used by default?
We are currently using fastText, which performs quite well. However, since we already use spaCy models (one for English and one for German) in other parts of the app, I figured it would be interesting to use the spaCy models for language detection as well.
But I am also a bit confused about how it works, since it seems to use only one language model at a time. And now it seems that this solution is just an integration of langdetect into spaCy, not a spaCy-based language detection.
We have used langdetect in the past and found it not accurate enough compared to fastText.
Hi there, currently the spaCy language detection takes quite a while, because it's doing tokenisation, sentence splitting and whatnot in the background. I just want to have the language for the doc. Can I somehow improve the speed of spaCy-langdetect? Regards!