Abhijit-2592 / spacy-langdetect

A fully customisable language detection pipeline for spaCy
MIT License

Faster language detection #2

Open TobiasJu opened 5 years ago

TobiasJu commented 5 years ago

Hi there, currently the spaCy language detection takes quite a while, because it's doing tokenisation, sentence splitting and so on in the background. I just want the language of the doc; can I somehow improve the speed of spacy-langdetect? Regards!

TobiasJu commented 5 years ago

Just to give you a reference: as a test I detected the language of about 4000 docs of around 100 words each:

- language_guess: 26s
- cld2: 18s
- language_id: 39s
- spaCy-langdetect: 3334s

That equals about 55 minutes, which makes this package unusable for my use case. That is quite sad, because spaCy is an awesome lib!
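Benchmarks like the one above can be reproduced with a small stdlib timing harness; this is only a sketch, and `dummy_detect` is a made-up placeholder standing in for any real detector (detect_langs, cld2, etc.):

```python
import time

def benchmark(fn, docs, label):
    # Time fn over all docs with a monotonic high-resolution clock.
    start = time.perf_counter()
    for doc in docs:
        fn(doc)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return elapsed

# Placeholder detector for illustration only; swap in the real
# detection call you want to measure.
def dummy_detect(text):
    return "de" if text.startswith("Das") else "en"

docs = ["Das ist ein Beispieltext."] * 4000
benchmark(dummy_detect, docs, "dummy_detect")
```

Running each candidate detector through the same harness on the same docs gives directly comparable numbers.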

TobiasJu commented 5 years ago

So today I dug into your lib and found the detector_factory and detect functions, which can be imported with:

from langdetect import detector_factory
from langdetect import detect

They can then be called directly:

detected = detector_factory.detect_langs("Das ist ein Test-Text für die Spracherkennung.")
print(detected)

[de:0.9999983500527911]

The new time is 86s. I hope this helps future users of this lib.

PS: No need for spaCy at all. This works because in your spacy_langdetect.py you do `from langdetect import detect_langs`, which can be imported directly as shown. Which leads to the question: why bother importing spaCy and doing all the unnecessary steps for a simple language detection like this?
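For readers curious why no spaCy machinery is needed: statistical language detectors of this kind work directly on character n-gram statistics, with no tokenisation or sentence splitting required. Below is a minimal pure-Python sketch of a Cavnar-Trenkle-style trigram classifier to illustrate the idea; the sample texts and the penalty constant are made up for illustration, and langdetect itself uses a Naive Bayes model rather than exactly this rank-distance scheme:

```python
from collections import Counter

def trigram_profile(text, top=300):
    # Rank the most frequent character trigrams of a text.
    text = text.lower()
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

def out_of_place(probe, reference, penalty=1000):
    # Sum of rank differences; trigrams missing from the
    # reference profile incur a fixed penalty.
    return sum(abs(rank - reference[g]) if g in reference else penalty
               for g, rank in probe.items())

def guess_language(text, references):
    # Pick the reference profile with the smallest distance.
    probe = trigram_profile(text)
    return min(references, key=lambda lang: out_of_place(probe, references[lang]))

# Tiny illustrative training samples; a real system builds
# profiles from large per-language corpora.
references = {
    "en": trigram_profile("this is a simple example of an english text that "
                          "we use to build a small trigram profile for the "
                          "english language"),
    "de": trigram_profile("das ist ein einfaches beispiel für einen deutschen "
                          "text den wir verwenden um ein kleines profil für "
                          "die deutsche sprache zu erstellen"),
}

print(guess_language("das ist ein kurzer deutscher satz", references))
print(guess_language("this is a short english sentence", references))
```

Since the whole computation is a frequency count plus a dictionary lookup per trigram, it is clear why a dedicated detector is far cheaper than running a full NLP pipeline per document.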

JonanOribe commented 4 years ago

> `detector_factory.detect_langs`

Thanks for the approach, it really improves the performance.

MichaelJanz commented 4 years ago

Thanks for that great answer! Since you specify the model when creating the pipeline, I am wondering which model is used by default?

lsmith77 commented 2 years ago

We are currently using fasttext, which is performing quite well. However, since we are already using spaCy models (one for English and one for German) in other parts of the app, I figured it would be interesting to use the spaCy models for language detection as well.

But I am also a bit confused about how it works, since it seems to only use one language model at a time. And now this thread seems to indicate that this solution is just an integration of langdetect into spaCy, not a spaCy-based language detection.

We used langdetect in the past and found it not accurate enough compared to fasttext.