Mimino666 / langdetect

Port of Google's language-detection library to Python.

Optimising for loops #77

Open vmdhhh opened 3 years ago

vmdhhh commented 3 years ago

Should we move the init_factory() call out of detect(), so that calling detect() in a loop or across a DataFrame doesn't have to load the 55 language files over and over again? What do you think? @Mimino666

rafguns commented 3 years ago

For what it's worth, I hacked around this as follows:

from langdetect import DetectorFactory, PROFILES_DIRECTORY

# Load the language profiles once, at import time.
factory = DetectorFactory()
factory.load_profile(PROFILES_DIRECTORY)
detector = factory.create()

def detect(text, detector=detector):
    # Reuse the shared detector: clear the previously appended text first.
    detector.text = ""
    detector.append(text)
    return detector.detect()

Obviously not a proper solution but might be useful as a temporary speed-up. Hopefully this can be fixed within langdetect itself.

trislee commented 2 years ago

The detect function in https://github.com/Mimino666/langdetect/issues/77#issuecomment-880545747 needs to be updated to something like:

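def detect(text, detector=detector):
    # Clear the text appended by the previous call.
    detector.text = ""
    # Clear the cached probabilities so they are recomputed for this text.
    detector.langprob = None
    detector.append(text)
    return detector.detect()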

because the get_probabilities method reuses the previously computed self.langprob whenever it is not None. Without the reset, running the detect function on a list of strings in different languages always returns the language detected for the first string.
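To make the caching problem easy to reproduce without installing anything, here is a minimal sketch. ToyDetector is a hypothetical stand-in, not langdetect's actual Detector: it only imitates the behaviour described above (text accumulated by append, and a langprob cache that is reused while it is not None). Its "detection" is a deliberately crude Cyrillic check, and the names detect_stale and detect_reset are made up for this illustration.

```python
class ToyDetector:
    """Hypothetical stand-in that mimics only the caching described above."""

    def __init__(self):
        self.text = ""
        self.langprob = None  # cached result, reused while it is not None

    def append(self, text):
        self.text += text

    def get_probabilities(self):
        # Recompute only when nothing is cached (the behaviour at issue).
        if self.langprob is None:
            is_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in self.text)
            self.langprob = "ru" if is_cyrillic else "en"
        return self.langprob

    def detect(self):
        return self.get_probabilities()


def detect_stale(text, detector):
    # Resets the text only: the cached result from the first call survives.
    detector.text = ""
    detector.append(text)
    return detector.detect()


def detect_reset(text, detector):
    # Resets both the text and the cache, as in the updated function.
    detector.text = ""
    detector.langprob = None
    detector.append(text)
    return detector.detect()


texts = ["hello world", "привет мир"]

stale_detector = ToyDetector()
stale_results = [detect_stale(t, stale_detector) for t in texts]

reset_detector = ToyDetector()
reset_results = [detect_reset(t, reset_detector) for t in texts]

print(stale_results)  # ['en', 'en']
print(reset_results)  # ['en', 'ru']
```

With the text-only reset, the second string is reported as the first string's language. Clearing langprob as well gives one fresh result per string.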