Abhijit-2592 / spacy-langdetect

A fully customisable language detection pipeline for spaCy
MIT License

Inconsistent confidence score #3

Open ZodiacFRA opened 4 years ago

ZodiacFRA commented 4 years ago

Hello,

I'm using spacy-langdetect with Python 3.7.6 on Fedora 31, on an Intel(R) Core(TM) i7-7500U CPU. I'm using it to detect the language of sentences in a text (which contains both French and English sentences).

I've run into a strange problem when executing:

import spacy
from spacy_langdetect import LanguageDetector

english_nlp_model = spacy.load("en_core_web_sm")
english_nlp_model.add_pipe(LanguageDetector(), name='language_detector', last=True)

doc = english_nlp_model(my_text)
for sent in doc.sents:
    # print the same extension twice to compare the two reads
    print(f"-> {sent._.language} | {sent._.language}")

For most of the results the two values are quite close (as floats go), and I did not include those in this excerpt, but as you can see, for some of the values it's quite different:

-> {'language': 'fr', 'score': 0.7142844894042983} | {'language': 'fr', 'score': 0.9999969204393719}
-> {'language': 'it', 'score': 0.9999932252642086} | {'language': 'it', 'score': 0.7142820534070277}
-> {'language': 'fr', 'score': 0.9999980859423542} | {'language': 'fr', 'score': 0.9999945446302498}
-> {'language': 'nl', 'score': 0.8571380236556732} | {'language': 'nl', 'score': 0.9999957626562326}

Any ideas on what could cause this? Have a great day!

EDIT: some languages change as well, but it's quite rare (imo it's just because there are fewer possible languages, so it's not as visible, but both are affected by the same problem)

-> {'language': 'fi', 'score': 0.5714279768513012} | {'language': 'de', 'score': 0.5714273200401079}

EDIT2: I tried printing it 3 times just to be sure, and the values often differ across all of them:

-> {'language': 'fr', 'score': 0.9999958708843195} | {'language': 'fr', 'score': 0.7142804945594394} | {'language': 'fr', 'score': 0.8571393420155614}

EDIT3: Just for information, my text is mostly composed of notes taken quickly, which obviously doesn't help detection but does help highlight the problem, I think. I've tried with normal English text and the scores still change a little, but it's not as visible as with my text.

aamcintosh commented 3 years ago

I have also noted inconsistent scores. The code I was using went like this:

import math

import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

def get_lang_detector(nlp, name):
    return LanguageDetector()

eng_nlp = spacy.load("en_core_web_lg")
Language.factory("language_detector", func=get_lang_detector)
eng_nlp.add_pipe('language_detector', last=True)
...........................................
sentence_threshold = 0.90
for doc in self.eng_nlp.pipe(sentences):
    langs = []
    scores = []
    for i in range(5):
        # each access to doc._.language re-runs the detection
        lang = doc._.language["language"]
        score = doc._.language["score"]
        langs.append(lang)
        scores.append(score)
    # flag any disagreement among the 5 language guesses
    oops = any(langs[0] != langs[i] for i in range(5))
    if not oops and langs[0] != "en":
        # flag scores that fall on different sides of the threshold
        sign_0 = math.copysign(1, scores[0] - sentence_threshold)
        oops = any(sign_0 != math.copysign(1, scores[i] - sentence_threshold)
                   for i in range(5))
    if oops:
        print(langs, scores)

I also do the same thing for the paragraph, using " ".join(sentences). I get this (PMID is a PubMed paper ID). This complaint is about one of the "sentences" in the abstract of that paper, and then about the entire abstract.

PMID 134435 Sentence oops:  Langs: ['en', 'en', 'en', 'de', 'en']
PMID 134435 Sentence oops:  Scores: [0.8571390606714735, 0.8571390295157906, 0.9999958831576141, 0.5714292463415062, 0.7142828647104207]
PMID 134435 Paragraph oops.  Langs: ['en', 'en', 'en', 'en', 'de']
PMID 134435 Paragraph oops.  Scores: [0.7142831346993775, 0.7142851057380656, 0.7142834277378337, 0.7142846675001696, 0.7142847751645527]

And this one, where only the score is inconsistent, but it's all over the map:

PMID 55295 Paragraph oops.  Langs: ['en', 'en', 'en', 'en', 'en']
PMID 55295 Paragraph oops.  Scores: [0.8571390387840068, 0.9999961964510007, 0.8571401002100192, 0.7142855836268951, 0.8571390751698879]

I understand that the outcome may be determined by a random start when fitting a model, but this business of getting a different answer every time I check the result is more than a little counterintuitive.

spaCy is correct to be confused about 134435, since half the abstract is in English and the other half is in German. However, in this case the "sentence" as defined in the input data was the complete "paragraph". Both were just over 200 words. I would have expected a little more consistency. It doesn't matter here because the score is relatively low, but I wish it were more consistent in cases that do matter.

The abstract for 55295 is only 38 words, and while it is in English, there are a lot of medical terms. I could understand a somewhat low score, but the score here varied between 0.71 and 0.999996 the 5 times I read it.

This was with spaCy 3.0.6 and spacy-langdetect 0.1.2.

As I was writing this, I had a sudden thought. This is being done in a multiprocessing environment. The creation of eng_nlp is done after each individual process has started up. Is it possible that something is shared when it shouldn't be? It doesn't seem like it, but I thought it worth mentioning. The processes are not spawned by langdetect, for whatever that's worth.

davebulaval commented 3 years ago

The problem is that this package is mostly a wrapper around the langdetect one. It uses a factory to create a language detector, but the detector is re-initialized every time the function detect_langs is called. Since the wrapper calls this function once for the whole doc, once for each sentence of the doc, and once for each token of each sentence, it re-initializes constantly. For example, for a doc of 1 sentence with 10 tokens, it does so 12 times. That makes no sense for this use case, but it does for langdetect itself, since there you call detect_langs only once per text.
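
For what it's worth, you can see the nondeterminism in langdetect alone, without spaCy in the picture. A minimal sketch (the sample text is made up):

from langdetect import detect_langs

# A new detector is created for every call and, without a fixed seed,
# its sampling is seeded randomly, so repeated detect_langs calls on the
# same short or ambiguous text can return different scores (and
# occasionally different languages).
text = "Notes rapides, short mixed notes"  # made-up ambiguous text
for _ in range(3):
    print(detect_langs(text))  # output may differ on each iteration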

So yes, the seed is not the same all the time. But you can set the seed on langdetect's DetectorFactory, or set the package's seed manually. It's not clean, but here is a quick fix:

from langdetect import DetectorFactory
from spacy_langdetect import LanguageDetector

def get_lang_detector(nlp, name):
    DetectorFactory.seed = 42  # set the seed for the langdetect package
    return LanguageDetector()
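
For spaCy 3, the seeded factory can then be registered and added by name, mirroring the earlier snippet (a sketch; the model name is just an example):

import spacy
from spacy.language import Language

# register the factory once, then add the component by its string name,
# reusing get_lang_detector from the block above
Language.factory("language_detector", func=get_lang_detector)
nlp = spacy.load("en_core_web_sm")  # any pipeline works here
nlp.add_pipe("language_detector", last=True)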

To make it simpler, I've released a package based on a fork of this project.

aamcintosh commented 3 years ago

Unfortunately, setting the seed only solves half the problem. Once you set the seed, you get the same score every time for the same text. However, if you change the seed, you can get a very different score and language for the same text. If you make the central loop in my example something like this:

  for i in range(5):
      DetectorFactory.seed = 42 + 10*i  # a different seed for each read
      d = doc._.language                # each read re-runs detection
      langs.append(d["language"])
      scores.append(d["score"])

you can still get answers that are very different. Setting the seed just makes them the same every time you run the code. (It also eliminates the "it changes every time I look at it" problem you get when the seed is unset.)

I have finally settled on asking for N different results using different seeds (N=5, mostly). If the languages differ, I throw the text away as possibly multilingual. If the languages are all the same but the scores are too low (look at the min or the median), I also throw the text away. If the languages are all the same and the scores are high enough, I keep the text. This isn't foolproof, but it's not horrible. It gets fooled by paragraphs that are mostly recitations of the names of organic compounds, and by short English paragraphs that contain the names/addresses of hospitals in non-English-speaking countries.
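
A rough sketch of that filter, calling langdetect directly rather than through spaCy (the function name, seed values, and threshold are illustrative):

import statistics

from langdetect import DetectorFactory, detect_langs

def keep_text(text, n=5, threshold=0.90):
    # run detection n times, with a different seed each time
    langs, scores = [], []
    for i in range(n):
        DetectorFactory.seed = 42 + 10 * i
        best = detect_langs(text)[0]  # top-ranked guess: .lang and .prob
        langs.append(best.lang)
        scores.append(best.prob)
    if len(set(langs)) > 1:
        return False  # disagreement: possibly multilingual
    # languages agree; require the median score to clear the threshold
    return statistics.median(scores) >= threshold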

davebulaval commented 3 years ago

Is there a reason you run the detection 5 times?

Of course, since it is a statistical model, a different seed will give different results each time. But if you fix the seed for ALL of the execution, then it will always be the same. Also, I don't think langdetect is that good... for instance, "Hello" is not classified as English...
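
For reference, langdetect itself documents this as the way to enforce consistent results: set the seed once, before the first detection.

from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # set once, before any detection
print(detect_langs("Hello"))  # now identical on every run (though, as
                              # noted above, not necessarily 'en')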

aamcintosh commented 3 years ago

5 is just a fairly arbitrary small number. 3 is just a little too small.

Why multiple tries? Setting the seed just hides the inconsistent behavior. If I use two different seeds, I can sometimes get two very different scores, e.g. 0.75xxxx and 0.9999xxxx. Assuming langdetect gives a consistent answer for the language, the score is a random variable if the seed is chosen at random. Think about using the median to summarize that distribution. (The median is better than the mean given the skewness of the distribution.) A sample mean is a good estimate of the distribution mean. You can use this to do a test of how well the text matches the principal language. I'm not sure how easy it would be to make probability statements.
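
As a toy illustration (the scores are made up, but mirror the values seen earlier in this thread), the median is less sensitive to the skew than the mean:

import statistics

# five seeded reads of the same text, one of which drew a near-1.0 score
scores = [0.7142, 0.7142, 0.8571, 0.8571, 0.9999]
print(statistics.mean(scores))    # 0.8285, pulled up by the outlier
print(statistics.median(scores))  # 0.8571, a steadier summary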

davebulaval commented 3 years ago

Ok, I got it; the point is to compute an average.

Yeah, I agree with you that langdetect is not robust. I find it really odd to see such a difference between seeds. My solution is more about always getting the same results when running something with it; it does not fix the randomness of some of the results.