Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

LMclassify always score 1 #68

Closed wuyangjian closed 7 months ago

wuyangjian commented 8 months ago
def classify(self, sentence):
    """Return a dictionary of classification probabilities for the sentence"""
    logprobs = {}
    maxlp = -math.inf
    for key, model in self.lms.items():
        tokens = self.tokenizers[key].tokenize(sentence)
        logprob = -negative_logprob(model, tokens)
        logprobs[key] = logprob
        # Track the highest log-probability for the stability shift below
        maxlp = max(maxlp, logprob)
    # Shift by the maximum before exponentiation to keep the exponents in a safe range
    probs = {key: 2**(lp - maxlp) for key, lp in logprobs.items()}
    psum = sum(probs.values())
    if psum > 0:
        # Normalize into a proper probability distribution
        probs = {key: p / psum for key, p in probs.items()}
    return probs

def score(self, pairs):
    for pair in pairs:
        scores = []
        for ref_label, sentence in zip(self.labels, pair):
            if not sentence:
                # Prevent filtering empty lines
                scores.append(1.0)
                continue
            probs = self.classify(sentence)
            if self.relative_score:
                # Scale by the maximum so that the most likely class gets exactly 1
                maxp = max(probs.values())
                probs = {key: (p / maxp) for key, p in probs.items()}
            scores.append(probs[ref_label])
        yield scores

problem:

    maxlp = max(maxlp, logprob)
    probs = {key: 2**(lp - maxlp) for key, lp in logprobs.items()}

which means lp - maxlp always equals 0
svirpioj commented 7 months ago
  maxlp = max(maxlp, logprob)
  probs = {key: 2**(lp - maxlp) for key, lp in logprobs.items()}
  which means lp-maxlp always equals 0

Yes, the most likely class i will have probs[i] = 1 at this point. Next, the values are normalized so that they sum to 1 (i.e., form a proper probability distribution), and all of them end up below one.
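
To make this concrete, here is a minimal self-contained sketch, using hypothetical hard-coded base-2 log-probabilities in place of real language models (not OpusFilter's API), that walks through the same normalization:

# A minimal sketch, not OpusFilter code: made-up base-2 log-probabilities
# for one sentence under three language models.
logprobs = {"en": -50.0, "de": -53.0, "fi": -60.0}

maxlp = max(logprobs.values())
probs = {key: 2**(lp - maxlp) for key, lp in logprobs.items()}
psum = sum(probs.values())
probs = {key: p / psum for key, p in probs.items()}

print(probs["en"])          # ~0.888: the most likely class, but below 1
print(sum(probs.values()))  # 1.0 (up to rounding): a proper distribution

# Only with relative_score (see score() above) is the top class mapped to exactly 1
maxp = max(probs.values())
relative = {key: p / maxp for key, p in probs.items()}
print(relative["en"])       # 1.0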

Subtracting the highest log-prob before applying the exponential function is a common trick that prevents large negative log-probs from being rounded to zero probabilities due to limited numerical precision. If you do the math, you'll see that it doesn't affect the result after the normalization to probability values: the factor 2**(-maxlp) appears in both the numerator and the denominator, so it cancels out.
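
A small numerical check (again with made-up log-probabilities, not OpusFilter's API) shows both properties: direct exponentiation underflows, while the shifted version gives the same normalized distribution safely:

# Made-up base-2 log-probabilities too negative to exponentiate directly
logprobs = {"en": -2000.0, "de": -2005.0}

# Direct exponentiation underflows: 2**-2000.0 is below the smallest float
direct = {key: 2**lp for key, lp in logprobs.items()}
print(direct)  # {'en': 0.0, 'de': 0.0}: everything rounds to zero

# Shifting by the maximum keeps the exponents in a representable range
maxlp = max(logprobs.values())
shifted = {key: 2**(lp - maxlp) for key, lp in logprobs.items()}
psum = sum(shifted.values())
print({key: p / psum for key, p in shifted.items()})
# {'en': ~0.9697, 'de': ~0.0303}: same distribution the unshifted math would give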

svirpioj commented 7 months ago

I'll close this issue now, but please do reply if there's still something that seems to be wrong.

wuyangjian commented 7 months ago

I apologize for the delayed response. I now understand the issue; it was indeed a misunderstanding on my part: I had confused the LMClassifierFilter and CrossEntropyFilter modules. Thank you very much for your answer. I really appreciate your project, as it has been very helpful to me.