asahi417 / lmppl

Calculate perplexity on a text with pre-trained language models. Supports MLM (e.g. DeBERTa), recurrent LM (e.g. GPT3), and encoder-decoder LM (e.g. Flan-T5).
MIT License

A quite large perplexity issue #5

Closed gotutiyan closed 1 year ago

gotutiyan commented 1 year ago

Hi, thank you for developing lmppl.

I have a question about an unexpectedly large perplexity.

I installed lmppl and executed the commands described in the README as follows, but get_perplexity() returns very large values. Is there something wrong with my procedure?

>>> import lmppl
>>> scorer = lmppl.LM('gpt2')
Using pad_token, but it is not set yet.
>>> text = [
    'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am happy.',
    'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am sad.'
]
>>> ppl = scorer.get_perplexity(text)
100%|██████████| 1/1 [00:00<00:00,  3.03it/s]
>>> ppl
[4.2328431180493815e+43, 4.732356477497072e+43] # <-- They are quite large, there seems to be something wrong.

Versions of some modules in my environment:

Thank you.

asahi417 commented 1 year ago

Hi, thank you so much for finding the issue! Models such as gpt-2 and opt don't have a padding token by default, so I added one when loading the model (https://github.com/asahi417/lmppl/blob/main/lmppl/ppl_recurrent_lm.py#L70). If a new padding token is added in a post-hoc manner, the model assigns it very low probability, so the loss on the padded positions becomes very high and the perplexity explodes. I fixed it by disregarding the newly added padding token when computing the negative log-likelihood, and now it produces reliable scores.
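For readers running into the same symptom elsewhere, here is a minimal sketch of the general pattern described above (not lmppl's actual code; the pad token string and the direct use of transformers are assumptions for illustration): add a pad token only so batching works, then exclude the padded positions from the loss by setting their label ids to -100, which cross-entropy ignores.

```python
# Sketch: add a post-hoc pad token for batching, but mask pad positions
# out of the negative log-likelihood so they don't inflate perplexity.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})  # hypothetical pad token
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # make room for the new token
model.eval()

texts = ["I am happy.", "I am sad today."]
enc = tokenizer(texts, return_tensors="pt", padding=True)

labels = enc["input_ids"].clone()
labels[enc["attention_mask"] == 0] = -100  # -100 = ignored by the loss

with torch.no_grad():
    out = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"],
                labels=labels)

# out.loss is the mean NLL over the non-padded target tokens of the batch;
# exp() turns it into a (batch-level) perplexity on a sensible scale.
print(torch.exp(out.loss).item())
```

Without the `labels[...] = -100` line, the untrained pad token contributes huge per-token losses and the exponentiated value blows up, which matches the behaviour reported in this issue.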

asahi417 commented 1 year ago

I also double-checked against the perplexity computation described in the Hugging Face guide (https://huggingface.co/docs/transformers/perplexity) and confirmed that the values from lmppl match those produced by that approach.
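As an illustration of that cross-check, a minimal per-sentence perplexity computation in the spirit of the Hugging Face guide could look like the following (a sketch for short inputs, not lmppl's implementation; the guide additionally uses a sliding window for long texts):

```python
# Sketch: per-sentence perplexity = exp(mean token-level NLL) under GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        # passing labels=ids makes the model return the mean NLL
        # over the (shifted) target tokens
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

for t in [
    "sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am happy.",
    "sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am sad.",
]:
    print(perplexity(t))
```

Since there is no padding in this single-sentence setup, it gives a reference value that lmppl's batched computation should agree with once pad positions are excluded from the loss.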