asahi417 / lmppl

Calculate perplexity on a text with pre-trained language models. Supports MLM (e.g., DeBERTa), recurrent LM (e.g., GPT3), and encoder-decoder LM (e.g., Flan-T5).
MIT License

RuntimeError: Sizes of tensors must match #2

Closed SaiedAlshahrani closed 1 year ago

SaiedAlshahrani commented 1 year ago

I always get this error when I apply the get_perplexity() method to a list of texts or a DataFrame column.

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 194 but got size 193 for tensor number 125 in the list.
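
For context, the call pattern being described looks roughly like the sketch below. This is a minimal sketch, not the poster's actual code; it assumes lmppl's MaskedLM class and get_perplexity() as shown in the repository README, and the model checkpoint is only an example.

```python
# Minimal sketch (not the poster's actual code), assuming lmppl.MaskedLM and
# get_perplexity() as described in the repository README.
import lmppl

scorer = lmppl.MaskedLM('microsoft/deberta-v3-small')  # example checkpoint

texts = [
    "a sentence with normal spacing",
    "a sentence  with   irregular    whitespace",  # inputs like this reportedly trigger the mismatch
]

# get_perplexity() takes a list of strings and returns one perplexity per input
ppl = scorer.get_perplexity(texts)
print(ppl)
```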

Any idea what is causing this error?

Thank you

asahi417 commented 1 year ago

Could you share the complete code so that I can reproduce the error?

SaiedAlshahrani commented 1 year ago

Thank you for getting back to me. The code works fine; I figured out the problem and solved it. My inputs contained many extra whitespace characters, which caused the tensor size mismatch. I will close this issue.

NHendrickson9616 commented 1 year ago

Can I ask how you fixed the whitespace issue? I am struggling with the same thing, but whenever I try to fix it, I get a different error.

SaiedAlshahrani commented 1 year ago

The code was fine with English texts, but once you use it with multilingual text, Arabic in my case, you get this error. I tried many things to mitigate it, like removing extra whitespace, newlines (\n), tabs (\t), and even digits. The code then worked but was still buggy.
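
A rough sketch of that kind of cleanup is shown below; the function name and regular expressions are illustrative, not the exact code that was used.

```python
import re

def normalize_text(text: str) -> str:
    """Illustrative cleanup: drop digits, tabs, and newlines, and collapse whitespace."""
    text = re.sub(r"\d+", " ", text)        # remove digits (Unicode-aware, so Arabic-Indic digits too)
    text = re.sub(r"[\t\n\r]+", " ", text)  # replace tabs and newlines with a space
    text = re.sub(r" {2,}", " ", text)      # collapse repeated spaces
    return text.strip()

print(normalize_text("مثال  على\tنص  فيه 123 مسافات \n زائدة"))
```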

Therefore, I had to write my code from scratch and adopted a different approach to computing perplexity for MLMs, based on Salazar et al., "Masked Language Model Scoring" (https://arxiv.org/abs/1910.14659). One limitation of their approach is that the pseudo-perplexity score is sensitive to sentence length, so I had to manage that by setting a minimum and maximum sentence length; you can skip this if length does not matter for your use case. In my case it mattered because I was comparing the performance of many MLMs, so a consistent comparison was crucial.
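
For illustration, here is a condensed sketch of pseudo-perplexity in the spirit of Salazar et al.; it is not the exact code in pseudo_ppl.py, and the model checkpoint and min/max token bounds are placeholders.

```python
# Condensed sketch of pseudo-perplexity (Salazar et al., 2019-style MLM scoring);
# not the poster's exact implementation. Model name and length bounds are placeholders.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-multilingual-cased"  # illustrative multilingual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def pseudo_perplexity(sentence: str, min_tokens: int = 5, max_tokens: int = 50) -> float:
    """Mask each token in turn, sum the log-probabilities of the true tokens,
    and return exp(-average log-prob). Sentences outside the length bounds are
    skipped so that scores stay comparable across models."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    n_tokens = input_ids.size(0) - 2  # exclude [CLS] and [SEP]
    if not (min_tokens <= n_tokens <= max_tokens):
        return float("nan")

    total_log_prob = 0.0
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):  # skip the special tokens
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            total_log_prob += log_probs[input_ids[i]].item()

    return math.exp(-total_log_prob / n_tokens)

print(pseudo_perplexity("هذه جملة قصيرة للتجربة فقط"))
```

Because the score is a per-token average, very short or very long sentences can skew comparisons, which is why the length bounds above are applied before scoring.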

You can look at the StackOverflow thread "How to calculate perplexity of a sentence using huggingface masked language models?" or at my rough implementation: https://github.com/SaiedAlshahrani/pImplications/blob/main/pseudo_ppl.py. You might need to change many things to adapt it to your needs.