SaiedAlshahrani closed this issue 1 year ago.
Could you share the complete code so that I can reproduce the error?
Thank you for getting back to me. The code works fine. I figured out the problem and solved it. My inputs contained a lot of extra whitespace, which caused the tensor size mismatch. I will close this issue.
Can I ask how you fixed the whitespace issue? I am struggling with the same thing, but whenever I try to fix it, I get a different error.
The code was fine with English texts, but once you use it with multilingual text, Arabic in my case, you get this error. I tried many things to reduce the error, like removing extra whitespace, newlines (\n), tabs (\t), and even digits. The code worked but was still buggy.
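For anyone trying the same cleanup, here is a minimal sketch of the kind of normalization described above. It is not the code from this thread: the function names normalize_text and clean_corpus and the drop_digits flag are hypothetical, and it only uses the Python standard library. It collapses runs of whitespace (spaces, newlines, tabs) into a single space, optionally strips digits, and drops inputs that end up empty.

```python
import re


def normalize_text(text: str, drop_digits: bool = False) -> str:
    """Collapse whitespace and optionally remove digits (hypothetical helper)."""
    if drop_digits:
        # For str patterns, \d also matches non-ASCII digits such as Arabic-Indic ones.
        text = re.sub(r"\d+", "", text)
    # \s covers spaces, tabs, newlines, and other Unicode whitespace.
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(texts, drop_digits: bool = False):
    """Normalize every text and drop the ones that are empty afterwards."""
    cleaned = (normalize_text(t, drop_digits=drop_digits) for t in texts)
    return [t for t in cleaned if t]
```

Applying something like clean_corpus to the list or data frame column before scoring removes empty and whitespace-only rows, which is one plausible source of inputs with unexpected lengths. As noted above, this reduced the error for me but did not fully fix it.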
Therefore, I rewrote my code from scratch and adopted a different approach to computing perplexity for MLMs, based on Salazar et al., "Masked Language Model Scoring", https://arxiv.org/abs/1910.14659. I noticed one limitation of their method: the pseudo-perplexity score is sensitive to sentence length, so I handled that by setting a minimum and maximum sentence length. You can skip this if length effects do not matter to you. In my case they did, because I was comparing the performance of many MLMs and needed the comparison to be consistent.
You can look at the Stack Overflow thread "How to calculate perplexity of a sentence using huggingface masked language models?" or at my rough implementation: https://github.com/SaiedAlshahrani/pImplications/blob/main/pseudo_ppl.py. You will probably need to change many things to get it to work for your needs.
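To make the idea concrete, here is a rough sketch of the pseudo-perplexity computation with a length filter. It is not the implementation linked in this thread. In Salazar et al., the pseudo-log-likelihood of a sentence is the sum, over its tokens, of the log-probability of each token when that token alone is masked, and the corpus pseudo-perplexity is exp(-(sum of pseudo-log-likelihoods) / (total number of tokens)). The model call is deliberately left out: score_token is a hypothetical callable you would back with your own MLM, and min_len / max_len are assumed names for the sentence length bounds.

```python
import math
from typing import Callable, Iterable, Optional, Sequence

# Hypothetical interface: returns log P(tokens[i] | tokens with position i masked).
TokenScorer = Callable[[Sequence[str], int], float]


def pseudo_log_likelihood(tokens: Sequence[str], score_token: TokenScorer) -> float:
    """Sum the masked-token log-probabilities over every position."""
    return sum(score_token(tokens, i) for i in range(len(tokens)))


def corpus_pseudo_perplexity(
    sentences: Iterable[Sequence[str]],
    score_token: TokenScorer,
    min_len: int = 1,
    max_len: Optional[int] = None,
) -> float:
    """exp(-total PLL / total tokens), skipping sentences outside the bounds."""
    total_pll = 0.0
    total_tokens = 0
    for tokens in sentences:
        n = len(tokens)
        if n < min_len or (max_len is not None and n > max_len):
            continue  # keep sentence lengths comparable across models
        total_pll += pseudo_log_likelihood(tokens, score_token)
        total_tokens += n
    if total_tokens == 0:
        raise ValueError("no sentences within the length bounds")
    return math.exp(-total_pll / total_tokens)


def uniform_scorer(tokens: Sequence[str], i: int) -> float:
    """Stand-in for a real model: every token gets probability 1/4."""
    return math.log(0.25)
```

With the stand-in uniform_scorer, every token has probability 1/4, so the result is about 4 regardless of which sentences pass the filter. With a real model, the filter decides which sentences contribute to the score, which is why the same bounds should be used for every MLM being compared.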
I always get this error when I apply the get_perplexity() method on a list of texts or a data frame column:
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 194 but got size 193 for tensor number 125 in the list.
Any idea what is causing this error?
Thank you