BlenderBotSmall fluency

Hi there. I have a question about BlenderBot Small 90M.

I have applied a safety framework to BlenderBot Small to force safe generations, and now I need to measure the fluency of the generated safe answers. The common practice here is to feed my generations as labels to a larger model and compute perplexity. I tried this with Llama 2, but the resulting perplexities are very high, around 400k. I assume the cause is the large gap in size between the two models (BlenderBot Small vs. Llama 2). How would you suggest measuring the fluency of answers generated by BlenderBot Small?
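For reference, the perplexity described above is the exponential of the mean per-token negative log-likelihood that the scoring model assigns to the generated text. The sketch below is a minimal illustration, not the code used in this issue: it assumes the per-token natural-log probabilities have already been extracted from the scoring model, and the names `perplexity` and `uniform_baseline` are hypothetical helpers introduced here.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), using natural logs.

    `token_logprobs` is assumed to hold log p(token | context) for each
    scored token of one generation (padding and prompt tokens excluded).
    """
    if not token_logprobs:
        raise ValueError("need at least one scored token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def uniform_baseline(vocab_size: int) -> float:
    """Perplexity of a model that assigns 1/vocab_size to every token."""
    return perplexity([math.log(1.0 / vocab_size)])


# A model that gives every token probability 0.5 has perplexity 2.
print(perplexity([math.log(0.5)] * 4))

# Uniform guessing over a 32,000-token vocabulary (Llama 2's vocabulary size).
print(uniform_baseline(32_000))
```

The uniform baseline can serve as a rough sanity check: a model guessing uniformly over a vocabulary of size V has perplexity V, so a value of about 400k is well above what uniform guessing over a 32,000-token vocabulary would give. That may point to an issue in how the text is scored (for example, label alignment or padding tokens included in the loss) rather than only to the difference in model size, though this is a guess without seeing the scoring code.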

facebookresearch / ParlAI

BlenderBotSmall fluency #5084