AGI-Edgerunners / LLM-Adapters

Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
https://arxiv.org/abs/2304.01933
Apache License 2.0

Possible Bug In Handling Batch Size During Common Sense Evaluation #61

Open mchorton opened 3 months ago

mchorton commented 3 months ago

I am debugging poor performance of a model I'm experimenting with. It gets reasonably good CoreEN scores, but it generates nonsensical responses when run through commonsense_evaluate.py; for instance, it emits repeated tokens for many inputs.

After some more digging, it looks like this generation call is causing a problem when the batch size is greater than 1.

With a batch size greater than 1, padding tokens are added to many of the batch elements, but the generate() call isn't given any indication of how many padding tokens each input contains. This causes my model to produce garbage outputs whenever a batch contains a lot of padding. If I change the batch size to 1, the outputs are much more reasonable. A padding-aware sketch of batched generation is included below.
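For reference, here is a minimal sketch (not the repo's actual code) of how batched generation can be made padding-aware with Hugging Face transformers: pad on the left and pass the attention mask so generate() knows which positions are padding. The checkpoint name and prompts are placeholders, not the ones used by LLM-Adapters.

```python
# Sketch: padding-aware batched generation with transformers.
# Assumes a decoder-only causal LM; the model name below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "placeholder/llama-7b-hf"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Decoder-only models should be padded on the left so the generated
# continuation directly follows the real prompt tokens.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["Question: ...", "Another, much longer question: ..."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # marks which tokens are padding
        max_new_tokens=32,
        pad_token_id=tokenizer.pad_token_id,
    )

# Strip the prompt portion before decoding each batch element.
generated = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

With left padding and the attention mask passed through, each element of the batch should generate the same output it would with batch_size=1, up to minor numerical differences.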

It seems like this could be the cause of #38: in that case, users are evaluating with batch sizes greater than 1, which seems likely to trigger the same problem.

Also, FWIW, I'm not sure why commonsense_evaluate.py lets users choose a batch size while evaluate.py does not. I'm guessing that's why I'm seeing issues about evaluate.py but not commonsense_evaluate.py.

HZQ950419 commented 3 months ago

Hi, many thanks for pointing out this issue! I added batch decoding to commonsense_evaluate.py for acceleration, since the target responses of the commonsense tasks are very short. But the inputs for those tasks can be very long, so I used batch_size=1 in my own experiments, which is why I didn't encounter this issue.

I'm trying to figure out a solution to this issue. If you have a fix in mind, please feel free to submit a PR.