AGI-Edgerunners / LLM-Adapters

Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
https://arxiv.org/abs/2304.01933
Apache License 2.0

Possible Bug In Handling Batch Size During Common Sense Evaluation #61

Open mchorton opened 3 months ago

mchorton commented 3 months ago

I am debugging poor performance of a model I'm experimenting with. It gets reasonably good CoreEN scores, but it generates nonsensical responses when run through commonsense_evaluate.py; for instance, it emits repeated tokens for many inputs.

After some more digging, it looks like this generation call is causing a problem when the batch size is greater than 1.

With a batch size greater than 1, padding tokens are added to many of the batch elements, but the generate() call isn't given any indication of how many padding tokens each input contains. This causes my model to produce garbage outputs whenever a batch contains a lot of padding. If I change the batch size to 1, the outputs are much more reasonable. A padding-aware sketch of batched generation is included below.
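For reference, here is a minimal sketch (not the repo's actual code) of how batched generation can be made padding-aware with Hugging Face transformers: pad on the left and pass the attention mask so generate() knows which positions are padding. The checkpoint name and prompts are placeholders, not the ones used by LLM-Adapters.

```python
# Sketch: padding-aware batched generation with transformers.
# Assumes a decoder-only causal LM; the model name below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "placeholder/llama-7b-hf"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Decoder-only models should be padded on the left so the generated
# continuation directly follows the real prompt tokens.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["Question: ...", "Another, much longer question: ..."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # marks which tokens are padding
        max_new_tokens=32,
        pad_token_id=tokenizer.pad_token_id,
    )

# Strip the prompt portion before decoding each batch element.
generated = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

With left padding and the attention mask passed through, each element of the batch should generate the same output it would with batch_size=1, up to minor numerical differences.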

It seems like this could be the cause of #38: in that case, users are evaluating with batch sizes greater than 1, which seems likely to trigger the same problem.

Also, FWIW, I'm not sure why commonsense_evaluate.py lets users choose a batch size while evaluate.py does not. I'm guessing that's why I'm seeing issues about evaluate.py but not commonsense_evaluate.py.

HZQ950419 commented 3 months ago

Hi, many thanks for pointing out this issue! I added batch decoding to commonsense_evaluate.py for acceleration, since the target responses of the commonsense tasks are very short. But the inputs for those tasks can be very long, so I used batch_size=1 in my own experiments, which is why I didn't encounter this issue.

I'm trying to figure out a solution to this issue. If you have a fix in mind, please feel free to submit a PR.