Open HsuWanTing opened 6 months ago

I'm using lm-eval v0.4.2 to evaluate Llama 7B on the Open LLM Leaderboard benchmarks, and I found accuracy gaps between a single GPU and multiple GPUs, as shown in the table below (I used data parallelism).

The single-GPU run gets lower accuracies overall: ARC-c, HellaSwag, and GSM8K drop by 0.3~0.5 points. I thought that data parallelism only speeds up the evaluation. Where did the difference come from?

Below is the command line I used for ARC-c; I use CUDA_VISIBLE_DEVICES to control the number of GPUs.
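A minimal sketch of such a pair of commands, assuming the standard lm-eval v0.4.2 CLI; the checkpoint (huggyllama/llama-7b), device lists, and batch size here are illustrative, not the exact values from the original run:

```bash
# Single GPU: expose one device only.
CUDA_VISIBLE_DEVICES=0 lm_eval --model hf \
    --model_args pretrained=huggyllama/llama-7b \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --batch_size 8

# Multiple GPUs (data parallel): expose several devices and launch through
# accelerate, which runs one full model replica per GPU and shards the data.
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch -m lm_eval \
    --model hf \
    --model_args pretrained=huggyllama/llama-7b \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --batch_size 8
```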
Thank you for your efforts! Great table with results to compare!

> Where did the difference come from?
Please check other issues/discussions about speed, batching, and multi-GPU usage for ideas, for example (but not limited to): https://github.com/EleutherAI/lm-evaluation-harness/issues/1625 and https://github.com/EleutherAI/lm-evaluation-harness/issues/704#issuecomment-1670189773
Thanks @LSinev for the quick reply. I've checked the issues you linked and also searched for others myself. Most of them focus on the differences between batch sizes, which are usually very small. I understand there will be some ordering differences, so the loglikelihoods might differ slightly; that small difference is acceptable to me.

However, I didn't find any issue about differences between numbers of GPUs. In my case, accuracy drops by 0.3~0.5 points when using a single GPU, which seems like a large drop. Is this also an expected result?
> Is this also an expected result?
No idea. According to the results in your table, it is also task dependent. You may want to research this case further by deep diving into the code and logging. Maybe even use not-yet-merged PRs like https://github.com/EleutherAI/lm-evaluation-harness/pull/1731 to check the consistency of tokenization (maybe some adding of special tokens happens incorrectly), the propagation of seeds, the splitting (and restoring of the original order) of batches, and so on; see the sketch below for one way to compare runs per sample.

If batch size alone makes a difference, I suppose multiple GPUs may introduce even more.
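One concrete way to localize such a divergence without reading the harness internals is to rerun both configurations with per-sample logging and diff the logged responses document by document. A sketch, assuming the `--log_samples` and `--output_path` options of lm-eval v0.4.x plus `jq`; the output directories are illustrative, and the exact nesting of the samples files varies by version:

```bash
# Rerun both setups with per-sample logging added, e.g.:
#   ... --log_samples --output_path results/single_gpu
#   ... --log_samples --output_path results/multi_gpu
# Each run writes samples_<task>*.jsonl files, one JSON object per document.

# Extract (doc_id, filtered_resps) pairs and sort by doc_id, so that any
# batch splitting or reordering between the runs cannot affect the comparison.
find results/single_gpu -name 'samples_arc_challenge*.jsonl' \
    -exec jq -c '[.doc_id, .filtered_resps]' {} + | sort > single.txt
find results/multi_gpu -name 'samples_arc_challenge*.jsonl' \
    -exec jq -c '[.doc_id, .filtered_resps]' {} + | sort > multi.txt

# The first differing doc_id is where to start digging (tokenization,
# padding, or request ordering for that particular document).
diff single.txt multi.txt | head
```

If the per-document responses match but the aggregated scores still differ, the problem is more likely in how results are gathered across ranks than in the forward passes themselves.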