Closed takkyu2 closed 1 week ago
Hi @takkyu2, sorry to hear this!
I'll spend some time today and tomorrow looking into it. Meanwhile, would you mind providing the outputs you generated?
Thank you very much for your help! I attached the JSON files of the local generation/eval results and the leaderboard eval results below: results.zip
Hi @takkyu2, I re-evaluated the provided files here: yi_results.zip
I only got ~6 timeout tasks. Most of your "timeout" tasks are related to the sklearn dataset download and modeling, which I'd expect to need more time to pass the tests. The current v0.1.7.0 release focuses on evaluation speed, so I set the time limit to 120 seconds for both ground truths and generated outputs. I've extended the time limit to 240 seconds in the upcoming v0.1.8 release: #17.
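A per-task wall-clock limit like this can be sketched with a plain subprocess call. This is a simplified stand-in for BigCodeBench's actual sandboxed runner; the function name and status strings are illustrative:

```python
import subprocess
import sys

def run_with_limit(path, limit=240):
    """Run a generated solution script with a wall-clock limit in seconds
    (240 s mirrors the v0.1.8 setting; v0.1.7 used 120 s).
    Returns "pass", "fail", or "timeout"."""
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=limit,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed when the limit is exceeded.
        return "timeout"
    return "pass" if proc.returncode == 0 else "fail"
```

With a short limit, a solution stuck on a slow dataset download would be reported as "timeout" even though it might pass given more time, which matches the behavior described above.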
For the re-evaluated results, I got the same results as the reported ones based on the pre-generated outputs. Regarding your local outputs, I got the same scores as yours, though the number of timeout tasks is significantly reduced.
One thing I find quite strange is the extra spaces after commas and full stops in the prompts of your local outputs. For example, you got `random. shuffle` in the docstring, but the actual one is `random.shuffle`. I assume this difference may explain the discrepancy. I didn't get the extra space during generation. I'm now doing a new generation, and it should be finished shortly. ~So far, I have found no such issues in the newly generated outputs.~ I'm using vLLM v0.5.1, and the one in the Docker image should be v0.5.0.
Would you mind doing a new set of generations without the Docker image to see if the extra spaces still exist? I suspect this is due to some incompatibility in your environment. The original prompts do not have these spaces.
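One quick way to confirm the artifact is to normalize out spaces after `.`/`,` and compare against the reference prompt. This is a throwaway helper for diagnosis, not part of the bigcodebench tooling:

```python
import re

def differs_only_by_extra_spaces(local, reference):
    """True if `local` matches `reference` once spaces inserted after
    '.' or ',' are removed — i.e. the only corruption is the extra-space
    artifact (e.g. 'random. shuffle' vs 'random.shuffle')."""
    strip = lambda s: re.sub(r"([.,]) +", r"\1", s)
    return strip(local) == strip(reference) and local != reference
```

If this returns True for the affected prompts, the tokenization/detokenization layer inserted spaces but nothing else changed.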
Hi @terryyz, thank you for the very detailed analysis, this helps a lot! Let me rerun the generation without the docker environment.
I hadn't noticed that space-after-comma issue; this is strange... I will check my environment (e.g., library versions) for problems there.
Sorry @takkyu2, here's a correction based on my newly generated outputs (new_yi_results.zip): I got 36.0, which is pretty similar to what you got. The original generation was done on May 22nd, as documented on my server. The framework should be based on this version: https://github.com/bigcode-project/bigcodebench/tree/3cdf0ea6484c6c4fcb6ef26ed4bf3c7e7be1b552, when the framework was originally called WildCodeBench. There was not much difference between `generate.py` and `model.py` except for the module name changes. I barely touched `bigcodebench.generate` once it became stable. ~My explanation is that some updates to vLLM caused such a great discrepancy.~
I'm not sure if it's necessary to update the results of the leaderboard, given that vLLM keeps changing, and so do some model chat templates.
For reference, I attached all the files of Yi 9B Chat here: original_yi_9b_results.zip
I checked other recently evaluated models. They don't have the space-after-comma or space-after-full-stop issues. I wonder if this issue is Yi-model-specific.
FYI, I'm running CodeQwen as an example to see if there is any degradation. Let me know if you want to check other models :)
Thank you @terryyz! Hmm, yeah, the root cause might be that some change at the vLLM layer affects LLM outputs, causing the discrepancy.
Regarding whether this issue is Yi-model-specific: as far as I have tried, instruct task scores are worse than the leaderboard values for other models as well. Unlike the Yi-model scores, though, those scores were evaluated without the Docker environment, so the difference may be attributable to the environment difference.
Instruct task scores evaluated without the Docker environment:
model | Leaderboard | local evaluation |
---|---|---|
google/codegemma-7b-it | 32.3 | 27.5 (🔻4.8) |
meta-llama/Meta-Llama-3-8B-Instruct | 31.9 | 29.1 (🔻2.8) |
Thanks! @takkyu2 I'll do the evals on these two models and see what I can get.
Hi @takkyu2,
While I'm waiting for the other two models, here's the result for CodeQwen1.5-7B-Chat on the Complete split. The difference is not that big.
model | Leaderboard | local evaluation |
---|---|---|
Qwen/CodeQwen1.5-7B-Chat | 43.6 | 44.7 (🔺1.1) |
I also noticed there have been quite a few discussions in vLLM regarding the inconsistency of greedy decoding: https://github.com/vllm-project/vllm/issues/5898. I generally use a batch size of 5 to speed up the process. I should pin a separate issue for this in our repo. However, I don't expect the inconsistency to result in a great discrepancy. My current guess is that the observed difference is likely due to updates in the vLLM version. Also, a note: there was a big change in vLLM from v0.4.3 to v0.5.0 on June 12th: https://github.com/vllm-project/vllm/releases/tag/v0.5.0.
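For context, batched generation here just means splitting the prompt list into fixed-size chunks before handing each chunk to the engine; with greedy decoding, the outputs should ideally not depend on how the chunks are formed, which is exactly the property the linked vLLM issue calls into question. A minimal sketch (the helper name is illustrative):

```python
def batches(prompts, batch_size=5):
    """Yield successive fixed-size chunks of the prompt list.
    The last chunk may be shorter than batch_size."""
    for i in range(0, len(prompts), batch_size):
        yield prompts[i:i + batch_size]
```

If greedy decoding were perfectly deterministic, rerunning with `batch_size=1` versus `batch_size=5` would produce identical outputs; observed divergence between those runs is the inconsistency under discussion.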
BTW, could you please check the pass rate of the ground truths in your local environment? That will tell you whether the great discrepancy is due to the local environment or just the generations. Ideally, the ground-truth pass rate is close to 100%; I get 99.6% on my machine, for example.
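With greedy decoding, this pass rate is just the fraction of tasks whose ground-truth solution passed its tests. A sketch of the computation, assuming a simple task-id-to-status mapping (the actual schema of BigCodeBench's eval JSON may differ):

```python
def ground_truth_pass_rate(results):
    """Fraction of tasks whose ground-truth solution passed.
    `results` maps task_id -> {"status": "pass" | "fail" | "timeout"}
    (an assumed schema for illustration)."""
    passed = sum(1 for r in results.values() if r["status"] == "pass")
    return passed / len(results)
```

A rate well below ~100% on ground truths would point to an environment problem (missing system libraries, sandbox restrictions, timeouts) rather than to the model's generations.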
Okay, I got the following results on my machine using vLLM v0.5.1.
model | Leaderboard | local evaluation |
---|---|---|
google/codegemma-7b-it | 32.3 | 28.3 (🔻4) |
meta-llama/Meta-Llama-3-8B-Instruct | 31.9 | 28.8 (🔻3.1) |
The results are very close to yours, suggesting the decoding inconsistency is minimal. ~The main reason for the degradation should be the changes from v0.4.x to v0.5.x.~
Hi @takkyu2, I did more ablation studies.
TL;DR: The main issue is the `transformers` version, while vLLM still has some inconsistency.
I experimented with different vLLM versions, and the results didn't change much. So did `flash-attn` and `triton`. However, I observed a great difference when downgrading `transformers` to v4.40.2. I remember I was using v4.40.* to evaluate the models reported in the arXiv paper.
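When trying to reproduce reported numbers, it can help to verify programmatically that the installed library matches the series used for the paper. A small sketch using the standard library (the helper name and return strings are illustrative):

```python
from importlib.metadata import PackageNotFoundError, version

def check_pinned(pkg, prefix):
    """Return "ok" if the installed version of `pkg` starts with `prefix`
    (e.g. prefix "4.40." for the transformers series used here),
    otherwise a short diagnostic string."""
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        return f"{pkg} not installed"
    return "ok" if installed.startswith(prefix) else f"{pkg} {installed} does not match {prefix}*"
```

For example, `check_pinned("transformers", "4.40.")` would flag any environment running a later series than the one the reported scores were produced with.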
Specifically, I used Yi-9B-Chat as an example: 4402_yi.zip
subset | Leaderboard | local evaluation |
---|---|---|
complete | 42.4 | 41.8 (🔻0.6) |
instruct | 34.5 | 33.4 (🔻1.1) |
The weird extra spaces disappeared in the attached outputs. I haven't noticed anyone discussing similar issues before. It should be a big issue, IMO. However, for lack of a detailed investigation, I don't know which part of the implementation caused such a degradation. Let me know if you'd like to investigate this. Otherwise, we can simply file an issue in the `transformers` repo.
Hi @terryyz, thanks a lot for the quick turnaround and spotting the root cause!
I agree that filing this issue with the `transformers` folks is a good idea. This sounds like an unexpected change on the `transformers` side, and they should know better than we do what changed between v4.40.* and later versions.
Thank you again for your tremendous help!
I answered on the thread but am available to fix this asap! sounds bad 😢
Thanks @ArthurZucker! Hope it will be fixed soon. I expect this issue will greatly affect other benchmarks. It should be a big problem, but no one has concretely discussed this...
Hi @takkyu2! Just a note that v0.1.8 has been released with a temporary fix. More details about BigCodeBench-Hard can be found in https://huggingface.co/blog/terryyz/bigcodebench-hard.
Closed this issue for now :)
Thanks a lot @terryyz for addressing the issue, and congratulations to the bigcodebench-hard release 🎉! I will try v0.1.8 when I have enough bandwidth.
Hi team! First things first, thank you for creating this wonderful benchmark! I believe its curation and evaluation required a lot of effort, so I really appreciate it that you open-sourced the datasets and evaluation scripts for the community.
Summary of the issue
I have been trying to reproduce the leaderboard values by running the scripts locally, and I found that the metrics evaluated locally are consistently worse than the leaderboard values.
Although I understand that it is very hard to reproduce the exact leaderboard values, the difference is rather large: for `01-ai/Yi-1.5-9B-Chat`, the absolute difference in pass@1 is 6.3 for the complete subset and 4.1 for the instruct subset, respectively. Please let me know if I have made any mistakes on my side or if I can provide further information for diagnosing the issue. Thank you!
Results
Notes
I got the error `failed to map segment from shared object` during evaluation.
Steps to reproduce
I ran `01-ai/Yi-1.5-9B-Chat` on an A10 GPU to generate the LLM responses and then evaluated them, using Docker images for both steps. The evaluation was done on 2024-07-08 14:00 for complete and 2024-07-08 18:57 for instruct.
The generation script:
The evaluation script:
Docker images: