Closed takkyu2 closed 1 week ago
Hi @takkyu2, sorry to hear this!
I'll spend some time today and tomorrow looking into it. Meanwhile, would you mind providing the outputs you generated?
Thank you very much for your help! I attached the JSON files of the local generation/eval results and the leaderboard eval results below: results.zip
Hi @takkyu2, I re-evaluated the provided files here: yi_results.zip
I only got ~6 timeout tasks. Most of your "timeout" tasks are related to the sklearn dataset download and modeling, which I'd expect to need more time to pass the tests. The current v0.1.7.0 release focuses on evaluation speed, so I set the time limit to 120 seconds for both ground truths and generated outputs. I've extended the time limit to 240 seconds in the upcoming v0.1.8 release: #17.
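A per-task wall-clock limit like this can be sketched with a plain subprocess call. This is a simplified stand-in for BigCodeBench's actual sandboxed runner; the function name and status strings are illustrative:

```python
import subprocess
import sys

def run_with_limit(path, limit=240):
    """Run a generated solution script with a wall-clock limit in seconds
    (240 s mirrors the v0.1.8 setting; v0.1.7 used 120 s).
    Returns "pass", "fail", or "timeout"."""
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=limit,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed when the limit is exceeded.
        return "timeout"
    return "pass" if proc.returncode == 0 else "fail"
```

With a short limit, a solution stuck on a slow dataset download would be reported as "timeout" even though it might pass given more time, which matches the behavior described above.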
For the re-evaluated results, I got the same results as the reported ones based on the pre-generated outputs. Regarding your local outputs, I got the same scores as yours, though the number of timeout tasks is significantly reduced.
One thing I find quite strange is the extra spaces after commas and full stops in the prompts of your local outputs. For example, you got `random. shuffle` in the docstring, but the actual one is `random.shuffle`. I assume this difference may explain the discrepancy. I didn't get the extra space during generation. I'm now doing a new generation, and it should be finished shortly. ~So far, I have found no such issues in the newly generated outputs.~ I'm using vLLM v0.5.1, and the one in the Docker image should be v0.5.0.
Would you mind doing a new set of generations without the Docker image to see if the extra spaces still exist? I suspect this is due to some incompatibility in your environment. The original prompts do not have these spaces.
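One quick way to confirm the artifact is to normalize out spaces after `.`/`,` and compare against the reference prompt. This is a throwaway helper for diagnosis, not part of the bigcodebench tooling:

```python
import re

def differs_only_by_extra_spaces(local, reference):
    """True if `local` matches `reference` once spaces inserted after
    '.' or ',' are removed — i.e. the only corruption is the extra-space
    artifact (e.g. 'random. shuffle' vs 'random.shuffle')."""
    strip = lambda s: re.sub(r"([.,]) +", r"\1", s)
    return strip(local) == strip(reference) and local != reference
```

If this returns True for the affected prompts, the tokenization/detokenization layer inserted spaces but nothing else changed.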
Hi @terryyz, thank you for the very detailed analysis, this helps a lot! Let me rerun the generation without the docker environment.
I hadn't noticed that space-after-comma issue; this is strange... I will check my environment (e.g., library versions) for problems there.
Sorry @takkyu2, here's a correction based on my newly generated outputs (new_yi_results.zip): I got 36.0, which is pretty similar to what you got. The original generation was done on May 22nd, as documented on my server. The framework should be based on this version: https://github.com/bigcode-project/bigcodebench/tree/3cdf0ea6484c6c4fcb6ef26ed4bf3c7e7be1b552, when the framework was originally called WildCodeBench. There was not much difference between `generate.py` and `model.py` except for the module name changes. I barely touched `bigcodebench.generate` once it became stable. ~My explanation is that some updates to vLLM caused such a great discrepancy.~
I'm not sure if it's necessary to update the results of the leaderboard, given that vLLM keeps changing, and so do some model chat templates.
For reference, I attached all the files of Yi 9B Chat here: original_yi_9b_results.zip
I checked other recently evaluated models. They don't have the space-after-comma or space-after-full-stop issues. I wonder if this issue is Yi-model-specific.
FYI, I'm running CodeQwen as an example to see if there is any degradation. Let me know if you want to check other models :)
Thank you @terryyz! Hmm, yeah, the root cause might be that some change at the vLLM layer affects LLM outputs, causing the discrepancy.
Regarding whether this issue is Yi-model-specific: as far as I have tried, instruct task scores are worse than the leaderboard values for other models as well. Unlike the Yi-model scores, though, those scores were evaluated without the Docker environment, so the difference may be attributable to the environment difference.
Instruct task scores evaluated without the Docker environment:
model | Leaderboard | local evaluation |
---|---|---|
google/codegemma-7b-it | 32.3 | 27.5 (🔻4.8) |
meta-llama/Meta-Llama-3-8B-Instruct | 31.9 | 29.1 (🔻2.8) |
Thanks! @takkyu2 I'll do the evals on these two models and see what I can get.
Hi @takkyu2,
While I'm waiting for the other two models, here's the result for CodeQwen1.5-7B-Chat on the Complete split. The difference is not that big.
model | Leaderboard | local evaluation |
---|---|---|
Qwen/CodeQwen1.5-7B-Chat | 43.6 | 44.7 (🔺1.1) |
I also noticed there have been quite a few discussions in vLLM regarding the inconsistency of greedy decoding: https://github.com/vllm-project/vllm/issues/5898. I generally use a batch size of 5 to speed up the process. I should pin a separate issue for this in our repo. However, I don't expect the inconsistency to result in a great discrepancy. My current guess is that the observed difference is likely due to updates in the vLLM version. Also, a note: there was a big change in vLLM from v0.4.3 to v0.5.0 on June 12th: https://github.com/vllm-project/vllm/releases/tag/v0.5.0.
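For context, batched generation here just means splitting the prompt list into fixed-size chunks before handing each chunk to the engine; with greedy decoding, the outputs should ideally not depend on how the chunks are formed, which is exactly the property the linked vLLM issue calls into question. A minimal sketch (the helper name is illustrative):

```python
def batches(prompts, batch_size=5):
    """Yield successive fixed-size chunks of the prompt list.
    The last chunk may be shorter than batch_size."""
    for i in range(0, len(prompts), batch_size):
        yield prompts[i:i + batch_size]
```

If greedy decoding were perfectly deterministic, rerunning with `batch_size=1` versus `batch_size=5` would produce identical outputs; observed divergence between those runs is the inconsistency under discussion.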
BTW, could you please check the pass rate of the ground truths in your local environment? That will tell you whether the great discrepancy is due to the local environment or just the generations. Ideally, the ground-truth pass rate is close to 100%; I get 99.6% on my machine, for example.
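With greedy decoding, this pass rate is just the fraction of tasks whose ground-truth solution passed its tests. A sketch of the computation, assuming a simple task-id-to-status mapping (the actual schema of BigCodeBench's eval JSON may differ):

```python
def ground_truth_pass_rate(results):
    """Fraction of tasks whose ground-truth solution passed.
    `results` maps task_id -> {"status": "pass" | "fail" | "timeout"}
    (an assumed schema for illustration)."""
    passed = sum(1 for r in results.values() if r["status"] == "pass")
    return passed / len(results)
```

A rate well below ~100% on ground truths would point to an environment problem (missing system libraries, sandbox restrictions, timeouts) rather than to the model's generations.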
Okay, I got the following results on my machine using vLLM v0.5.1.
model | Leaderboard | local evaluation |
---|---|---|
google/codegemma-7b-it | 32.3 | 28.3 (🔻4) |
meta-llama/Meta-Llama-3-8B-Instruct | 31.9 | 28.8 (🔻3.1) |
The results are very close to yours, suggesting the decoding inconsistency is minimal. ~The main reason for the degradation should be the changes from v0.4.x to v0.5.x.~
Hi @takkyu2, I did more ablation studies.
TL;DR: The main issue is the `transformers` version, while vLLM still has some inconsistency.
I experimented with different vLLM versions, and the results didn't change much. So did `flash-attn` and `triton`. However, I observed a great difference when downgrading `transformers` to v4.40.2. I remember I was using v4.40.* to evaluate the models reported in the arXiv paper.
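When trying to reproduce reported numbers, it can help to verify programmatically that the installed library matches the series used for the paper. A small sketch using the standard library (the helper name and return strings are illustrative):

```python
from importlib.metadata import PackageNotFoundError, version

def check_pinned(pkg, prefix):
    """Return "ok" if the installed version of `pkg` starts with `prefix`
    (e.g. prefix "4.40." for the transformers series used here),
    otherwise a short diagnostic string."""
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        return f"{pkg} not installed"
    return "ok" if installed.startswith(prefix) else f"{pkg} {installed} does not match {prefix}*"
```

For example, `check_pinned("transformers", "4.40.")` would flag any environment running a later series than the one the reported scores were produced with.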
Specifically, I used Yi-9B-Chat as an example: 4402_yi.zip
subset | Leaderboard | local evaluation |
---|---|---|
complete | 42.4 | 41.8 (🔻0.6) |
instruct | 34.5 | 33.4 (🔻1.1) |
The weird extra spaces disappeared in the attached outputs. I haven't noticed anyone discussing similar issues before. It should be a big issue, IMO. However, for lack of a detailed investigation, I don't know which part of the implementation caused such a degradation. Let me know if you'd like to investigate this. Otherwise, we can simply file an issue in the `transformers` repo.
Hi @terryyz, thanks a lot for the quick turnaround and spotting the root cause!
I agree that filing this issue with the `transformers` folks is a good idea. This sounds like an unexpected change on the `transformers` side, and they should know better than we do what changed between v4.40.* and later versions.
Thank you again for your tremendous help!
I answered on the thread but am available to fix this asap! sounds bad 😢
Thanks @ArthurZucker! Hope it will be fixed soon. I expect this issue will greatly affect other benchmarks. It should be a big problem, but no one has concretely discussed this...
Hi @takkyu2! Just a note that v0.1.8 has been released with a temporary fix. More details about BigCodeBench-Hard can be found in https://huggingface.co/blog/terryyz/bigcodebench-hard.
Closed this issue for now :)
Thanks a lot @terryyz for addressing the issue, and congratulations to the bigcodebench-hard release 🎉! I will try v0.1.8 when I have enough bandwidth.
Hi team! First things first, thank you for creating this wonderful benchmark! I believe its curation and evaluation required a lot of effort, so I really appreciate it that you open-sourced the datasets and evaluation scripts for the community.
Summary of the issue
I have been trying to reproduce the leaderboard values by running the scripts locally, and I found that the metrics evaluated locally are consistently worse than the leaderboard values.
Although I understand that it is very hard to reproduce the exact leaderboard values, the difference is rather large: for `01-ai/Yi-1.5-9B-Chat`, the absolute difference in pass@1 is 6.3 for the complete subset and 4.1 for the instruct subset, respectively. Please let me know if I have made any mistakes on my side or if I can provide further information for diagnosing the issue. Thank you!
Results
Notes
I got the error `failed to map segment from shared object` during evaluation.
Steps to reproduce
I ran `01-ai/Yi-1.5-9B-Chat` on an A10 GPU to generate the LLM responses and then evaluated them, using Docker images for both steps. The evaluation was done on 2024-07-08 14:00 for complete and 2024-07-08 18:57 for instruct.
The generation script:
The evaluation script:
Docker images: