Open esslushy opened 1 year ago
wow, it's the full pizza
Thanks a lot Claire for adding this! Did you run some tests to measure the evaluation time of problem execution before and after the fix? We're interested in how this impacts low-sample generation (e.g. n_samples=5) and high-sample generation (e.g. n_samples=200).
You can run
accelerate launch main.py \
    --model bigcode/santacoder \
    --task multiple-py \
    --n_samples 5 \
    --batch_size 10 \
    --max_length_generation 512 \
    --temperature 0.2 \
    --trust_remote_code \
    --allow_code_execution \
    --save_generations
(use --n_samples 200 for the high-sample run)
The same generations can be used for both experiments (by providing --load_generations_path the second time).
Sure! I have just run the tests and got the following results:
Old implementation evaluation time:
5 samples: real 0m25.058s, user 0m15.062s, sys 0m7.057s
200 samples: real 2m6.746s, user 3m38.928s, sys 0m36.204s
New implementation evaluation time:
5 samples: real 0m9.286s, user 0m14.529s, sys 0m7.347s
200 samples: real 1m0.409s, user 2m24.066s, sys 0m37.152s
It appears parallelizing in this form has an effect even when there are many completions per problem. This was run on a machine with an Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz with 96 cores. I used the same generations for both runs, varying only the number of samples.
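To put the numbers above in perspective, the wall-clock ("real") times work out to roughly a 2.7x speedup at 5 samples and a 2.1x speedup at 200 samples:

```python
# Wall-clock ("real") times from the runs above, in seconds.
old_5, new_5 = 25.058, 9.286
old_200, new_200 = 2 * 60 + 6.746, 60.409

print(f"5 samples:   {old_5 / new_5:.2f}x speedup")
print(f"200 samples: {old_200 / new_200:.2f}x speedup")
```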
Thanks Claire, I ran the code on my side. There's indeed a speedup, but the evaluation score decreases significantly with the new implementation (for the same generations). Maybe some problems aren't executed correctly?
Evaluated 161 problems in 189.46s with old implementation
{
"multiple-py": {
"pass@1": 0.17888198757763973
},
"config": {
"model": "bigcode/santacoder",
"temperature": 0.2,
"n_samples": 5
}
}
Evaluated 161 problems in 76.72744512557983 seconds with new implementation
{
"multiple-py": {
"pass@1": 0.084472049689441
},
"config": {
"model": "bigcode/santacoder",
"temperature": 0.2,
"n_samples": 5
}
}
command
accelerate launch main.py --model bigcode/santacoder --task multiple-py --n_samples 5 --batch_size 10 --max_length_generation 512 --temperature 0.2 --trust_remote_code --allow_code_execution
Do you reproduce this?
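For context on what the pass@1 numbers in these dumps mean: to my knowledge the harness uses the standard unbiased pass@k estimator from the Codex paper, 1 - C(n-c, k)/C(n, k) for n samples of which c pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from the n generated ones passes,
    given that c of the n pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to c / n, so with n_samples=5 a pass@1 of ~0.179
# corresponds to roughly 0.9 passing completions per problem on average.
print(pass_at_k(5, 1, 1))  # ≈ 0.2
```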
That's strange. When I use the same generations file in evaluation-only mode I get the same results, but it seems that when you rerun generation it differs. However, on my machine I am not able to replicate this, either using the same generations in eval-only mode or rerunning the entire process with newly made generations; both gave approximately 0.18. Do you have the script you ran to test both? Perhaps I could replicate it with that, but going by the command line alone I did not see this issue.
For new version
Saved 161 problems in /tmp/tmp4s0k7ayn for evaluation, each problem has 5 completions
{
"multiple-py": {
"pass@1": 0.18012422360248448
},
"config": {
"model": "bigcode/santacoder",
"temperature": 0.2,
"n_samples": 5
}
}
For old version
{
"multiple-py": {
"pass@1": 0.18012422360248448
},
"config": {
"model": "bigcode/santacoder",
"temperature": 0.2,
"n_samples": 5
}
}
However, while the score is consistent across versions on the same generations, I am not sure why it diverges so much on different generations. There might be an issue with my testing, so let me know what you did to test the two versions.
Here’s my command:
accelerate launch main.py --model bigcode/santacoder --task multiple-py --n_samples 5 --batch_size 5 --max_length_generation 512 --temperature 0.2 --trust_remote_code --allow_code_execution --save_generations --save_generations_path test_speed.json
Which gives with the current implementation
{
"multiple-py": {
"pass@1": 0.08819875776397515
},
"config": {
"model": "bigcode/santacoder",
"temperature": 0.2,
"n_samples": 5
}
}
I also tried just doing the execution part and got the same score. I uploaded the generations here if you want to try the command below:
accelerate launch main.py --model bigcode/santacoder --task multiple-py --n_samples 5 --trust_remote_code --allow_code_execution --load_generations_path test_speed.json
I have a suspicion that accelerate is doing some caching behind the scenes, which is preventing me from seeing the changes and possibly the bug. I am going to experiment with it some more and see if I can replicate it.
I downloaded your generations, ran them with my implementation on my machine, and got the following:
Saved 161 problems in /tmp/tmp8qnelpp2 for evaluation, each problem has 5 completions
{
"multiple-py": {
"pass@1": 0.18509316770186335
},
"config": {
"model": "bigcode/santacoder",
"revision": null,
"temperature": 0.2,
"n_samples": 5
}
}
Could you confirm whether running the normal bigcode implementation gives the same result on your machine? I am still unable to replicate this issue. To test with your generations, my command was:
python main.py --model bigcode/santacoder --temperature 0.2 --load_generations_path loubna_gens.json --n_samples 5 --allow_code_execution --tasks multiple-py
To me it looks like you are doing concurrent writes to test_results["results"]. You may want to add a mutex, e.g. in test_results["lock"].
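A minimal sketch of that suggestion (the actual test_results layout in the harness may differ; this shape is assumed for illustration): store a threading.Lock alongside the shared results and hold it for every update.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shape of the shared structure: results plus their lock.
test_results = {"results": [], "lock": threading.Lock()}

def record_result(task_id: int, passed: bool) -> None:
    # The lock serializes updates, so compound read-modify-write
    # operations on the shared structure stay consistent across threads.
    with test_results["lock"]:
        test_results["results"].append((task_id, passed))

with ThreadPoolExecutor(max_workers=8) as pool:
    for i in range(100):
        pool.submit(record_result, i, i % 2 == 0)

print(len(test_results["results"]))  # 100
```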
I have successfully replicated the issue and found something strange: the tests don't line up properly with the problems, which is causing the evaluator to fail. Since this step is done synchronously, I am not sure what is happening, but I will investigate further.
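One common cause of the "tests don't line up with the problems" symptom in parallel executors is relying on completion order instead of submission order. A sketch of the usual defense (check is a hypothetical stand-in for executing one completion): key each future by its (task_id, completion_id) so results can be regrouped deterministically regardless of finish order.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed

def check(task_id: str, completion_id: int) -> bool:
    # Stand-in for running one completion against its problem's tests.
    return completion_id % 2 == 0

problems = {"p1": 3, "p2": 2}  # task_id -> number of completions

results = defaultdict(dict)
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {
        pool.submit(check, tid, cid): (tid, cid)
        for tid, n in problems.items()
        for cid in range(n)
    }
    # as_completed yields in finish order, not submit order, so the
    # (task_id, completion_id) key is what keeps results aligned.
    for fut in as_completed(futures):
        tid, cid = futures[fut]
        results[tid][cid] = fut.result()

print({t: [results[t][c] for c in sorted(results[t])] for t in sorted(results)})
# {'p1': [True, False, True], 'p2': [True, False]}
```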
I have attempted to fix the issue. I believe it was due to all the IO happening at once, with writes overwriting each other. I wasn't quite sure why we had to write all results to a tmp file and then read them back, as it appears to work fine without that step. Perhaps I am missing something and it is crucial, so let me know.
Just realized I forgot to provide follow-up details. I have run the command:
accelerate launch main.py --model bigcode/santacoder --task multiple-py --n_samples 5 --batch_size 5 --max_length_generation 512 --temperature 0.2 --trust_remote_code --allow_code_execution --save_generations --save_generations_path test_speed.json
and got
{
"multiple-py": {
"pass@1": 0.17888198757763973
},
"config": {
"model": "bigcode/santacoder",
"revision": null,
"temperature": 0.2,
"n_samples": 5
}
}
This looks right to me, but let me know if you can't recreate this.
Thanks for the update Claire, but I still get discrepancies compared to the original implementation. Are you sure your results aren't cached from some previous execution with the original code?
accelerate launch main.py --model bigcode/santacoder --task multiple-py --n_samples 5 --batch_size 5 --max_length_generation 512 --temperature 0.2 --trust_remote_code --allow_code_execution --save_generations --save_generations_path test_speed.json
{
"multiple-py": {
"pass@1": 0.09192546583850932
},
"config": {
"model": "bigcode/santacoder",
"revision": null,
"temperature": 0.2,
"n_samples": 5
}
}
accelerate launch main.py --model bigcode/santacoder --task multiple-py --n_samples 5 --batch_size 5 --max_length_generation 512 --temperature 0.2 --trust_remote_code --allow_code_execution --load_generations_path /fsx/loubna/code/dev/claire/new/bigcode-evaluation-harness/test_speed.json
{
"multiple-py": {
"pass@1": 0.18385093167701863
},
"config": {
"model": "bigcode/santacoder",
"revision": null,
"temperature": 0.2,
"n_samples": 5
}
}
The code looks good to me, so I'm not sure either what's causing the issue. If we can't fix it, we can probably stick to the original implementation, since we now usually use 50 samples per problem for pass@1 computation, for which the execution is fast.
I added a new lock to the test_results dictionary that everything is added to. While it is not used in the problem-only version, I believe it is necessary because of how I set up the dictionaries. I am still unable to replicate your issue, but I suspect this should help, as it will make things more consistent.
As for the caching idea, it is quite possible, but I have deleted the old results and regenerated them. I even switched to a new virtualenv to make sure. Is there another way it could be caching that I am not aware of? Please let me know and I will see if clearing everything fixes it.
I have run evaluation using the container setup with podman running the following two commands:
To build: podman build --file=Dockerfile-multiple --tag=evaluation-harness-multiple .
To run: podman run -v /home/claire.schlesinger/bigcode-evaluation-harness/test_speed.json:/test_speed.json:z -it evaluation-harness-multiple python3 main.py --model bigcode/santacoder --tasks multiple-py --load_generations_path /test_speed.json --allow_code_execution --temperature 0.2 --n_samples 5
I got the following results:
Assembled 161 problems for evaluation, each problem has 5 completions
{
"multiple-py": {
"pass@1": 0.17888198757763973
},
"config": {
"model": "bigcode/santacoder",
"revision": null,
"temperature": 0.2,
"n_samples": 5
}
}
This was after I cleaned my cache, and you can find the generations here.
I have restructured problem evaluation: instead of looping over the problems and running each problem's completions in parallel, all completions across all problems are now executed in parallel at once. This ends up being less space efficient, but allows for greater parallelization.
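The change described above can be sketched roughly as follows (run_tests is a hypothetical stand-in; the real harness code differs): flatten every (problem, completion) pair into one job list, execute all of it in a single pool, then regroup by problem. Holding all completions in flight at once is where the extra memory goes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate

def run_tests(job):
    # Stand-in for executing one completion against its problem's tests.
    task_id, completion = job
    return completion.endswith("ok")

# Before: one pool per problem, over only that problem's completions.
# After: one flat job list covering every completion of every problem.
problems = {"p1": ["a ok", "b no"], "p2": ["c ok", "d ok", "e no"]}
jobs = [(tid, comp) for tid, comps in problems.items() for comp in comps]

with ThreadPoolExecutor(max_workers=8) as pool:
    # Executor.map preserves input order, so the flat result list can
    # be sliced back into per-problem groups deterministically.
    flat = list(pool.map(run_tests, jobs))

counts = [len(c) for c in problems.values()]
offsets = [0, *accumulate(counts)]
grouped = {
    tid: flat[offsets[i]:offsets[i + 1]]
    for i, tid in enumerate(problems)
}
print(grouped)  # {'p1': [True, False], 'p2': [True, True, False]}
```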