bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Attempt to make MultiPL-E's evaluation parallelize over all completions at once rather than just over each problem. #86

Open esslushy opened 1 year ago

esslushy commented 1 year ago

I have attempted to restructure the evaluation loops: instead of looping over problems and parallelizing only over each problem's completions, the new code parallelizes the execution of all completions across all problems at once. This ends up being less space efficient, but allows for greater parallelization.
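
Roughly, the change amounts to flattening the two loops before handing work to the executor. A minimal sketch of the idea (hypothetical names such as run_test and the "completions" key are placeholders, not the PR's actual code):

from concurrent.futures import ProcessPoolExecutor

def evaluate_all(problems, run_test, max_workers=None):
    # Old approach: for each problem, parallelize only over that problem's completions.
    # New approach: flatten every (problem, completion) pair and parallelize over all of them.
    pairs = [(problem, completion)
             for problem in problems
             for completion in problem["completions"]]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # run_test receives one (problem, completion) pair and returns its pass/fail result.
        return list(pool.map(run_test, pairs))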

arjunguha commented 1 year ago

wow, it's the full pizza

loubnabnl commented 1 year ago

Thanks a lot Claire for adding this! Did you run some tests to measure the evaluation time of the problem execution before and after the fix? We're interested in how this impacts low-sample (e.g. n_samples=5) and high-sample (e.g. n_samples=200) generation. You can run:

accelerate launch main.py \
    --model bigcode/santacoder \
    --task multiple-py \
    --n_samples 5 \
    --batch_size 10 \
    --max_length_generation 512 \
    --temperature 0.2 \
    --trust_remote_code \
    --allow_code_execution \
    --save_generations

(run once with --n_samples 5 and once with --n_samples 200)

The same generations can be used for both experiments (by providing --load_generations_path the second time).

esslushy commented 1 year ago

Sure! I have just run the tests and got the following results

Old implementation evaluation time:
    5 samples:   real 0m25.058s  user 0m15.062s  sys 0m7.057s
    200 samples: real 2m6.746s   user 3m38.928s  sys 0m36.204s

New implementation evaluation time:
    5 samples:   real 0m9.286s   user 0m14.529s  sys 0m7.347s
    200 samples: real 1m0.409s   user 2m24.066s  sys 0m37.152s

It appears parallelizing in this form has an effect even when there are many completions per problem. This was run on a machine with an Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz with 96 cores. I used the same generations for both implementations, varying only the number of samples.

loubnabnl commented 1 year ago

Thanks Claire, I ran the code on my side. There's indeed a speedup, but the evaluation score decreases significantly with the new implementation (for the same generations); maybe some problems aren't executed correctly?

Evaluated 161 problems in 189.46s with old implementation
{
  "multiple-py": {
    "pass@1": 0.17888198757763973
  },
  "config": {
    "model": "bigcode/santacoder",
    "temperature": 0.2,
    "n_samples": 5
  }
}
Evaluated 161 problems in 76.72744512557983 seconds with new implementation
{
  "multiple-py": {
    "pass@1": 0.084472049689441
  },
  "config": {
    "model": "bigcode/santacoder",
    "temperature": 0.2,
    "n_samples": 5
  }
}

command

accelerate launch main.py --model bigcode/santacoder --task multiple-py --n_samples 5 --batch_size 10 --max_length_generation 512  --temperature 0.2 --trust_remote_code   --allow_code_execution 

Do you reproduce this?

esslushy commented 1 year ago

That's strange. When I use the same generations file in evaluation-only mode I get the same results, but it seems that when you rerun the generations they differ. However, on my machine I am not able to replicate this, either using the same generations in eval-only mode or rerunning the entire process with newly made generations; both were approximately 0.18. Do you have the script you ran to test both? Perhaps I could replicate the issue with that, but going by the command line alone I did not see it.

For new version

Saved 161 problems in /tmp/tmp4s0k7ayn for evaluation, each problem has 5 completions
{
  "multiple-py": {
    "pass@1": 0.18012422360248448
  },
  "config": {
    "model": "bigcode/santacoder",
    "temperature": 0.2,
    "n_samples": 5
  }
}

For old version

{
  "multiple-py": {
    "pass@1": 0.18012422360248448
  },
  "config": {
    "model": "bigcode/santacoder",
    "temperature": 0.2,
    "n_samples": 5
  }
}

However, while the results are consistent on the same generations, I am not sure why the two versions also agree so closely for me on different generations but not for you. There might be an issue with my testing, so let me know what you did to test the different versions.

loubnabnl commented 1 year ago

Here’s my command:

 accelerate launch main.py --model bigcode/santacoder --task multiple-py --n_samples 5 --batch_size 5 --max_length_generation 512  --temperature 0.2 --trust_remote_code   --allow_code_execution  --save_generations --save_generations_path test_speed.json

Which gives the following with the current implementation:


{
  "multiple-py": {
    "pass@1": 0.08819875776397515
  },
  "config": {
    "model": "bigcode/santacoder",
    "temperature": 0.2,
    "n_samples": 5
  }
}

I also tried just doing the execution part and got the same score. I uploaded the generations here if you want to try the command below:

accelerate launch main.py --model bigcode/santacoder --task multiple-py --n_samples 5 --trust_remote_code --allow_code_execution --load_generations_path test_speed.json

esslushy commented 1 year ago

I have a suspicion that accelerate is doing some caching behind the scenes which is preventing me from seeing the changes and possibly the bug. I am going to experiment with it some more and see if I can replicate it.

esslushy commented 1 year ago

I have downloaded your generations, run them through my implementation on my machine, and got the following:

Saved 161 problems in /tmp/tmp8qnelpp2 for evaluation, each problem has 5 completions
{
  "multiple-py": {
    "pass@1": 0.18509316770186335
  },
  "config": {
    "model": "bigcode/santacoder",
    "revision": null,
    "temperature": 0.2,
    "n_samples": 5
  }
}

Could you confirm whether running the normal bigcode implementation gives the same result on your machine? I am still unable to replicate this issue. My command to test with your generations was:

python main.py --model bigcode/santacoder --temperature 0.2 --load_generations_path loubna_gens.json --n_samples 5 --allow_code_execution --tasks multiple-py

cassanof commented 1 year ago

To me it looks like you are doing concurrent writes to test_results["results"]. You may want to guard them with a mutex in test_results["lock"].
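
For example, a guarded update might look roughly like this (a sketch only; record_result and the dictionary layout are hypothetical, not the PR's actual code):

import threading

test_results = {"results": {}, "lock": threading.Lock()}

def record_result(problem_id, completion_idx, passed):
    # Hold the lock so concurrent workers never interleave updates to the shared dict.
    with test_results["lock"]:
        test_results["results"].setdefault(problem_id, {})[completion_idx] = passed

Note that a threading.Lock only protects threads within one process; if the completions are executed in separate processes, something like multiprocessing.Manager().Lock() (or per-worker files merged afterwards) would be needed instead.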

esslushy commented 1 year ago

I have successfully replicated the issue, and I have found something strange: the tests don't line up properly with the problems, which is causing the evaluator to fail. As this part is done synchronously, I am not sure what is happening, but I will investigate further.
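
One common way to keep results aligned when flattening the loops is to carry the problem identifier alongside each completion and reassemble by key afterwards. A sketch only, with hypothetical names (run_test, the "completions" key), not the PR's actual code:

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def evaluate_keyed(problems, run_test):
    # run_test(problem, completion) -> bool is a placeholder for the real test executor.
    results = defaultdict(list)
    with ThreadPoolExecutor() as pool:
        # Key each future by (problem index, completion index) so results cannot be mixed up.
        futures = {
            pool.submit(run_test, problem, completion): (problem_id, idx)
            for problem_id, problem in enumerate(problems)
            for idx, completion in enumerate(problem["completions"])
        }
        for future, (problem_id, idx) in futures.items():
            results[problem_id].append((idx, future.result()))
    # Restore per-problem completion order before scoring.
    return {pid: [passed for _, passed in sorted(v)] for pid, v in results.items()}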

esslushy commented 1 year ago

I have attempted to fix the issues. I believe the problem was all the I/O: the result files were overwriting each other. I wasn't quite sure why we had to write all results to a tmp file and then read them back, as it appears to work fine without that step. Perhaps it is crucial for something I am missing, so let me know.
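
If the temporary files do need to be kept, one way to avoid such overwrites is to give each problem its own uniquely named file. A sketch under that assumption (write_problem_results is a hypothetical helper, not part of the harness):

import json
import tempfile
from pathlib import Path

def write_problem_results(out_dir, problem_id, results):
    # A unique file per problem (mkstemp-style naming) prevents workers from clobbering
    # each other's output; delete=False keeps the file around for the later read-back step.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", prefix=f"problem{problem_id}_", dir=out_dir, delete=False
    ) as f:
        json.dump(results, f)
        return Path(f.name)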

esslushy commented 1 year ago

Just realized I forgot to provide follow-up details. I have run the command:

accelerate launch main.py --model bigcode/santacoder --task multiple-py --n_samples 5 --batch_size 5 --max_length_generation 512  --temperature 0.2 --trust_remote_code   --allow_code_execution  --save_generations --save_generations_path test_speed.json

and got

{
  "multiple-py": {
    "pass@1": 0.17888198757763973
  },
  "config": {
    "model": "bigcode/santacoder",
    "revision": null,
    "temperature": 0.2,
    "n_samples": 5
  }
}

This looks right to me, but let me know if you can't recreate this.

loubnabnl commented 1 year ago

Thanks for the update Claire, but I still get discrepancies compared to the original implementation. Are you sure your results aren't cached from some previous execution with the original code?

{
  "multiple-py": {
    "pass@1": 0.09192546583850932
  },
  "config": {
    "model": "bigcode/santacoder",
    "revision": null,
    "temperature": 0.2,
    "n_samples": 5
  }
}
{
  "multiple-py": {
    "pass@1": 0.18385093167701863
  },
  "config": {
    "model": "bigcode/santacoder",
    "revision": null,
    "temperature": 0.2,
    "n_samples": 5
  }
}

The code looks good to me, so I'm not sure either what's causing the issue. If we can't fix it we can probably stick to the original implementation, since we now usually use 50 samples per problem for pass@1 computation, for which the execution is fast.
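
For reference, pass@1 from n samples per problem is typically computed with the unbiased pass@k estimator from the Codex/HumanEval evaluation code (which, I believe, is also what this harness uses); a sketch:

import numpy as np

def pass_at_k(n, c, k):
    # n: total samples per problem, c: number of correct samples, k: the k in pass@k.
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 9 passing completions out of 50 samples gives pass@1 = 9/50 = 0.18.
print(pass_at_k(50, 9, 1))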

esslushy commented 1 year ago

I added a new lock to the test_results dictionary that everything is added to. While no lock is used in the per-problem version, I believe it is necessary here because of how I set up the dictionaries. I am still unable to replicate your issue, but I suspect this should help, as it will make things more consistent.

As for the caching idea, it is quite possible, but I have deleted the old results and regenerated them. I even switched to a new virtualenv to make sure. Is there another way it could be caching that I am not aware of? Please let me know and I will see whether clearing everything makes it work.

esslushy commented 1 year ago

I have run the evaluation using the container setup with podman, running the following two commands.

To build:

podman build --file=Dockerfile-multiple --tag=evaluation-harness-multiple

To run:

podman run -v /home/claire.schlesinger/bigcode-evaluation-harness/test_speed.json:/test_speed.json:z -it evaluation-harness-multiple python3 main.py --model bigcode/santacoder --tasks multiple-py --load_generations_path /test_speed.json --allow_code_execution --temperature 0.2 --n_samples 5

I got the following results:

Assembled 161 problems for evaluation, each problem has 5 completions
{
  "multiple-py": {
    "pass@1": 0.17888198757763973
  },
  "config": {
    "model": "bigcode/santacoder",
    "revision": null,
    "temperature": 0.2,
    "n_samples": 5
  }
}

This was after I cleaned my cache, and you can find the generations here.