bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

MultiPL-E Integration #12

Closed loubnabnl closed 1 year ago

loubnabnl commented 1 year ago

As part of the MultiPL-E benchmark integration, create a Dockerfile/Docker image with all the dependencies required to execute the code generations for the different programming languages.

arjunguha commented 1 year ago

Tagging @ytzi and @canders1. Let's do this together. I actually have a little evaluation container in progress that only supports Racket (never mind why!).

Here is my high-level sketch:

  1. We build a container that runs with `--network none` and supports all 18+ languages. (Let's also get some of the other languages that are in the Stack to work.)
  2. It receives the name of a read-only directory/file that contains JSON completions ([example file](https://github.com/nuprl/MultiPL-E/blob/mbpp/experiments/go-davinci-0.2-keep/HumanEval_0_has_close_elements.json)).
  3. It receives the name of a writable directory where it emits the result of a run ([example file](https://github.com/nuprl/MultiPL-E/blob/mbpp/experiments/go-davinci-0.2-keep/HumanEval_0_has_close_elements.results.json)).

I think Docker/Podman will allow both of the directories above to be the same directory, if desired.
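A rough sketch of how the harness could invoke such a container, just to make the three points above concrete: the image name and the `--input-dir`/`--output-dir` entrypoint arguments are placeholders I made up, only the `--network none` flag and the read-only/writable mounts come from the sketch itself.

```python
# Hypothetical invocation sketch; "multipl-e-eval" and the entrypoint
# arguments are placeholders, not the real MultiPL-E interface.
import subprocess

completions_dir = "/path/to/completions"  # read-only JSON completions
results_dir = "/path/to/results"          # writable, receives *.results.json

subprocess.run(
    [
        "podman", "run", "--rm",
        "--network", "none",                    # no network access inside the container
        "-v", f"{completions_dir}:/inputs:ro",  # completions mounted read-only
        "-v", f"{results_dir}:/outputs",        # results directory mounted writable
        "multipl-e-eval",                       # placeholder image name
        "--input-dir", "/inputs",
        "--output-dir", "/outputs",
    ],
    check=True,
)
```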

I think we should do this in a separate repository that omits the analysis scripts, inference code, datasets, and results. BigCode will have its own analysis/inference approach.

loubnabnl commented 1 year ago

Yes, it makes sense to do this in a separate repository (maybe use the original MultiPL-E repo?); we can use this issue to discuss and track the progress of the task.

In any case, we will need to add the evaluation metrics to this repo, and then add the Dockerfile to the setup instructions when it's ready.
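For reference, the metric used by HumanEval-style benchmarks, and so presumably what would be added here, is pass@k. A minimal sketch of the standard unbiased estimator (Chen et al., 2021), with illustrative names rather than this repo's actual API:

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: probability that at least one of k
    sampled generations passes, given n total samples of which c passed."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))


# Example: 200 generations for a problem, 37 of them pass the unit tests.
print(pass_at_k(n=200, c=37, k=10))
```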

arjunguha commented 1 year ago

This is now in progress here:

https://github.com/nuprl/MultiPL-E/tree/only_code/evaluation

When it is a little more mature, we will upload a container image. For now, you have to build it yourself. The Makefile has commands to build the container and to test it:

https://github.com/nuprl/MultiPL-E/blob/only_code/evaluation/Makefile#L5

(You can replace `podman` with `docker`; either should work.)

loubnabnl commented 1 year ago

Closing this issue, as MultiPL-E was integrated in https://github.com/bigcode-project/bigcode-evaluation-harness/pull/44 🥳