bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Update docker image for HumanEvalPack-Synthesize. #279

Closed zongyf02 closed 4 weeks ago

zongyf02 commented 1 month ago

I would like to evaluate generated humanevalsynthesize code in a docker container. I'm running:

docker run -v ./_humanevalsynthesize-${lang}.json:/app/generations.json:ro -it --rm evaluation-harness-multiple \
      --model model \
      --tasks humanevalsynthesize-${lang} \
      --load_generations_path /app/generations.json \
      --allow_code_execution \
      --do_sample False \
      --n_samples 1 > ./${lang}_results.log
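
For anyone who wants to run the same command for every language, here is a minimal Python sketch that builds the argument list once per language. The flags are copied from the command above. The helper name `build_command`, the `LANGS` list, and the assumption that every task follows the `humanevalsynthesize-<lang>` naming are mine, not something the harness documents here. I also left out `-it`, since the output is redirected to a log file, and resolved the mount path to an absolute one.

```python
import shlex
from pathlib import Path

# Language suffixes as listed in this issue. Assumption: every task name
# follows the humanevalsynthesize-<lang> pattern used in the command above.
LANGS = ["python", "js", "java", "cpp", "go", "rust"]


def build_command(lang, image="evaluation-harness-multiple", workdir="."):
    """Return the `docker run` argument list for one language (hypothetical helper)."""
    # Resolve to an absolute host path for the read-only bind mount.
    generations = (Path(workdir) / f"_humanevalsynthesize-{lang}.json").resolve()
    return [
        "docker", "run",
        "-v", f"{generations}:/app/generations.json:ro",
        "--rm", image,
        "--model", "model",
        "--tasks", f"humanevalsynthesize-{lang}",
        "--load_generations_path", "/app/generations.json",
        "--allow_code_execution",
        "--do_sample", "False",
        "--n_samples", "1",
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for lang in LANGS:
        print(shlex.join(build_command(lang)))
```

Each list can be passed to `subprocess.run` with `stdout` pointed at a per-language log file once the printed commands look right.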

The "evaluation-harness" docker image lacks the dependencies needed to evaluate languages such as JavaScript or Rust, so I'm using the latest "evaluation-harness-multiple" image.

When evaluating on _humanevalsynthesize-cpp.json, the "pass@1" result is always 0. However, outside of the container, the same generation has a "pass@1" of about 0.3.
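
For context on what a "pass@1" of 0 means with `--n_samples 1`: under the standard unbiased pass@k estimator, each problem scores 1 if its single sample passes and 0 otherwise, and the reported number is the mean over problems. A score of exactly 0 therefore means every sample failed, which is what I would expect if nothing compiles. The sketch below is the textbook formula and is not copied from the harness; the function names are my own.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(passed):
    """Average pass@1 over problems when each problem has a single sample."""
    return sum(pass_at_k(1, int(p), 1) for p in passed) / len(passed)
```

With one sample per problem this reduces to the fraction of problems whose sample passes the tests.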

When evaluating on _humanevalsynthesize-rust.json, the task fails with FileNotFoundError: [Errno 2] No such file or directory: 'cargo'. The full error log is in rust_results.log. Note that multiple-rs works correctly in that container, but humanevalsynthesize-rust does not.
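
A quick way to see this kind of problem before a full run is to check inside the container whether the toolchain binaries are on PATH. The sketch below uses `shutil.which` for that. Only `cargo` is confirmed by the error above; the other binary names in `REQUIRED_TOOLS` are my guesses at what each language needs, and `missing_tools` is a hypothetical helper, not part of the harness.

```python
import shutil

# Hypothetical mapping from language to the executables it needs.
# Only "cargo" is confirmed by the error in this issue; the rest are guesses.
REQUIRED_TOOLS = {
    "python": ["python3"],
    "js": ["node"],
    "java": ["javac", "java"],
    "cpp": ["g++"],
    "go": ["go"],
    "rust": ["cargo"],
}


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    for lang, tools in REQUIRED_TOOLS.items():
        missing = missing_tools(tools)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{lang}: {status}")
```

The `which` parameter exists only so the lookup can be swapped out in a test; in normal use the default is enough.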

In short, I'd like to evaluate the 6 humanevalpack languages (python, js, java, cpp, go, rust) in a container, but neither evaluation-harness nor evaluation-harness-multiple fully works.

Thanks.

timrbula commented 1 month ago

Hey @zongyf02, we ran into the same issues and built our image like this:

FROM ghcr.io/nuprl/multipl-e-evaluation:v3.1

# Set env vars
ENV CARGO_HOME=/.cache/cargo/
ENV GOFLAGS="-mod=mod"

# Go: create the /go directory
RUN mkdir /go

# Rust: install cargo
RUN apt-get update && apt-get install -yqq cargo

# C++: install the Boost and OpenSSL development libraries
RUN apt-get update && apt-get install -yqq libboost-dev libssl-dev

# rest of default Dockerfile
...

zongyf02 commented 4 weeks ago

Works perfectly. Thank you!