bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

fix: Multiple-E dataset fix go_test.go path for test execution #225

Open hitesh-1997 opened 2 months ago

hitesh-1997 commented 2 months ago

Context along with reproducer described in the issue: Resolves https://github.com/bigcode-project/bigcode-evaluation-harness/issues/224

When I checked the different values the language field can take in the dataset for the languages supported in multiple.py, it seems all of them match the expected language name except go (repro and screenshot below).

from datasets import load_dataset

LANGUAGES = ["py", "sh", "cpp", "cs", "d", "go", "java", "js", "jl", "lua", "pl", "php", "r", "rkt", "rb", "rs", "scala", "swift", "ts"]
for lang in LANGUAGES:
    data = load_dataset('nuprl/MultiPL-E', f'humaneval-{lang}', split='test', revision="d23b094346c5dbda1080a74bb2a24c18adbf7409")
    print(f"languages in MultiPL-E for {lang}: {set(dt['language'] for dt in data)}")
[screenshot of the printed output]

I traced the flow of the problem['language'] field that this PR changes, to make sure it doesn't affect any other language. It appears this field is only used in containerized_eval.py to decide which evaluation script to execute and which file extension to use, so it shouldn't affect the other languages. I guess the additional fields and the separate handling of go_test.go were added for different dataset revisions?
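To make the concern concrete, here is a minimal sketch of the kind of dispatch described above, where a per-problem language tag selects the file extension and the command used to run the generated code. The table and function names here are illustrative assumptions, not the harness's actual API in containerized_eval.py:

```python
# Hypothetical dispatch table: language tag -> file extension and run command.
# (Illustrative only; the real harness keys off problem['language'] similarly.)
EVAL_CONFIG = {
    "py": {"ext": ".py", "cmd": ["python3"]},
    "go": {"ext": ".go", "cmd": ["go", "test"]},
    "js": {"ext": ".js", "cmd": ["node"]},
}

def eval_plan(language: str, stem: str = "program"):
    """Return (filename, command) for a problem's language tag.

    A mismatched tag (e.g. one that is not a plain language name)
    would raise KeyError here, which is why the field's value matters.
    """
    cfg = EVAL_CONFIG[language]
    filename = stem + cfg["ext"]
    return filename, cfg["cmd"] + [filename]

print(eval_plan("go"))
```

This illustrates why a go entry whose language field doesn't match the expected key breaks test execution while every other language is unaffected.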

Please let me know if the flow looks good.

Thanks !!

arjunguha commented 2 months ago

I haven't run this, but this looks accurate. It reflects the structure of the original MultiPL-E code.

hitesh-1997 commented 2 months ago

> I haven't run this, but this looks accurate. It reflects the structure of the original MultiPL-E code.

Thanks for the response !! Please let me know if I should add any additional sanity checks to verify the changes. Also, please let me know if I should tag someone else to have a look at the PR :)

hitesh-1997 commented 2 months ago

Thanks a lot for your response !! @arjunguha @loubnabnl I have some additional questions about the way we post-process the outputs. It would really help if you could take some time to answer them :)

I see that we use the stop tokens present in the dataset to extract the final generated code. I noticed that in some cases:
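For context, stop-token post-processing usually means cutting the model's raw generation at the earliest occurrence of any stop sequence. A minimal sketch of that idea (the function name is mine, not the harness's):

```python
def truncate_at_stop_tokens(generation: str, stop_tokens: list[str]) -> str:
    """Cut the generation at the earliest occurrence of any stop token.

    If no stop token appears, the generation is returned unchanged.
    """
    cut = len(generation)
    for tok in stop_tokens:
        idx = generation.find(tok)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

# A Python completion is cut where the next top-level definition starts:
print(truncate_at_stop_tokens("    return x\n\ndef next_fn():", ["\ndef", "\nclass"]))
```

The questions below are about edge cases in exactly this kind of truncation.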