Open · hitesh-1997 opened 7 months ago
I haven't run this, but this looks accurate. It reflects the structure of the original MultiPL-E code.
Thanks for the response!! Please let me know if I should add any additional checks or sanity tests to verify the changes. Also, please let me know if I should tag someone else to take a look at the PR :)
Thanks a lot for your response!! @arjunguha @loubnabnl I have some additional questions about the way we are post-processing the outputs. It would really help if you could take some time to answer them :)
I see that we are using the stop-tokens present in the dataset to extract the final generated code. I noticed a few problematic cases:

1. In some cases a helper function such as `isPrime` is cut off by the stop tokens; the test then fails, but in reality just including that function would make the test pass. I see similar outputs in the bigcode/MultiPL-E-completions dataset as well. Should we update the post-processing to handle such cases?
2. When the model imports packages such as `strings` in Go or `Counter` in Python to complete the function, those imports are not included in the final completion code, leading to test failures.
3. Some languages such as `ruby` have very basic stop tokens like `\n\n`, so if the LLM outputs one additional newline inside the function body, the generation is truncated at the wrong point (see the sketch below). After removing such tokens I find the pass rate for GPT-4 matches. Just wondering whether the post-processing should be updated, or whether the generations could be wrong on my end.
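To make the `\n\n` failure mode concrete, here is a minimal sketch of the cut-at-earliest-stop-token behavior (my reading of the MultiPL-E post-processing; the function name and the Ruby stop-token list are illustrative, not the harness's exact code):

```python
def truncate_at_stop_tokens(completion: str, stop_tokens: list[str]) -> str:
    """Cut the completion at the earliest occurrence of any stop token."""
    cut = len(completion)
    for token in stop_tokens:
        idx = completion.find(token)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

ruby_completion = (
    "  primes = []\n"
    "\n"                      # extra blank line emitted by the model
    "  (2..n).each do |i|\n"
    "    primes << i if is_prime(i)\n"
    "  end\n"
    "  primes\n"
    "end\n"
)

# With "\n\n" as a stop token, truncation happens at the first blank line,
# dropping the whole loop and leaving an incomplete method body.
print(truncate_at_stop_tokens(ruby_completion, ["\n\n", "\ndef"]))
# -> "  primes = []"
```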
Context along with a reproducer is described in the issue. Resolves https://github.com/bigcode-project/bigcode-evaluation-harness/issues/224
When I checked the different values the `language` field can take in the dataset for the languages supported in multiple.py, it seems like all of them have the same name except `go` (adding the repro and screenshot below for this statement).

I tried to track the flow of the `problem['language']` field that I am changing in this PR, to make sure it doesn't affect any other language. It seems this field is only used in containerized_eval.py to decide which evaluation script to execute and which file extension to use, so it shouldn't affect other languages. I guess the additional fields and the separate handling of go_test.go were done for different dataset revisions?
Please let me know if the flow looks good.
Thanks!!