Open · hitesh-1997 opened 7 months ago
I haven't run this, but this looks accurate. It reflects the structure of the original MultiPL-E code.
Thanks for the response!! Please let me know if I should add any additional checks or sanity tests to verify the changes. Also, please let me know if I should tag someone else to take a look at the PR :)
Thanks a lot for your response!! @arjunguha @loubnabnl I have some additional questions about the way we are post-processing the outputs. It would really help if you could take some time to answer them :)
I see that we are using the stop-tokens present in the dataset to extract the final generated code. I noticed a few problematic cases:

1. In some cases a helper function such as `isPrime` is cut off by the stop tokens; the test then fails, but in reality just including that function would make the test pass. I see similar outputs in the bigcode/MultiPL-E-completions dataset as well. Should we update the post-processing to handle such cases?
2. When the model imports packages such as `strings` in Go or `Counter` in Python to complete the function, those imports are not included in the final completion code, leading to test failures.
3. Some languages such as `ruby` have very basic stop tokens like `\n\n`, so if the LLM outputs one additional newline inside the function body, the generation is truncated at the wrong point (see the sketch below). After removing such tokens I find the pass rate for GPT-4 matches. Just wondering whether the post-processing should be updated, or whether the generations could be wrong on my end.
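To make the `\n\n` failure mode concrete, here is a minimal sketch of the cut-at-earliest-stop-token behavior (my reading of the MultiPL-E post-processing; the function name and the Ruby stop-token list are illustrative, not the harness's exact code):

```python
def truncate_at_stop_tokens(completion: str, stop_tokens: list[str]) -> str:
    """Cut the completion at the earliest occurrence of any stop token."""
    cut = len(completion)
    for token in stop_tokens:
        idx = completion.find(token)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

ruby_completion = (
    "  primes = []\n"
    "\n"                      # extra blank line emitted by the model
    "  (2..n).each do |i|\n"
    "    primes << i if is_prime(i)\n"
    "  end\n"
    "  primes\n"
    "end\n"
)

# With "\n\n" as a stop token, truncation happens at the first blank line,
# dropping the whole loop and leaving an incomplete method body.
print(truncate_at_stop_tokens(ruby_completion, ["\n\n", "\ndef"]))
# -> "  primes = []"
```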
Context along with a reproducer is described in the issue. Resolves https://github.com/bigcode-project/bigcode-evaluation-harness/issues/224
When I checked the different values the `language` field can take in the dataset for the languages supported in multiple.py, it seems like all of them have the same name except `go` (adding the repro and screenshot below for this statement).

I tried to track the flow of the `problem['language']` field that I am changing in this PR, to make sure it doesn't affect any other language. It seems this field is only used in containerized_eval.py to decide which evaluation script to execute and which file extension to use, so it shouldn't affect other languages. I guess the additional fields and the separate handling of go_test.go were done for different dataset revisions?
Please let me know if the flow looks good.
Thanks!!