SalesforceAIResearch / CodeChain

Official code for the paper "CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules"
Apache License 2.0

Inquiry about evaluation process #1

Open Alex-HaochenLi opened 8 months ago

Alex-HaochenLi commented 8 months ago

Hello, I am very excited to read CodeChain. However, I have a question about the evaluation process in this repo.

It seems that during evaluation you test the code on the test cases from test_example_tests.pkl, but not the private test cases from APPS.

Are the pass@1 results reported in the paper based on the private test cases? Thank you for your clarification.

huanhuan6666 commented 7 months ago

@Alex-HaochenLi

Same question. Have you received clarification?


Alex-HaochenLi commented 7 months ago

Not yet :)

huanhuan6666 commented 7 months ago

@Alex-HaochenLi Thank you. In fact, when I was inspecting the evaluation_codechain.sh file, I noticed the following:

# Test by hidden test cases 
python src/evaluate.py --save_gen_path $output_path --eval_split $split

In src/evaluate.py, when example_test_path is not specified, the script does not load the {split}_example_tests.pkl file. Instead, it ends up in utils_evaluate.py and uses example['input_output'] as the test cases. For problems in the test split of codeparrot/apps, input_output holds the private test cases.
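
To make the fallback concrete, here is a minimal sketch of the behaviour as I understand it. This is not the repo's actual code: `select_test_cases` and the `sample` record are hypothetical names I made up for illustration, and I am assuming that `input_output` is stored as a JSON string with `inputs` and `outputs` keys, as it is in codeparrot/apps.

```python
import json


def select_test_cases(example, example_tests=None):
    """Hypothetical helper: pick which tests a solution is run against.

    If example (public) tests were loaded, use them; otherwise fall back
    to the problem's own input_output field, i.e. the hidden tests.
    """
    if example_tests is not None:
        return example_tests, "example"
    # Assumption: input_output is a JSON-encoded string, so decode it first.
    return json.loads(example["input_output"]), "hidden"


# Hand-written stand-in for one APPS problem (not real dataset content).
sample = {
    "problem_id": 0,
    "input_output": json.dumps({"inputs": ["1 2\n"], "outputs": ["3\n"]}),
}

# No example tests supplied, so the hidden tests are selected.
tests, source = select_test_cases(sample)
print(source, len(tests["inputs"]))
```

If this reading is right, the plain `python src/evaluate.py --save_gen_path $output_path --eval_split $split` call corresponds to the `example_tests=None` branch, so the reported numbers would come from the hidden tests.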