bigcode-project / starcoder2

Home of StarCoder2!

CrossCodeEval Results for StarCoder 2 #20

Open Azure-Tang opened 6 months ago

Azure-Tang commented 6 months ago

Hi, I'm currently researching the impact of different retrieval-augmented generation (RAG) techniques on LLM performance. We are attempting to replicate the CrossCodeEval results from the "StarCoder 2 and The Stack v2: The Next Generation" paper as a baseline.

However, we have been unable to reproduce the results reported in section 7.6.2 of the paper using the publicly released CrossCodeEval data and code together with the hyperparameters specified in that section. For StarCoder2-7B's Python code completion, the paper reports a Code ES of 74.52 and an ID F1 of 68.81, whereas our replication yields a Code ES of 67.92 and an ID F1 of 58.08.
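For reference, this is how we score Code ES; a minimal sketch assuming the fuzzywuzzy-style edit-similarity ratio used by the CrossCodeEval evaluation scripts (if the official metric is computed differently, that alone could explain part of the gap):

```python
# Minimal sketch of the code-level edit similarity (ES) metric.
# Assumption: ES is the Levenshtein-based ratio from fuzzywuzzy,
# scaled to 0-100; please correct us if the official scripts differ.
from fuzzywuzzy import fuzz

def code_es(prediction: str, reference: str) -> int:
    """Edit similarity between a generated completion and the ground truth."""
    return fuzz.ratio(prediction.strip(), reference.strip())

print(code_es("return x + 1", "return x + 2"))  # ~92 for a near miss
```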

We noticed that your repository mentions the option of testing with the BigCode-Evaluation-Harness, but we could not find a CrossCodeEval task in the bigcode-project/bigcode-evaluation-harness project. We therefore used the open-source CrossCodeEval code and dataset directly, with the hyperparameters given in section 7.6.2.
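Concretely, our generation setup looks roughly like the sketch below. The checkpoint name is the public Hugging Face release; the decoding settings (greedy decoding, generation budget) are our reading of section 7.6.2, so please flag anything that differs from your setup:

```python
# Rough sketch of our replication's generation loop, not the authors'
# exact script. max_new_tokens and greedy decoding are assumptions
# based on our reading of section 7.6.2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-7b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

def complete(prompt: str, max_new_tokens: int = 50) -> str:
    """Generate a completion for one CrossCodeEval prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding, as we understood the paper
    )
    # Strip the prompt tokens, keep only the newly generated suffix.
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```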

My experiment environment is:

8× A100 40GB (DGX node)
Ubuntu 20.04
CUDA 12.1
PyTorch 2.1.2

Could you please provide any insights or additional guidance that might help us reproduce the benchmark results? Any assistance or further details you could offer would be greatly appreciated.

Thank you for your time and support.