amazon-science / cceval

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)
https://crosscodeeval.github.io/
Apache License 2.0

Raw code projects of the curated data #8

Closed · ganler closed this issue 8 months ago

ganler commented 8 months ago

Hi, thanks for the amazing new work and congratulations on the NeurIPS acceptance!

From my current understanding, the prompts for the cceval datasets (e.g., data/crosscodeeval_data/python/line_completion{*}.jsonl) are pre-processed and thus fixed. I am interested in seeing the actual full contexts (as if a developer could access all code within the project under development), which would broaden the use of cceval to evaluating retrieval-based techniques (e.g., self-RAG).
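For concreteness, here is how I currently load and inspect the released items. This is a minimal sketch: it assumes one concrete file name matching the glob above, and that each record nests the fields shown further below under a metadata key (neither is confirmed by the repo documentation):

```python
import json

# Hypothetical concrete file name matching line_completion{*}.jsonl above.
path = "data/crosscodeeval_data/python/line_completion.jsonl"

with open(path) as f:
    for line in f:
        item = json.loads(line)
        # Assumed nesting: the dict shown below lives under "metadata".
        meta = item["metadata"]
        print(meta["task_id"], meta["repository"], meta["file"])
        break  # peek at the first item only
```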

I found that each line-completion item carries a metadata field, such as:

```python
{'task_id': 'project_cc_python/1584',
 'repository': 'obahamonde-aiofauna-67993d2',
 'file': 'aiofauna/llm/schemas.py',
 'context_start_lineno': 0,
 'groundtruth_start_lineno': 85,
 'right_context_start_lineno': 86}
```
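Given a checkout of the original repository, I would expect these line-number fields to slice the source file roughly as follows. Again only a sketch: it assumes the fields are 0-indexed line offsets into the raw file (context_start_lineno == 0 suggests this, but I have not confirmed it):

```python
# Assumption: the three fields are 0-indexed line offsets into the raw file.
with open("aiofauna/llm/schemas.py") as f:
    lines = f.readlines()

left_context = "".join(lines[0:85])   # context_start_lineno .. groundtruth_start_lineno
groundtruth = "".join(lines[85:86])   # the line to be completed
right_context = "".join(lines[86:])   # from right_context_start_lineno onward
```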

I am assuming obahamonde-aiofauna-67993d2 encodes the owner, project name, and commit ID, which together identify the exact raw project (a parsing sketch is below). Nonetheless, I am curious whether there are any plans to support a lower-level dataset format for cceval in which each item includes the whole project structure (e.g., a Docker image could pre-install all projects and point to each project's root path). Thanks!
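For reference, the parsing I have in mind. This is a sketch only: it assumes the short commit hash is the last hyphen-separated segment and that the owner name contains no hyphen, neither of which is guaranteed:

```python
def parse_repository(repo_id: str):
    # Assumed layout: "<owner>-<project>-<short_commit>".
    rest, commit = repo_id.rsplit("-", 1)  # short commit hash at the end
    owner, project = rest.split("-", 1)    # first hyphen splits owner from project
    return owner, project, commit

owner, project, commit = parse_repository("obahamonde-aiofauna-67993d2")
# -> ("obahamonde", "aiofauna", "67993d2")
# The raw project could then be fetched with, e.g.:
#   git clone https://github.com/{owner}/{project} && git checkout {commit}
```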

zijwang commented 8 months ago

Please email us for this. Thanks.