amazon-science / cceval

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)
https://crosscodeeval.github.io/
Apache License 2.0

Raw code projects of the curated data #8

Closed · ganler closed this issue 8 months ago

ganler commented 8 months ago

Hi, thanks for the amazing new work and congratulations on the NeurIPS acceptance!

From my current understanding, the prompts for the cceval datasets (e.g., data/crosscodeeval_data/python/line_completion{*}.jsonl) are pre-processed and thus fixed. I am interested in seeing the actual full contexts (as if a developer could access all code within the project under development), which would broaden the use of cceval to evaluating retrieval-based techniques (e.g., self-RAG).
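For concreteness, here is how I currently load and inspect the released items. This is a minimal sketch: it assumes one concrete file name matching the glob above, and that each record nests the fields shown further below under a metadata key (neither is confirmed by the repo documentation):

```python
import json

# Hypothetical concrete file name matching line_completion{*}.jsonl above.
path = "data/crosscodeeval_data/python/line_completion.jsonl"

with open(path) as f:
    for line in f:
        item = json.loads(line)
        # Assumed nesting: the dict shown below lives under "metadata".
        meta = item["metadata"]
        print(meta["task_id"], meta["repository"], meta["file"])
        break  # peek at the first item only
```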

I found that each line-completion item carries a metadata field, such as:

```python
{'task_id': 'project_cc_python/1584',
 'repository': 'obahamonde-aiofauna-67993d2',
 'file': 'aiofauna/llm/schemas.py',
 'context_start_lineno': 0,
 'groundtruth_start_lineno': 85,
 'right_context_start_lineno': 86}
```
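Given a checkout of the original repository, I would expect these line-number fields to slice the source file roughly as follows. Again only a sketch: it assumes the fields are 0-indexed line offsets into the raw file (context_start_lineno == 0 suggests this, but I have not confirmed it):

```python
# Assumption: the three fields are 0-indexed line offsets into the raw file.
with open("aiofauna/llm/schemas.py") as f:
    lines = f.readlines()

left_context = "".join(lines[0:85])   # context_start_lineno .. groundtruth_start_lineno
groundtruth = "".join(lines[85:86])   # the line to be completed
right_context = "".join(lines[86:])   # from right_context_start_lineno onward
```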

I am assuming obahamonde-aiofauna-67993d2 encodes the owner, project name, and commit ID, which together identify the exact raw project (a parsing sketch is below). Nonetheless, I am curious whether there are any plans to support a lower-level dataset format for cceval in which each item includes the whole project structure (e.g., a Docker image could pre-install all projects and point to each project's root path). Thanks!
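For reference, the parsing I have in mind. This is a sketch only: it assumes the short commit hash is the last hyphen-separated segment and that the owner name contains no hyphen, neither of which is guaranteed:

```python
def parse_repository(repo_id: str):
    # Assumed layout: "<owner>-<project>-<short_commit>".
    rest, commit = repo_id.rsplit("-", 1)  # short commit hash at the end
    owner, project = rest.split("-", 1)    # first hyphen splits owner from project
    return owner, project, commit

owner, project, commit = parse_repository("obahamonde-aiofauna-67993d2")
# -> ("obahamonde", "aiofauna", "67993d2")
# The raw project could then be fetched with, e.g.:
#   git clone https://github.com/{owner}/{project} && git checkout {commit}
```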

zijwang commented 8 months ago

Please email us for this. Thanks.