Leolty / repobench

✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024
https://arxiv.org/abs/2306.03091
Creative Commons Attribution 4.0 International

Clarification Needed on Dataset Fields in Repobench #14

Open ramsey-coding opened 4 months ago

ramsey-coding commented 4 months ago

I am currently working with the Repobench dataset and have encountered some confusion regarding the meaning and usage of specific fields. The documentation and dataset format appear to be somewhat vague.

Could you please provide detailed explanations for the following fields?

  1. file_path:

    • Does this field indicate the full path to the specific file within the repository?
  2. context:

    • What is meant by "context"?
    • I presume it is context derived from the imports?
  3. import_statement:

  4. code:

  5. next_line:

    • Does this refer to the line that follows the provided code snippet, i.e. what comes after the code?

Additionally, does "import_statement + code" represent the code under edit?

Here is a link to a specific example from the dataset for reference: Repobench Dataset Example

Thank you for your assistance in clarifying these points.

Leolty commented 4 months ago

Hi,

Thank you for your interest! Here are detailed explanations for the specific fields you've mentioned:

file_path

You are correct. This field indicates the full path to the specific file within the repo.

context

The context refers to the definitions of the imported elements within the repository. In the example provided, these are concatenated into a long string (commented) for clarity. In the latest versions, a dictionary format is used instead of a concatenated string.
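As a rough illustration of the "concatenated, commented string" form described above, here is a minimal Python sketch. The helper name `comment_out_context`, the path-to-snippet dictionary layout, and the `# ` prefix are all assumptions made for this example; the actual dictionary format used in later dataset versions is not specified in this thread.

```python
def comment_out_context(definitions: dict[str, str]) -> str:
    """Flatten a {path: snippet} mapping into one commented string.

    The mapping layout is hypothetical; it only illustrates the idea of
    turning imported definitions into comment lines.
    """
    blocks = []
    for path, snippet in definitions.items():
        # First line names the file, then every snippet line is commented out.
        lines = [f"# {path}"]
        lines += [f"# {line}" if line else "#" for line in snippet.splitlines()]
        blocks.append("\n".join(lines))
    return "\n".join(blocks)


# Hypothetical definitions, not taken from the dataset.
definitions = {"pkg/utils.py": "def add(a, b):\n    return a + b"}
commented = comment_out_context(definitions)
print(commented)
```

Keeping the definitions as comments lets them sit above the real code in a single prompt without changing what the file would execute.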

import_statement

This field contains the parsed import statements from the current file.

code

This field represents the code currently in the file that you are working on.

next_line

This refers to the line of code that follows the provided code snippet. Essentially, it is the next line the model is expected to generate when given context + import_statement + code.

I think your understanding is generally correct. In the dataset example you linked (which is from an earlier version), the context, import statements, and code are concatenated to form the prompt sent to the code model. The next line is the ground truth that the model is expected to generate.
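To make the prompt construction concrete, here is a minimal Python sketch. It assumes a sample is a plain dict with string fields named as in this issue; the newline separator, the helper names `build_prompt` and `exact_match`, and the first-line exact-match comparison are assumptions for illustration, not the benchmark's official code.

```python
def build_prompt(sample: dict) -> str:
    """Join context, import statements, and code into one prompt string."""
    parts = [sample["context"], sample["import_statement"], sample["code"]]
    # Skip empty fields so the prompt has no stray blank lines.
    return "\n".join(part for part in parts if part)


def exact_match(prediction: str, sample: dict) -> bool:
    """Compare the first non-empty generated line with next_line."""
    lines = [line.strip() for line in prediction.splitlines() if line.strip()]
    first = lines[0] if lines else ""
    return first == sample["next_line"].strip()


# Hypothetical sample for illustration only, not taken from the dataset.
sample = {
    "file_path": "pkg/main.py",
    "context": "# pkg/utils.py\n# def add(a, b):\n#     return a + b",
    "import_statement": "from pkg.utils import add",
    "code": "def total(values):\n    result = 0\n    for v in values:",
    "next_line": "        result = add(result, v)",
}

prompt = build_prompt(sample)
print(prompt)
```

The prompt would be sent to the code model, and only the first non-empty line of its output is compared against `next_line`.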