Gather good samples from several repos (we need a variety of repos to make the samples more representative):
llama.cpp, LLamagator, Rails
Consider both public and private repos.
What should the criteria be for selecting repos, and for selecting PRs from those repos?
Manually annotate those PRs so we can turn them into a dataset.
The goal is to have 40 examples from a diverse set of repositories. At the moment we are mostly missing smaller projects (in terms of codebase size). We will publish the CSV file to Hugging Face.
This is just a list of selected repositories that we will work with during initial development; this ticket isn't about fetching the data from those PRs.
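To make the CSV and the "missing smaller projects" gap concrete, here is a minimal sketch of what the file could look like and how coverage by codebase size might be checked. Everything in it is an assumption for illustration: the column names, the size buckets, and the placeholder repo names are hypothetical, since this ticket does not define a schema or selection criteria.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; the ticket does not define a schema.
FIELDS = ["repo", "visibility", "size_bucket", "pr_url", "annotation"]


def write_samples(rows, out):
    """Write annotated PR samples as CSV to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def bucket_counts(rows):
    """Count samples per codebase-size bucket to spot under-represented sizes."""
    return Counter(row["size_bucket"] for row in rows)


# Placeholder rows only: repo names and buckets are illustrative, not real data.
rows = [
    {"repo": "example-large-repo", "visibility": "public", "size_bucket": "large", "pr_url": "", "annotation": ""},
    {"repo": "example-medium-repo", "visibility": "private", "size_bucket": "medium", "pr_url": "", "annotation": ""},
    {"repo": "example-small-repo", "visibility": "public", "size_bucket": "small", "pr_url": "", "annotation": ""},
]

buffer = io.StringIO()
write_samples(rows, buffer)
print(buffer.getvalue())
print(bucket_counts(rows))
```

Counting rows per bucket as the list grows toward 40 examples would show whether smaller projects are still under-represented.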
UPD: Link to PR Gathering Doc: https://docs.google.com/spreadsheets/d/1ELqRlo27bOSUwc3Dy77G3-V5fQy2NRgO2uKKeo6a7-E/edit?usp=sharing