EleutherAI / math-lm

MIT License
1.03k stars 79 forks

Filtering Github issues and diffs #10

Closed zhangir-azerbayev closed 1 year ago

zhangir-azerbayev commented 1 year ago

Our GitHub source code dataset is based on the deduplicated Stack, filtered down to include only numerical computing, computer algebra, and formal math.

The pilev2 includes GitHub issues and diffs subsets (available at s3://s-eai-neox/data/pilev2/pilev2_local_deduped/GithubDiff_ver2/ and s3://s-eai-neox/data/pilev2/pilev2_local_deduped/GithubIssue_ver2/). There is no good intrinsic way to determine whether an issue or diff meets our filtering criteria. Instead, we have to compute a table of the repositories that are included in our source code dataset and filter issues and diffs against that list of repositories.
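A rough sketch of the idea, not the actual script: build a set of repository names from the source code dataset, then keep only the issues and diffs whose repository is in that set. The `repo_name` field and the in-memory example records are assumptions for illustration; the real pilev2 schema and file layout may differ.

```python
from typing import Dict, Iterable, Iterator, Set


def normalize_repo(name: str) -> str:
    """Normalize 'Owner/Repo' so lookups ignore case and stray whitespace."""
    return name.strip().lower()


def build_repo_table(source_records: Iterable[Dict], repo_key: str = "repo_name") -> Set[str]:
    """Collect the repositories present in the source code dataset.

    `repo_key` is a hypothetical field name; adjust it to the real schema.
    """
    return {normalize_repo(r[repo_key]) for r in source_records if r.get(repo_key)}


def filter_by_repo(
    records: Iterable[Dict], allowed: Set[str], repo_key: str = "repo_name"
) -> Iterator[Dict]:
    """Yield only the issue/diff records whose repository is in `allowed`."""
    for record in records:
        repo = record.get(repo_key)
        if repo and normalize_repo(repo) in allowed:
            yield record


# Toy data standing in for the source code dataset and the issues subset.
source_code = [
    {"repo_name": "sympy/sympy", "text": "..."},
    {"repo_name": "leanprover-community/mathlib", "text": "..."},
]
issues = [
    {"repo_name": "SymPy/sympy", "text": "integrate() hangs"},
    {"repo_name": "some/webapp", "text": "button misaligned"},
    {"text": "record with no repository field"},
]

allowed_repos = build_repo_table(source_code)
kept_issues = list(filter_by_repo(issues, allowed_repos))
```

Records with no repository field are dropped here, since there is no way to check them against the table.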

To get started, study proof-pile-v2/source_code and write the script in a directory called proof-pile-v2/issues_and_diffs.

zhangir-azerbayev commented 1 year ago

Completed in PR #18