CodedotAl / gpt-code-clippy

Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57
Apache License 2.0
3.29k stars · 221 forks

**Code Datasets** #2

Closed ncoop57 closed 3 years ago

ncoop57 commented 3 years ago
jlvvlj commented 3 years ago

Datasets considered : https://docs.google.com/spreadsheets/d/1eOVxhLwksOJ51PbiMxzogc-sWu6NylEVPdmEPcbxu28/edit#gid=0

bentrevett commented 3 years ago

Here's a bunch of "code" datasets I've been collecting the links to:

A lot of these datasets contain only isolated methods, which seems to be a popular format in machine learning for code. Unfortunately, many have their docstrings removed, which is not what you'd want for a Copilot clone.

However, if enough of these datasets are appended together, you might get something similar to The Pile, which might be sufficient. The other option is just scraping GitHub, while checking the licence of each repo to avoid the issues Copilot is raising at the moment.

neubig commented 3 years ago

These are great lists of datasets, I think a lot of them could be useful for evaluation.

For the main training I'd probably suggest just scraping GitHub and Stack Overflow, though. I believe some of my collaborators at CMU use GitTorrent for scraping, and it might not be too hard to get a huge corpus of repos once we decide what our inclusion criteria are: https://github.com/cjb/GitTorrent

Some examples of inclusion criteria might include:

  1. Popularity/stars
  2. Language
  3. License
ncoop57 commented 3 years ago

What's a good time for us to meet? I'm in EST, and I generally won't be able to work on this until the evening during the weekdays.

ncoop57 commented 3 years ago

> These are great lists of datasets, I think a lot of them could be useful for evaluation.
>
> For the main training I'd probably suggest just scraping GitHub and Stack Overflow, though. I believe some of my collaborators at CMU use GitTorrent for scraping, and it might not be too hard to get a huge corpus of repos once we decide what our inclusion criteria are: https://github.com/cjb/GitTorrent
>
> Some examples of inclusion criteria might include:
>
>   1. Popularity/stars
>   2. Language
>   3. License

Additional criteria could be time since last commit, number of contributors, and usage of automated tests (this last one is probably not possible to easily detect)
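The criteria above can be sketched as a simple filter over repo metadata. A minimal sketch follows; the record schema, field names, and thresholds are all assumptions for illustration, not what GitTorrent or the GitHub API actually returns.

```python
from datetime import datetime, timezone

# Hypothetical repo metadata records, as they might come back from the
# GitHub API or a GitTorrent-style dump (field names are assumptions).
REPOS = [
    {"name": "good/repo", "stars": 1200, "language": "Python",
     "license": "mit", "last_commit": "2021-06-01", "contributors": 14},
    {"name": "stale/repo", "stars": 5000, "language": "Python",
     "license": "mit", "last_commit": "2015-01-01", "contributors": 2},
    {"name": "gpl/repo", "stars": 900, "language": "Python",
     "license": "gpl-3.0", "last_commit": "2021-05-20", "contributors": 8},
]

ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}  # permissive only
ALLOWED_LANGUAGES = {"Python", "JavaScript"}

def passes_criteria(repo, min_stars=100, max_age_days=365, min_contributors=3,
                    now=datetime(2021, 7, 1, tzinfo=timezone.utc)):
    """Apply the inclusion criteria discussed above to one repo record."""
    last = datetime.fromisoformat(repo["last_commit"]).replace(tzinfo=timezone.utc)
    return (repo["stars"] >= min_stars
            and repo["language"] in ALLOWED_LANGUAGES
            and repo["license"] in ALLOWED_LICENSES
            and (now - last).days <= max_age_days  # time since last commit
            and repo["contributors"] >= min_contributors)

selected = [r["name"] for r in REPOS if passes_criteria(r)]
print(selected)  # only "good/repo" meets every criterion
```

Automated-test detection is indeed harder; a rough proxy might be checking for a `tests/` directory or a CI config file in the repo tree.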

neubig commented 3 years ago

I'm also in Eastern time and prefer to meet during the day if possible, but I can also make a brief evening meeting work.

ncoop57 commented 3 years ago

Since the meeting today will be the most important, I can be a bit more lenient and do it during the day if needed.

@neubig could you create a When2meet or Doodle poll to figure out the time today? I'm on my phone, so it's difficult for me to make one.

reshinthadithyan commented 3 years ago

I'll make the when2meet poll. What is the preferred time zone?

ncoop57 commented 3 years ago

> I'll make the when2meet poll. What is the preferred time zone?

EST, since both @neubig and I are in that time zone.

reshinthadithyan commented 3 years ago

https://www.when2meet.com/?12238619-choe4 Please fill this out.

taisazero commented 3 years ago

> I'll make the when2meet poll. What is the preferred time zone?
>
> EST, since both @neubig and I are in that time zone.

I'm also in EST. I also generally work on this in the evenings and nights during weekdays. Friday morning is an exception. I'll fill out the when2meet!

ncoop57 commented 3 years ago

Looks like we can do the meeting at 10:30/now. I'll create a voice channel on Discord.

https://discord.gg/7R4YkXZH

ncoop57 commented 3 years ago

These datasets look awesome! Could you please add them to this spreadsheet? We are consolidating them there: https://docs.google.com/spreadsheets/d/1eOVxhLwksOJ51PbiMxzogc-sWu6NylEVPdmEPcbxu28/edit?usp=sharing

@bentrevett

neubig commented 3 years ago

Note: the GHTorrent project is a good way to efficiently retrieve the info of many GitHub repos: https://ghtorrent.org/
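GHTorrent ships its metadata as relational dumps, so candidate repos can be filtered offline before any cloning. A minimal sketch, assuming a GHTorrent-style `projects` CSV; the column names here are assumptions for illustration and should be checked against the actual dump schema:

```python
import csv
import io

# A tiny stand-in for a GHTorrent-style projects dump; real dumps are large
# CSV files, and these column names are assumptions for illustration.
dump = io.StringIO(
    "id,url,language,deleted,forked_from\n"
    "1,https://api.github.com/repos/a/alpha,Python,0,\n"
    "2,https://api.github.com/repos/b/beta,Java,0,1\n"
    "3,https://api.github.com/repos/c/gamma,Python,1,\n"
)

# Keep only live, non-fork Python projects -- forks are excluded so the
# same codebase isn't collected many times over.
keep = [
    row["url"]
    for row in csv.DictReader(dump)
    if row["language"] == "Python"
    and row["deleted"] == "0"
    and not row["forked_from"]
]
print(keep)  # only a/alpha survives all three filters
```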

bentrevett commented 3 years ago

Another thing to add is checking for duplicate code. From https://dl.acm.org/doi/10.1145/3133908: "This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. In other words, 70% of the code on GitHub consists of clones of previously created files."

Relevant paper on the "adverse effects of code duplication in machine learning models on code": https://arxiv.org/abs/1812.06469

The tool that comes with the above paper: https://github.com/Microsoft/near-duplicate-code-detector
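The core mechanic behind such tools can be sketched in a few lines. This is not the Microsoft tool's actual algorithm (which operates on token multisets over lexed files with tuned thresholds); it's a minimal sketch of the same idea, token-based Jaccard similarity, using simple sets:

```python
import re

def token_set(code: str) -> frozenset:
    """Lex a snippet into a set of identifier and number tokens."""
    return frozenset(re.findall(r"[A-Za-z_]\w*|\d+", code))

def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: |intersection| / |union| of the token sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Two near-duplicate snippets (only parameter names renamed) and one
# unrelated snippet for comparison.
orig = "def add(a, b):\n    return a + b"
clone = "def add(x, y):\n    return x + y"
other = "import os\nprint(os.getcwd())"

t_orig, t_clone, t_other = map(token_set, (orig, clone, other))
print(round(jaccard(t_orig, t_clone), 2))  # shared structure tokens survive renaming
print(round(jaccard(t_orig, t_other), 2))  # unrelated snippets share no tokens
```

A real deduplication pass would compare whole files, use multisets rather than sets, and flag pairs above some similarity threshold as near-duplicates to be dropped from the training set.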

zitterbewegung commented 3 years ago

Note: we would also want the ability to do a reverse search of the code datasets for software derived from them, so that if the model returns a large chunk of code that is under a specific license such as the GPL, we can disallow that output, since it might infringe copyright. This would also help us make sure we aren't overfitting.
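One simple way to sketch this check: fingerprint every file from restrictively licensed training repos, then test model outputs against that index before returning them. Everything here is a hypothetical illustration (the index contents, the normalization scheme), and it only catches verbatim copies; near-duplicate matching would be needed for lightly edited regurgitations.

```python
import hashlib
import re

def fingerprint(code: str) -> str:
    """Hash a whitespace-normalized snippet so formatting changes don't matter."""
    normalized = re.sub(r"\s+", " ", code).strip()
    return hashlib.sha256(normalized.encode()).hexdigest()

# Hypothetical index: fingerprints of files from GPL-licensed training repos.
gpl_index = {fingerprint("int main(void) { return 0; }")}

def is_blocked(model_output: str) -> bool:
    """Suppress outputs that verbatim-match the restrictively licensed corpus."""
    return fingerprint(model_output) in gpl_index

print(is_blocked("int main(void)   {\n  return 0;\n}"))  # True: matches after normalization
print(is_blocked("int main(void) { return 1; }"))        # False: different constant
```

As noted, frequent hits against such an index would also be a useful signal that the model is memorizing its training data rather than generalizing.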