Closed ncoop57 closed 3 years ago
Here's a bunch of "code" datasets I've been collecting the links to:
A lot of these are just methods, which seems to be popular in machine learning for code. Unfortunately, many of them have their docstrings removed, which is not what you'd want for a Copilot clone.
However, if enough of these datasets are combined, you might get something similar to The Pile, which might be sufficient. The other option is just scraping GitHub, while checking the licence of each repo to avoid the issues that Copilot is raising at the moment.
What's a good time for us to meet? I'm on EST and I generally won't be able to work on this until the evening on weekdays.
These are great lists of datasets; I think a lot of them could be useful for evaluation.
For the main training, I'd probably suggest just scraping GitHub and Stack Overflow though. I believe some of my collaborators at CMU use GitTorrent to do scraping, and it might not be too hard to get a huge corpus of repos once we decide what our inclusion criteria are: https://github.com/cjb/GitTorrent
Some examples of inclusion criteria might include:
- Popularity/stars
- Language
- License
Additional criteria could be time since last commit, number of contributors, and use of automated tests (this last one is probably not easy to detect).
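To make the criteria above concrete, here is a minimal sketch of a repo filter. The `RepoInfo` fields, the thresholds, and the license allow-list are all hypothetical placeholders, not anything we have agreed on; the metadata would have to come from whatever scraping source we end up using.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical allow-list of license identifiers (lowercase); adjust as needed.
ALLOWED_LICENSES = frozenset({"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"})


@dataclass
class RepoInfo:
    """Illustrative metadata record; field names are assumptions."""
    name: str
    stars: int
    language: str
    license_id: Optional[str]
    last_commit: datetime
    contributors: int


def passes_criteria(
    repo: RepoInfo,
    now: datetime,
    min_stars: int = 50,
    languages: frozenset = frozenset({"Python"}),
    allowed_licenses: frozenset = ALLOWED_LICENSES,
    max_idle: timedelta = timedelta(days=730),
    min_contributors: int = 2,
) -> bool:
    """Return True if the repo satisfies every inclusion criterion."""
    if repo.stars < min_stars:
        return False
    if repo.language not in languages:
        return False
    # Repos without a detectable license are excluded to stay on the safe side.
    if repo.license_id is None or repo.license_id.lower() not in allowed_licenses:
        return False
    if now - repo.last_commit > max_idle:
        return False
    return repo.contributors >= min_contributors
```

Passing `now` explicitly keeps the filter reproducible, so the same snapshot of metadata always yields the same set of repos.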
I'm also in Eastern time and prefer to meet during the day if possible, but can also make a brief meeting during evenings work.
Since the meeting today will be the most important one, I can be a bit more lenient and do it during the day if needed.
@neubig could you create a when2meet or doodle poll to figure out the time today? I'm on my phone so it's difficult for me to make it
I'll make the when2meet poll. What is the preferred time zone?
EST, since both @neubig and I are in that time zone.
https://www.when2meet.com/?12238619-choe4 Please fill this out.
I'm also in EST. I also generally work on this in the evenings and nights during weekdays. Friday morning is an exception. I'll fill out the when2meet!
Looks like we can do the meeting at 10:30/now. I'll create a voice channel on Discord.
These datasets look awesome. Could you please add them to this spreadsheet? We are consolidating them there: https://docs.google.com/spreadsheets/d/1eOVxhLwksOJ51PbiMxzogc-sWu6NylEVPdmEPcbxu28/edit?usp=sharing
@bentrevett
Note: the GHTorrent project is a good way to efficiently retrieve the info of many GitHub repos: https://ghtorrent.org/
Another thing to add is checking for duplicate code, from https://dl.acm.org/doi/10.1145/3133908: "This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. In other words, 70% of the code on GitHub consists of clones of previously created files."
Relevant paper on the "adverse effects of code duplication in machine learning models on code": https://arxiv.org/abs/1812.06469
The tool that comes with the above paper: https://github.com/Microsoft/near-duplicate-code-detector
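For a rough idea of what a duplicate check could look like, here is a minimal sketch. It is not the Microsoft tool's implementation: it drops exact duplicates by hashing whitespace-normalized content, then drops near-duplicates using Jaccard similarity over token sets. The tokenizer regex and the 0.8 threshold are assumptions for illustration, and the pairwise comparison is quadratic, so a real corpus would need something smarter.

```python
import hashlib
import re

# Crude language-agnostic tokenizer: identifiers, numbers, single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def exact_key(source: str) -> str:
    """Hash of the source with trailing whitespace and outer blank lines removed."""
    normalized = "\n".join(line.rstrip() for line in source.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two sources."""
    set_a, set_b = set(TOKEN_RE.findall(a)), set(TOKEN_RE.findall(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(files, threshold: float = 0.8):
    """Keep the first occurrence of each file; drop exact and near duplicates.

    `files` is an iterable of (path, source) pairs.
    """
    kept = []
    seen_hashes = set()
    for path, source in files:
        key = exact_key(source)
        if key in seen_hashes:
            continue  # exact duplicate up to whitespace
        if any(jaccard(source, other) >= threshold for _, other in kept):
            continue  # near duplicate of a file we already kept
        seen_hashes.add(key)
        kept.append((path, source))
    return kept
```

Token-set similarity ignores ordering, so it will over-merge very short files; a minimum file length would probably be needed in practice.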
Note: we would also want the ability to do a reverse search of the code datasets, so that we can tell when the model's output is derived from existing software. If an output reproduces a large amount of code that is under a specific license such as the GPL, we should disallow that output, since it might infringe copyright. This would also help us make sure we aren't overfitting.
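One way the reverse search could work, as a minimal sketch: index token n-grams of the training corpus by license, then measure what fraction of a generated snippet's n-grams appear in files under a blocked license. The function names, the n-gram size of 5, the 0.5 threshold, and the license identifiers are all assumptions for illustration.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def ngrams(source: str, n: int = 5) -> set:
    """Set of token n-grams in a source string."""
    tokens = TOKEN_RE.findall(source)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus, n: int = 5):
    """Map each n-gram to the set of licenses it appears under.

    `corpus` is an iterable of (license_id, source) pairs.
    """
    index = defaultdict(set)
    for license_id, source in corpus:
        for gram in ngrams(source, n):
            index[gram].add(license_id)
    return index


def license_overlap(output: str, index, n: int = 5) -> dict:
    """Fraction of the output's n-grams found in the corpus, per license."""
    grams = ngrams(output, n)
    if not grams:
        return {}
    counts = defaultdict(int)
    for gram in grams:
        for license_id in index.get(gram, ()):
            counts[license_id] += 1
    return {lic: count / len(grams) for lic, count in counts.items()}


def should_block(output: str, index, blocked=frozenset({"gpl-3.0"}),
                 threshold: float = 0.5, n: int = 5) -> bool:
    """True if too much of the output matches code under a blocked license."""
    overlap = license_overlap(output, index, n)
    return any(overlap.get(lic, 0.0) >= threshold for lic in blocked)
```

The same overlap score, computed against the whole training set regardless of license, could double as a rough memorization check.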