CodedotAl / gpt-code-clippy

Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57
Apache License 2.0
3.29k stars · 221 forks

**Code Datasets** #2

Closed ncoop57 closed 3 years ago

ncoop57 commented 3 years ago
jlvvlj commented 3 years ago

Datasets considered : https://docs.google.com/spreadsheets/d/1eOVxhLwksOJ51PbiMxzogc-sWu6NylEVPdmEPcbxu28/edit#gid=0

bentrevett commented 3 years ago

Here's a bunch of "code" datasets I've been collecting the links to:

A lot of these datasets contain only isolated methods, which seems to be a popular format in machine learning for code. Unfortunately, many have their docstrings removed, which is not what you'd want for a Copilot clone.

However, if enough of these datasets are appended together, you might get something similar to The Pile, which might be sufficient. The other option is just scraping GitHub, while checking the licence of each repo to avoid the issues Copilot is raising at the moment.

neubig commented 3 years ago

These are great lists of datasets, I think a lot of them could be useful for evaluation.

For the main training I'd probably suggest just scraping GitHub and Stack Overflow, though. I believe some of my collaborators at CMU use GitTorrent for scraping, and it might not be too hard to get a huge corpus of repos once we decide what our inclusion criteria are: https://github.com/cjb/GitTorrent

Some examples of inclusion criteria might include:

  1. Popularity/stars
  2. Language
  3. License
ncoop57 commented 3 years ago

What's a good time for us to meet? I'm in EST, and I generally won't be able to work on this until the evening during the weekdays.

ncoop57 commented 3 years ago

> These are great lists of datasets, I think a lot of them could be useful for evaluation.
>
> For the main training I'd probably suggest just scraping GitHub and Stack Overflow, though. I believe some of my collaborators at CMU use GitTorrent for scraping, and it might not be too hard to get a huge corpus of repos once we decide what our inclusion criteria are: https://github.com/cjb/GitTorrent
>
> Some examples of inclusion criteria might include:
>
>   1. Popularity/stars
>   2. Language
>   3. License

Additional criteria could be time since last commit, number of contributors, and usage of automated tests (this last one is probably not possible to easily detect)
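The criteria above can be sketched as a simple filter over repo metadata. A minimal sketch follows; the record schema, field names, and thresholds are all assumptions for illustration, not what GitTorrent or the GitHub API actually returns.

```python
from datetime import datetime, timezone

# Hypothetical repo metadata records, as they might come back from the
# GitHub API or a GitTorrent-style dump (field names are assumptions).
REPOS = [
    {"name": "good/repo", "stars": 1200, "language": "Python",
     "license": "mit", "last_commit": "2021-06-01", "contributors": 14},
    {"name": "stale/repo", "stars": 5000, "language": "Python",
     "license": "mit", "last_commit": "2015-01-01", "contributors": 2},
    {"name": "gpl/repo", "stars": 900, "language": "Python",
     "license": "gpl-3.0", "last_commit": "2021-05-20", "contributors": 8},
]

ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}  # permissive only
ALLOWED_LANGUAGES = {"Python", "JavaScript"}

def passes_criteria(repo, min_stars=100, max_age_days=365, min_contributors=3,
                    now=datetime(2021, 7, 1, tzinfo=timezone.utc)):
    """Apply the inclusion criteria discussed above to one repo record."""
    last = datetime.fromisoformat(repo["last_commit"]).replace(tzinfo=timezone.utc)
    return (repo["stars"] >= min_stars
            and repo["language"] in ALLOWED_LANGUAGES
            and repo["license"] in ALLOWED_LICENSES
            and (now - last).days <= max_age_days  # time since last commit
            and repo["contributors"] >= min_contributors)

selected = [r["name"] for r in REPOS if passes_criteria(r)]
print(selected)  # only "good/repo" meets every criterion
```

Automated-test detection is indeed harder; a rough proxy might be checking for a `tests/` directory or a CI config file in the repo tree.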

neubig commented 3 years ago

I'm also in Eastern time and prefer to meet during the day if possible, but I can also make a brief evening meeting work.

ncoop57 commented 3 years ago

Since the meeting today will be the most important, I can be a bit more lenient and do it during the day if needed.

@neubig could you create a When2meet or Doodle poll to figure out the time today? I'm on my phone, so it's difficult for me to make one.

reshinthadithyan commented 3 years ago

I'll make the when2meet poll. What is the preferred time zone?

ncoop57 commented 3 years ago

> I'll make the when2meet poll. What is the preferred time zone?

EST, since both @neubig and I are in that time zone.

reshinthadithyan commented 3 years ago

https://www.when2meet.com/?12238619-choe4 Please fill this out.

taisazero commented 3 years ago

> I'll make the when2meet poll. What is the preferred time zone?
>
> EST, since both @neubig and I are in that time zone.

I'm also in EST. I also generally work on this in the evenings and nights during weekdays. Friday morning is an exception. I'll fill out the when2meet!

ncoop57 commented 3 years ago

Looks like we can do the meeting at 10:30/now. I'll create a voice channel on Discord.

https://discord.gg/7R4YkXZH

ncoop57 commented 3 years ago

These datasets look awesome! Could you please add them to this spreadsheet? We are consolidating them there: https://docs.google.com/spreadsheets/d/1eOVxhLwksOJ51PbiMxzogc-sWu6NylEVPdmEPcbxu28/edit?usp=sharing

@bentrevett

neubig commented 3 years ago

Note: the GHTorrent project is a good way to efficiently retrieve the info of many GitHub repos: https://ghtorrent.org/
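GHTorrent ships its metadata as relational dumps, so candidate repos can be filtered offline before any cloning. A minimal sketch, assuming a GHTorrent-style `projects` CSV; the column names here are assumptions for illustration and should be checked against the actual dump schema:

```python
import csv
import io

# A tiny stand-in for a GHTorrent-style projects dump; real dumps are large
# CSV files, and these column names are assumptions for illustration.
dump = io.StringIO(
    "id,url,language,deleted,forked_from\n"
    "1,https://api.github.com/repos/a/alpha,Python,0,\n"
    "2,https://api.github.com/repos/b/beta,Java,0,1\n"
    "3,https://api.github.com/repos/c/gamma,Python,1,\n"
)

# Keep only live, non-fork Python projects -- forks are excluded so the
# same codebase isn't collected many times over.
keep = [
    row["url"]
    for row in csv.DictReader(dump)
    if row["language"] == "Python"
    and row["deleted"] == "0"
    and not row["forked_from"]
]
print(keep)  # only a/alpha survives all three filters
```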

bentrevett commented 3 years ago

Another thing to add is checking for duplicate code. From https://dl.acm.org/doi/10.1145/3133908: "This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. In other words, 70% of the code on GitHub consists of clones of previously created files."

Relevant paper on the "adverse effects of code duplication in machine learning models on code": https://arxiv.org/abs/1812.06469

The tool that comes with the above paper: https://github.com/Microsoft/near-duplicate-code-detector
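The core mechanic behind such tools can be sketched in a few lines. This is not the Microsoft tool's actual algorithm (which operates on token multisets over lexed files with tuned thresholds); it's a minimal sketch of the same idea, token-based Jaccard similarity, using simple sets:

```python
import re

def token_set(code: str) -> frozenset:
    """Lex a snippet into a set of identifier and number tokens."""
    return frozenset(re.findall(r"[A-Za-z_]\w*|\d+", code))

def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: |intersection| / |union| of the token sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Two near-duplicate snippets (only parameter names renamed) and one
# unrelated snippet for comparison.
orig = "def add(a, b):\n    return a + b"
clone = "def add(x, y):\n    return x + y"
other = "import os\nprint(os.getcwd())"

t_orig, t_clone, t_other = map(token_set, (orig, clone, other))
print(round(jaccard(t_orig, t_clone), 2))  # shared structure tokens survive renaming
print(round(jaccard(t_orig, t_other), 2))  # unrelated snippets share no tokens
```

A real deduplication pass would compare whole files, use multisets rather than sets, and flag pairs above some similarity threshold as near-duplicates to be dropped from the training set.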

zitterbewegung commented 3 years ago

Note: we would also want the ability to do a reverse search of the code datasets for software derived from them, so that if the model returns a large chunk of code that is under a specific license such as the GPL, we can disallow that output, since it might infringe copyright. This would also help us make sure we aren't overfitting.
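One simple way to sketch this check: fingerprint every file from restrictively licensed training repos, then test model outputs against that index before returning them. Everything here is a hypothetical illustration (the index contents, the normalization scheme), and it only catches verbatim copies; near-duplicate matching would be needed for lightly edited regurgitations.

```python
import hashlib
import re

def fingerprint(code: str) -> str:
    """Hash a whitespace-normalized snippet so formatting changes don't matter."""
    normalized = re.sub(r"\s+", " ", code).strip()
    return hashlib.sha256(normalized.encode()).hexdigest()

# Hypothetical index: fingerprints of files from GPL-licensed training repos.
gpl_index = {fingerprint("int main(void) { return 0; }")}

def is_blocked(model_output: str) -> bool:
    """Suppress outputs that verbatim-match the restrictively licensed corpus."""
    return fingerprint(model_output) in gpl_index

print(is_blocked("int main(void)   {\n  return 0;\n}"))  # True: matches after normalization
print(is_blocked("int main(void) { return 1; }"))        # False: different constant
```

As noted, frequent hits against such an index would also be a useful signal that the model is memorizing its training data rather than generalizing.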