CodedotAl / gpt-code-clippy

Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57
Apache License 2.0

Things That Could Go Wrong #7

Open ncoop57 opened 3 years ago

ncoop57 commented 3 years ago

Hi y'all, I'd like to make sure we do plenty of brainstorming on where things can go wrong in terms of ethical concerns. I don't want our field to repeat the problems other AI fields have had, such as biases and too little discussion of limitations. So please use this issue to discuss anything that could go wrong! (We also have an internal Discord channel where we discuss this less formally, and I will periodically synthesize those discussions here.) Here are a few things that have already been discussed:

  1. Vulnerabilities being inserted into completions
  2. Licensing issues
  3. Automating developers out of a job

urialon commented 3 years ago

Hi @ncoop57, great idea!

I think we don't really need to worry about vulnerabilities (1). I am not sure we can filter out vulnerabilities in an easy way, and from my experience, I would expect GitHub code not to contain any obvious vulnerabilities.

StackOverflow code, however, is sometimes more problematic (because users provide code snippets that demonstrate one point, and sometimes neglect other points).

Licensing (2) is an important issue, as discussed on Twitter yesterday: https://twitter.com/hardmaru/status/1410219477992558595?s=21 (it seems that Microsoft trained their model on GPL code but did not release the model, which is a violation of the license). But since we're going to release everything, we are less restricted than Microsoft.

(3) is an important point, but I don't think it's a realistic concern at this time.

ncoop57 commented 3 years ago

Another thing that came up in our discussion was whether repository owners should be able to opt out of having their repository included in the training of our open source Copilot.

A good point was made that, since each repository's creator has already licensed it for use, we shouldn't have to ask for permission. Also, anyone can download a public repository from GitHub without even having an account, which would be an even bigger threat than us training a model on it.

One argument for still allowing owners to opt out is that, much as with the EULAs that social networks have users sign, repository owners might never have thought their code would be used for such a purpose. Had they known, they might have made it private or chosen a different license that forbids such usage.

If we wanted to let owners opt out, I think we could do the following: once we have collected all the repositories, open an issue in each one asking whether the owner/admin wants to opt out and have the repository removed from the candidate pool. I think this could be automated, and we could set the date we start training the model as the deadline for the owner/admin to respond.
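
The opt-out flow described above could be sketched roughly as follows. This is only an illustration, not code from the project: it assumes the GitHub REST endpoint for creating an issue (`POST /repos/{owner}/{repo}/issues`) and a personal access token, and the issue title, body wording, and function names are hypothetical.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_opt_out_request(full_name, deadline, token):
    """Build (but do not send) a request that opens an opt-out issue.

    full_name is "owner/repo". The title and body wording are hypothetical.
    """
    payload = {
        "title": "Opt out of gpt-code-clippy training data?",
        "body": (
            "This repository is a candidate for our training set. "
            f"Reply here before {deadline} to have it removed."
        ),
    }
    return urllib.request.Request(
        url=f"{API_ROOT}/repos/{full_name}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )


def open_opt_out_issue(full_name, deadline, token):
    """Send the request and return the URL of the created issue."""
    request = build_opt_out_request(full_name, deadline, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["html_url"]
```

Keeping request construction separate from sending makes it possible to dry-run the whole candidate pool and inspect the messages before any issue is actually opened.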

jlvvlj commented 3 years ago

I made a first draft on the topic from concerns expressed here and online: https://docs.google.com/document/d/1Lpjvnc_EB_idZBRcM_DJKopGAJnI9M21nlvYiZ-EyGQ/edit?usp=sharing

ncoop57 commented 3 years ago

Hey y'all, originally I wanted to let developers opt out of having their repos included. However, after further consideration I've decided to forgo that, even though a majority of people in a Twitter poll I made said opting out should be possible. The reasons are:

  1. Building a way for ~700,000 repos to opt out is very difficult, and it would take multiple days just to send out the necessary messages by opening an issue on each repo (13 days if I used a single GitHub API key).
  2. It is difficult to do correctly and error-prone, and with our current timeline I would not have time to test it properly.
  3. Since we are open sourcing our code, data, and model, I believe our work will be better received. I believe most of the anger over the usage of people's data comes from the possibility that GitHub will put the tool behind a paywall, which goes against the GPL licensing agreement.
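
For a sense of scale, the 13-day figure in item 1 can be turned back into a throughput. The thread does not say which rate limit the estimate assumed, so the sketch below only back-computes the hourly rate implied by the two numbers given (~700,000 repos, 13 days); the function names are hypothetical.

```python
def days_to_notify(repo_count, issues_per_hour):
    """Days needed to open one issue per repository at a steady rate."""
    return repo_count / issues_per_hour / 24


def implied_issues_per_hour(repo_count, days):
    """Hourly rate implied by opening repo_count issues within the given days."""
    return repo_count / (days * 24)


rate = implied_issues_per_hour(700_000, 13)
print(round(rate))                         # 2244 issues per hour
print(round(days_to_notify(700_000, rate)))  # 13 days
```

In other words, the estimate corresponds to roughly 2,244 issues per hour sustained around the clock on one key, before any retries or error handling.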

We can probably keep our code under Apache 2.0, but the model weights might need to be put under the GPL. If you disagree with this decision and would like to discuss it more, I am still up for a debate on this.