bigcode-project / starcoder

Home of StarCoder: fine-tuning & inference!
Apache License 2.0

Project Uses Code Under Non-Permissive License to train AI models in violation of these licenses. #9

Open duaneking opened 1 year ago

duaneking commented 1 year ago

People had their work added to the training set without their explicit opt in permission and without their consent.

This means that this entire project stack, as it's called, is stolen code, and that makes the output stolen as well, because you're generating code off of other people's work without their consent and without remunerating them. This is theft. This is a violation of all the licenses.

To make this worse, your website says everything is permissively based when that's not true. You're actively lying to the community about where you got your training set and code from, telling people it's based on public data that was permissible to train on when, in reality, you stole the code and didn't tell the authors. Now you're trying to walk it back and pretend they can opt out, when this entire thing wouldn't exist unless you had stolen our code.

As somebody who was illegally added to that data set without his consent, I opt out. Earlier versions of this project could not exist without my code. And so this project should not exist at all because it only exists due to the theft of our code. This project shouldn't exist in its current form because it's built on stolen property.

Symbolk commented 1 year ago

IMO, this has been a long-running discussion ever since OpenAI made Codex and Copilot profitable products. People have complained, but many paid.

I vote for this project: at least it takes data from open source and uses it for open source. They also provide tools to detect and remove code from the training data, which is better than OpenAI.

duaneking commented 1 year ago

Projects built by open source authors must have their open source license respected for the use to be valid.

And using our code without our consent, for purposes that were not licensed and not agreed to, violates the open source license. I never granted anybody the right to use the code for training bad AIs that leech off the community to make private companies money off my code without me getting anything for it. That right was never granted, and it is not granted by the license I chose. In fact, not a single open source license even mentions the right to train AI, so it's not a right granted.

Open source doesn't mean you can just take the code and do what you want; you have to respect the license terms. That is not happening here.

Combustible commented 1 year ago

For what it's worth - I have several projects on github and according to HuggingFace's stack tool (https://bigcode-in-the-stack.hf.space/), only those with a compatible license (public domain or apache in my case) were included in training this tool. My GPL and AGPL projects were not trained on. For myself at least, my wishes as encoded in those licenses were respected, without me having to do anything.

duaneking commented 1 year ago

That's not the case here, as I had multiple projects that were included without my consent.

They did not respect the licenses.

lvwerra commented 1 year ago

Hi @duaneking

> To make this worse, your website says everything is permissively based when that's not true.

We filtered the whole dataset for permissively licensed code. We followed the Blue Oak Council definition of which licenses are considered permissive. You can find the full list of included licenses here. Naturally, license detection is a complex subject, and it can happen that the licenses of some repositories were misclassified. If so, please let us know so we can improve the pipeline. Can you list the cases where the license was not detected properly?
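As a rough illustration of the kind of filtering described above, here is a minimal sketch. It is not the project's actual pipeline: the `PERMISSIVE` allowlist is a small hypothetical subset of license identifiers, and the repository record shape (`name`, `licenses`) and the `opted_out` parameter are assumptions made for this example.

```python
# Hypothetical allowlist: a few identifiers widely treated as permissive.
# This is NOT the Blue Oak Council list or the list used for The Stack.
PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}


def is_permissive(licenses):
    """Return True only if at least one license was detected and all are allowlisted."""
    normalized = [name.strip().lower() for name in licenses]
    return bool(normalized) and all(name in PERMISSIVE for name in normalized)


def filter_repos(repos, opted_out=frozenset()):
    """Keep repositories that are permissively licensed and not opted out.

    `repos` is assumed to be a list of dicts with "name" and "licenses" keys.
    """
    return [
        repo
        for repo in repos
        if repo["name"] not in opted_out and is_permissive(repo["licenses"])
    ]
```

A repository with no detected license, or with any license outside the allowlist, is dropped in this sketch. A misclassification by the upstream license detector would still pass through, which is the failure mode mentioned above.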

We work both on tools to test if your data is part of the pretraining data as well as an opt-out mechanism for users that want to exclude certain or all repositories from The Stack dataset. We are updating the dataset regularly and require users of the dataset to use the latest version to respect opt-outs:

Copying code, even from permissively licensed repositories, still requires attribution, be it by humans or automated systems. As such, we built tools to help users attribute generated code properly with the following:

The VSCode extension integrates the fast membership test so the user can check if generated code from the model was in the pretraining data and links to the full text search to find which repos contained the code (and their licenses).

Lastly, we also remove PII (such as names, email addresses, keys and passwords) from the dataset before training the model to avoid abuses such as extraction of such information from the model.
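To make the idea of PII removal concrete, here is a minimal regex-based sketch. It is only an illustration under stated assumptions and not the method used for the dataset: the patterns, the `<EMAIL>` and `<SECRET>` placeholders, and the `redact` function are all hypothetical, and simple regexes like these will miss many real cases (names, for example, are not handled at all).

```python
import re

# Hypothetical pattern for email addresses (deliberately simple).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical pattern for quoted values assigned to secret-looking names,
# e.g. password = 'hunter2' or api_key: "abc123".
SECRET_RE = re.compile(
    r"""\b(api_key|secret|password|token)\b(\s*[:=]\s*)(["'])[^"']+\3""",
    re.IGNORECASE,
)


def redact(text):
    """Replace emails and quoted secret values with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    # Keep the variable name, separator, and quotes; replace only the value.
    return SECRET_RE.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{m.group(3)}<SECRET>{m.group(3)}",
        text,
    )
```

For instance, `redact("password = 'hunter2'")` keeps the assignment but swaps the quoted value for `<SECRET>`, so the surrounding code still parses.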

We are actively working on improving the data governance of the project and the tools I mentioned above. If you have constructive ideas for how to improve our tool stack, we are happy to hear them.

nilsdeppe commented 1 year ago

@lvwerra the license link doesn't seem to be working. I get a 404.

duaneking commented 1 year ago

I like that list but I think we are talking about different things.

Just because an open source license is permissive for use by humans does not mean that it is permissive for use by or for AI, as even the laws of the United States Copyright Office explicitly forbid AI from being used to make new inventions.

I have already made an opt-out request, but I'm deeply offended by the need to opt out. From my perspective, I expect humans to be trained on my code, not AI. I never consented to allow an AI to consume, view, collect, or be trained on it. I'm on the side of the humans, not the machines.

IeatToilets commented 1 year ago

For the majority, open source license is free to use by AI. Unless there's a strict rule within the license that states otherwise. It's fairly legal. AI can be used for multiple purposes not only inventions. I highly doubt that an AI itself can invent things. Imagine it as our version of Jarvis from Tony Stark. The AI of today acts as an assistant, and it can be beneficial to everyone.

With that said, I understand your point. I suggest that adding restrictions to the license is a better choice to avoid issues with other creators.

Symbolk commented 1 year ago

> For the majority, open source license is free to use by AI. Unless there's a strict rule within the license that states otherwise. It's fairly legal. AI can be used for multiple purposes not only inventions. I highly doubt that an AI itself can invent things. Imagine it as our version of Jarvis from Tony Stark. The AI of today acts as an assistant, and it can be beneficial to everyone.
>
> With that said, I understand your point. I suggest that adding restrictions to the license is a better choice to avoid issues with other creators.

Maybe the old-school open-source licenses are outdated and should add new clauses in the era of ubiquitous LLMs. That applies not only to models trained solely on open-source data, like StarCoder, but also to general-purpose LLMs like GPT-4.

duaneking commented 1 year ago

> For the majority, open source license is free to use by AI.

Not true. In intellectual property law, as I understand it, rights are expressly given. These licenses do not expressly give the right to train AI, and in fact AI is not mentioned at all, so the right was never granted.

> Unless there's a strict rule within the license that states otherwise.

Some of my code was GPL and AGPL. The viral nature of these licenses requires that code mixed with it also take on that license. If you were saying that everything is now AGPL, I might believe you. Unfortunately, that's not the case here, and that's not the license being used for this project. So the license was violated when it was not respected.

> It's fairly legal.

No, it's not; the GPL is being violated.

> AI can be used for multiple purposes not only inventions.

Not under the license I picked.