bigcode-project / starcoder

Home of StarCoder: fine-tuning & inference!
Apache License 2.0
7.25k stars, 514 forks

Removal request & notice: permissive licensing might often still be unsuitable(!) for training set inclusion #160

Open ell1e opened 5 months ago

ell1e commented 5 months ago

I'd just like you to know that code under permissive licenses with attribution requirements is possibly unsuitable for training set inclusion. I'm bringing this to your attention not as a lawyer, but as a maintainer. Ask your own counsel. However, attribution requirements usually mean that derivatives must retain attribution to the original author. LLMs are apparently well known to occasionally spit out exact derivatives without satisfying attribution requirements, which suggests this practice could be illegal.

I therefore request that you, at the very least, process opt-out requests retroactively for pre-existing datasets to fix this. However, just to stress this again, I'm not a lawyer and this isn't legal advice. But at least from the outside, this looks troubling.

For example, it appears you included repositories of mine that have attribution requirements:

[Screenshot: Screenshot_20240404_155403]

I don't understand how StarCoder would possibly satisfy them.

lvwerra commented 4 months ago

Hi @ell1e, thanks for pointing this out. We worked on a set of tools that should allow users to properly attribute sources if the model generates verbatim copies from the training dataset:

For more info see section 8 of the paper. Hope this helps!

ell1e commented 4 months ago

I assume this would need to be integrated directly into the usual query mechanism for people to actually use it regularly. Is that the case with the current autocomplete plugins, or wherever this is commonly used?

Also, to my knowledge (and I'm not at all a lawyer), copyright law doesn't apply just to verbatim copies but to any notable derivative, as long as it is still somewhat "clearly" related in the eyes of a normal human, and/or is more vaguely a derivative but could still be considered a substitute, or something like that. (Don't ask me how the exact rules work, but I don't think it's just verbatim copies.) How do you deal with that?

As long as there is no good answer for how all of that is handled out of the box, in a somewhat reliable way, for most actual users of StarCoder, I suggest you keep honoring opt-out requests more aggressively.