Relicense the finetuned checkpoints under CC BY-SA

koute commented 1 year ago

The license of the finetuned checkpoints currently makes no sense.

The base model was almost certainly trained on a ton of unlicensed all-rights-reserved data. In particular, the README says that it was trained on a dataset derived from the Pile, which includes ~100GB of commercial (some might say "pirated") ebooks (the Books3 dataset). And yet this model is licensed under CC BY-SA.

The finetuned model was trained on data which is under a less restrictive license (CC BY-NC, which is less restrictive than "all rights reserved") and yet suddenly the model has to follow the license of the data that was used for training?

This makes no sense. If training on unlicensed/all-rights-reserved data and releasing that model under an arbitrary license is OK then training it on less restrictive CC BY-NC data and releasing it under an arbitrary license is OK too. Alternatively, if the model has to follow the license of the data on which it was trained on then the base model has to be taken down as it was trained on all-rights-reserved data for which you had no license.

waleed177 commented 1 year ago

it also makes no sense for me, but i am no lawyer. but hey, i am happy with libre base model c: thank you stabilityAI

mcmonkey4eva commented 1 year ago

(I am not a lawyer, this is not legal advice, consult a real lawyer before making decisions, this is just my personal thought)

Scale matters a lot when considering dataset usage rights. The base model is trained on a massive scale mix of content such that it doesn't really directly retain much content from any one individual source (ie in theory the license doesn't matter much). The finetune is directly on top of a small section of entirely-license-restricted content (ie the model will directly retain information from the licensed content, thus the license must be matched appropriately).

As another way of thinking about it: Imagine an artist/author/whatever human creative. If that person looks at some copyright worked and copies from it directly, they're violating that copyright. However that same person has also been through a lifetime of looking at copyrighted works that have undoubtedly influenced their creative thought, but when they sit down and make something original (a work derivative of the mix of ideas in their head, many of which originate from copyright-restricted works), their new work is not subject to prior copyrights, it is considered their own work.

zoobab commented 1 year ago

CC Non Commercial means it cannot be packaged in Debian, due to the non commercial restriction:

https://github.com/celery/celery/issues/2890

Could you re-release it under a copyleft license if you want users that modify it to republish their changes?

And what is the dataset used the training?

mcmonkey4eva commented 1 year ago

@zoobab View the readme @ https://github.com/Stability-AI/StableLM#models for dataset info. More detail will be published soon.

I don't think a ten gig+ model file is fit to be packaged natively into Debian anyway? The actual relevant source code to run LLMs is separately maintained and separately licensed. It's just the models that have license info in this repository, and it's only the Instruct-finetune that's non-commercial, which has to be licensed that way due to the dataset used for the Instruct finetuning.

Future revisions of the instruct-finetune might use a different dataset and thus have a different license.

Stability-AI / StableLM

Relicense the finetuned checkpoints under CC BY-SA #33