explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Licensing of Norwegian spaCy model #4280

Closed miktoki closed 5 years ago

miktoki commented 5 years ago

Referring to #3082, I have been in contact with the legal owner of noWaC, the dataset that @jarib and I wish to base a Norwegian spaCy model on. They are unable to change the license of the corpus itself due to Norwegian law (see the explanation in section 1.2 of the noWaC paper), but they are willing to set up a special agreement that would allow us to use the corpus to generate the language model. They have a standard contract available, in which we could set the cost of the license to 0 and lay out the specific terms in Appendix 2. I also think we should discuss the duration of the contract with them further.

As pointed out by @ines here, the language models could count as derivative works based on the corpus.

@honnibal @ines, what do you need the terms of the contract to include? Also, is it sufficient to have the agreement in Norwegian? Do you perhaps have a template we could base the contract on? Alternatively, if their contract is used, what should be included in Appendix 2?

jarib commented 5 years ago

The standard contract linked above is in English, so the question of an agreement in Norwegian is not really relevant.

The use of NoWaC in spaCy would be for pre-training to improve the models. The results from pre-training on a smaller corpus (not NoWaC, which is much larger) were quite promising.

honnibal commented 5 years ago

spaCy doesn't need to distribute the NoWaC corpus, so in the short term the question is really whether you guys feel the license you have enables you to issue us an MIT license to the vectors you've trained. I have nb_core_web_sm and nb_core_web_md models trained without the pretraining, scheduled for release with v2.2.

We'd like to replicate the vectors and add pretraining to the pipeline, so I guess for that we'll need a license to the corpora (or to use equivalent ones instead). But for the immediate release it shouldn't be necessary.

jarib commented 5 years ago

The word vectors I used to build my nb models already have an acceptable CC license.

The agreement suggested in this issue is only relevant if we want to improve the models by pre-training with the NoWaC corpus.

ines commented 5 years ago

@jarib Sounds good! The standard contract looks fine to me. Once we're ready to wire up the pretraining, we can discuss the details over email and get this signed 🙂
