UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Question about licensing #58

Closed matanox closed 2 years ago

matanox commented 6 years ago

Hi,

As I see here, this treebank is the property of Stanford University and licensed under CC-BY-SA. Does any of the people involved happen to know whether Stanford University would consider the use of a parser built using this treebank a violation of the CC-BY-SA terms?

Under the fair use doctrine (superficially illustrated here), a use that is transformative is allowed, though I'm not aware of any precedent applying to this case. However, this raises the question of whether the treebanks were originally intended for automated parsing as we have it today. It also raises the question of whether Stanford University has or had any plans to monetize the treebanks or parsers built on top of them.

Clearly, software building a statistical model from a copyrighted work was not the most obvious scenario that copyright law was designed for.

Otherwise, I suppose we are more or less blocked in practice from using any modern parser in a commercial application: as you'd likely agree, the amount of annotated data needed to train a parser is, under most known machine learning regimes, insatiable, and therefore impractical to produce from scratch.

Would anyone be able to say whether Stanford has so far made an official statement or adopted a policy about this?

Thanks, Matan

jnivre commented 6 years ago

I am not a representative of Stanford, so I cannot answer the specific question about this treebank. However, my understanding in general is that parser training constitutes fair use of a treebank with the CC-BY-SA license. Some UD treebanks have the NC restriction, in which case parsers can be trained and used for research and education but not for commercial exploitation.

dseddah commented 6 years ago

Hi Joakim and Matan, I always thought that the UD English treebank was under the same licence as its original source (https://catalog.ldc.upenn.edu/LDC2012T13), so it was my understanding that a company could use it provided it took a license from the LDC ($175, if I checked correctly). Or is it the case that the tokens are under a specific license while the annotations are subject to another (cf. the Arabic NYU treebank, the French FTB, one of the Japanese TBs)?

Djamé

dan-zeman commented 6 years ago

If the underlying text was provided with some restrictions, then these restrictions often also apply to the derivative work. This is the case for Arabic-NYUAD, French-FTB and Japanese-KTC, as I understand it. If the underlying text is freely available and redistributable, then presumably the LDC restrictions apply only to the added value (annotation) and you can still freely distribute the same text with your own annotation. I have not checked, though, whether this applies to the English Web Treebank, but I assume this would explain why it is available under CC.

To the original question: If you have permission to use a copyrighted text, and you train a parsing model (or any other model) using that text (provided the license agreement did not specifically ban you from using the corpus for this particular purpose), then I think you can freely distribute the model and let anyone use it. The users cannot reconstruct the original work from your model, nor can they get its translation or any other derived text of any artistic value. The model is a large file of short strings and numbers, so what? If you could not do this, then you possibly also could not buy a textbook of Japanese, use it to learn Japanese, and then use the knowledge you acquired to, say, guide Japanese tourists through your town.

manning commented 6 years ago

Hi @matanster, sorry for the super-duper slow reply. I am the faculty member at Stanford who coordinated the production of UD_English-EWT.

However, I am not a lawyer and Stanford doesn’t authorize me to give legal advice. Stanford University has not made any official statement about this treebank and I don't think I could get one out of them. Stanford's official position is that it doesn’t give legal advice to others, and you should consult your own lawyer.

Notwithstanding all that, my understanding is that you are completely free to use the data to train and distribute parsers as you wish. I think you really don't need to worry about Stanford getting upset.

tl;dr:

As a matter of policy, Stanford as an institution has strongly supported both open sharing and fair use but also the rights of authors and composers. You can find a quite active website here: https://fairuse.stanford.edu/ .

FWIW, I'm not sure you're actually framing all the legal questions the best way. AFAICS, the CC BY-SA 4.0 license gives you the right to use the treebank as you wish, and so there is no problem, since CC licenses are specifically designed to allow copyright holders to grant rights broadly. Secondly, I suspect that parser model files are not copyrightable works anyway, and so again there is no problem with building parsers from this treebank. Conversely, although, again, I am not a lawyer, I would be very skeptical that an attempt to argue fair use would work here: since you're using most or all of the treebank, and the treebank was constructed to enable language technology applications, it does not seem to me that an argument under fair use could prevail. But I'm not a lawyer.

manning commented 6 years ago

@dseddah: No, this isn't right. The UD_English-EWT treebank isn't an LDC-licensed treebank. (Among other things, the LDC license prohibits distribution by licensees, even to people with a valid LDC license, and so we could not have the treebank on the UD website under those terms.)

More precisely, our work and licensing cover the annotations and database assembly, not the underlying texts. What is their status? I will admit that this is a little murky. As I understand it, the texts were gathered by Google from various sources and provided to LDC. Some are public domain (such as the Enron emails), while others may be copyrighted either by the original authors or by some web company, if its terms of service asserted a copyright transfer. How carefully was collection done with respect to copyright? No info. The EWT treebank has a vague statement of copyright, which we reproduced. The LDC gave us their "no objection" to our using the underlying texts, while not giving us any guarantee of status. However, I feel that with respect to the original texts we're nevertheless on pretty safe ground, and an assertion of fair use would work here, since, with respect to the original texts, our character of use is transformative and non-commercial, doesn't displace uses of the original works, and the treebank only contains a limited amount of material from each source.

Still not a lawyer … but someone who has spent too much time on copyright and open source over the last 20 years…. (Also, agree with everything @dan-zeman wrote.)

dseddah commented 6 years ago

Thank you @manning and @dan-zeman for these clarifications.

If the key here seems to be the ability to reproduce the original text, do you think that providing the copyrighted text in a transformed form could work (for example, replacing all tokens with Brown clusters or disinflected forms)?

nschneid commented 6 years ago

When I first built the STREUSLE corpus, which adds semantic annotations to a portion of EWT, I obtained permission from LDC and Google to redistribute the sentences and POS tags. If they're OK with my redistributing it under an open license, they're probably OK with the UD project doing the same.