allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

How to deal with the copyright issue? #24

Open shizhediao opened 4 years ago

shizhediao commented 4 years ago

Hi, thanks for your great dataset, which definitely speeds up scientific research! As a fan and user of your dataset, I am really curious how you deal with copyright issues:

  1. Do you have the right to distribute the submitted articles?
  2. As a user of the dataset, do I have the right to redistribute it? For example, if I apply a further processing step designed for some research task on top of your dataset, could I distribute the result to other people? Thanks!

kyleclo commented 3 years ago

Hi @shizhediao, we already discussed this over email; just copying my response here for others:

Copyright is pretty tricky! We consulted with a lawyer about this for a long time, and ultimately decided that releasing this under CC BY-NC 2.0 (https://github.com/allenai/s2orc/blob/master/README.md#license) is safe. There are a variety of factors in our favor here. We're only releasing full-text data that's derived from open-access papers. We're only allowing S2ORC to be used non-commercially. And the S2ORC text isn't really usable for direct consumption of the papers (i.e. reading the paper like a PDF) and doesn't contain a lot of the content necessary to read the paper (e.g. visual layout, figures, etc.), so we can likely argue that this falls under fair use for research.

Please take a look at the license, which should explain what you can and can't do with S2ORC and its derivations with respect to redistribution. In short, yes: what we're hoping for is that researchers will use S2ORC as a "meta" corpus to derive further task-specific NLP datasets that they can distribute.