How to deal with the copyright issue?

Hi @shizhediao, we already discussed this over email; just copying my response here for others:

Copyright is pretty tricky! We consulted with a lawyer about this for a long time and ultimately decided that releasing S2ORC under CC BY-NC 2.0 (https://github.com/allenai/s2orc/blob/master/README.md#license) is safe. Several factors work in our favor here. We're only releasing full-text data derived from open-access papers. We're only allowing non-commercial use of S2ORC. And the S2ORC text isn't really usable for direct consumption of the papers (i.e. reading a paper the way you would a PDF), since it lacks much of the content needed to read them (e.g. visual layout, figures), so we can likely argue that this falls under fair use for research.

Please take a look at the license, which should explain what you can and can't do with S2ORC and its derivatives with respect to redistribution. In short, yes: what we're hoping for is that researchers will use S2ORC as a "meta" corpus from which to derive further task-specific NLP datasets that they can distribute.
