acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

License and copyright #3317

Open mjpost opened 1 month ago

mjpost commented 1 month ago

I would like to discuss the design of a big project whose goal is to identify the copyright and license of every paper in the Anthology. This will make explicit something that is currently a bit of a gray area, and will also help us adopt good practices going forward.

At the very least, I suggest adding an optional paper-level <license> and <copyright> tag. The license tag should name and also link to the license for the paper, and the copyright tag should identify the owner of the file.

For example, for almost all ACL papers, the copyright is transferred to us, and so the copyright holder is "Association for Computational Linguistics". We release materials under a CC-BY license, so we would display and link to that. Historically, there have been a few exceptions: for example, papers from the NRC Canada group were unable to sign over the copyright, since it belongs to the Crown, in which case we would note that.
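
A minimal sketch of how paper-level tags along these lines might look and be read. Everything specific here is an assumption made for illustration: the `url` attribute on `<license>`, the "CC BY 4.0" value, the sample paper, and the `read_rights` helper are hypothetical and not part of the current Anthology schema or tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the <license> and <copyright> tags and the url
# attribute are illustrative assumptions, not the current schema.
PAPER_XML = """
<paper id="1">
  <title>An Example Paper</title>
  <license url="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</license>
  <copyright>Association for Computational Linguistics</copyright>
</paper>
"""


def read_rights(paper):
    """Return (license name, license URL, copyright holder).

    Any value whose tag is absent comes back as None, since both tags
    are meant to be optional.
    """
    lic = paper.find("license")
    cpr = paper.find("copyright")
    name = lic.text.strip() if lic is not None and lic.text else None
    url = lic.get("url") if lic is not None else None
    holder = cpr.text.strip() if cpr is not None and cpr.text else None
    return name, url, holder


paper = ET.fromstring(PAPER_XML)
print(read_rights(paper))
```

Keeping the license name as element text and the link as an attribute is only one possible layout; it lets a site template render the name as the anchor text and the attribute as the link target.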

This project would remove a lot of ambiguity and cut down on the occasional requests for information that I receive. It is also forward-looking in an age of web crawling for the purpose of training computer systems. If done correctly (e.g., via a selection form at final-paper submission time), it need not add too much work.

I welcome comments and discussion.

mbollmann commented 1 month ago

I would prefer to set this information at the volume level, and only override on paper level when necessary. I assume that 99% of papers in any given volume will have the same licensing and copyright.
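
A rough sketch of that volume-level default with a paper-level override, assuming the default lives under the volume's `<meta>` block. The tag placement, the sample values, and the `resolve` helper are hypothetical and only meant to illustrate the lookup order.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: a volume-wide default under <meta>, overridden by
# a tag of the same name on an individual <paper>. Names are assumptions.
VOLUME_XML = """
<volume id="1">
  <meta>
    <license url="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</license>
    <copyright>Association for Computational Linguistics</copyright>
  </meta>
  <paper id="1"><title>Typical paper</title></paper>
  <paper id="2">
    <title>Paper with its own copyright holder</title>
    <copyright>Crown copyright</copyright>
  </paper>
</volume>
"""


def resolve(volume, paper, tag):
    """Return the paper's own value for `tag`, else the volume default, else None."""
    for scope in (paper, volume.find("meta")):
        if scope is None:
            continue
        node = scope.find(tag)
        if node is not None and node.text:
            return node.text.strip()
    return None


volume = ET.fromstring(VOLUME_XML)
rights = {
    paper.get("id"): (
        resolve(volume, paper, "license"),
        resolve(volume, paper, "copyright"),
    )
    for paper in volume.findall("paper")
}
print(rights)
```

With this lookup order, paper 2 overrides only the copyright holder and still inherits the volume's license, so the XML only needs extra tags for the rare exceptions.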

Adding an explicit license tag seems like a good idea in any case.

I'm indifferent to the copyright information, mainly because it's such a country-specific concept AFAIK, but I would record it the same way as the license.