edsu / pymarc

process MARC records from Python
http://python.org/pypi/pymarc

v4.0.0? #147

Open edsu opened 4 years ago

edsu commented 4 years ago

Given the recent changes to the leader, I was thinking that a new major release was warranted, v4.0.0. Are there any other major changes we might want to consider as part of this release, or should we just go for it?

edsu commented 4 years ago

One thing I was wondering about, given the fact that more people have contributed to the project than are listed in the copyright statement, is whether it might be good to move the code to being in the Public Domain and not asserting copyright or a license of any kind. What do people think about that, especially @gsf, @anarchivist and @Wooble?

gsf commented 4 years ago

I'm not up to date on the latest moods around licensing, so don't know if there are negatives around going public domain, but I hereby relinquish my copyright claim if you decide to do so.

anarchivist commented 4 years ago

I'm all for public domain, but I would ask us to consider an explicit license or declaration like CC0 or the Unlicense. We could also consider 0BSD along with an explicit copyright waiver.

anarchivist commented 4 years ago

I also relinquish my copyright claim.

herrboyer commented 4 years ago

@edsu maybe it could be time to announce the end of py2 support ? For now by just updating the readme/docs/setup.py/travis.yml, and, one day, do a big cleanup (linting, proper typing, no more six, etc.) that will definitely break py2 compatibility.

Wooble commented 4 years ago

same; I have no strong opinion either way as long as whatever we do doesn't make it suddenly hard for some subset of our users to keep using the library. (IANAL but I've heard vague things about public domain not being a thing in some jurisdictions?)

[this is re: copyright, not py2 support. I'm more than happy to make the subset of our users who refuse to use py3 unhappy ;) ]

edsu commented 4 years ago

@Wooble yes, I think that's an argument for using CC0 as @anarchivist suggested. Thanks everyone for being open to exploring this.

@herrboyer do you think now is the time to lint and type, and remove six?

herrboyer commented 4 years ago

It's always time to lint !

I think a combination of flake8 + flake8-docstrings + black would provide a clean base for future development.

I use this config for most of my projects :

[flake8]
ignore = 
    # line length is handled by black
    E501
    # Missing docstring in public nested class (ex. Meta)
    D106
    # Missing docstring in magic method (__str__ ...)
    D105
    # First line should be in imperative mood -> I've never really liked/understood this one...
    D401
    # E203 & W503 are not PEP 8 compliant
    # @see https://github.com/psf/black
    W503
    E203
per-file-ignores =
    # empty init does not really require docstrings
    __init__.py D104
    # no need for docstrings in tests
    test_* D103

And we could easily add a step in the CI to check linting before running the tests, this will avoid noise related to formatting in future PRs.

I'll be happy to work on this if everybody's OK with it.

Then we could work on typing and removing six / py2 support ?

edsu commented 4 years ago

This would be awesome @herrboyer and is worth waiting for in v4 I think. I will create new issues for both pylint and python3. If you take the first then I will take the latter (but I will wait till linting is completed before doing it).

I haven't used types much in my own work, does your linter setup do something with them?

herrboyer commented 4 years ago

Great ! I started working on #148 here https://github.com/edsu/pymarc/pull/150 : please note that I plan to use flake8 instead of pylint which is a bit easier to configure (and to please...).

You'll use mypy to check types the same way you use flake8/pylint for linting (mypy something.py). The default settings are OK; the only downside is that mypy will raise neither an error nor a warning when you leave function parameters or return values untyped.

I think this Python Type Checking (Guide) helped me when I started typing my code especially with custom types, Dict, Union, etc.
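To make the typing discussion concrete, here is a minimal sketch of annotated code using custom type aliases with Dict, Union and Optional. The names `Subfields`, `FieldValue`, `get_subfield` and `titles` are made up for illustration and are not pymarc's API.

```python
from typing import Dict, List, Optional, Union

# Hypothetical, simplified stand-ins for MARC data; not pymarc's real classes.
Subfields = Dict[str, str]            # subfield code -> value
FieldValue = Union[str, Subfields]    # control field (plain str) or data field


def get_subfield(value: FieldValue, code: str) -> Optional[str]:
    """Return the subfield for `code`, or None if it is absent.

    Control fields are plain strings and have no subfields, so they
    always yield None.
    """
    if isinstance(value, str):
        return None
    return value.get(code)


def titles(fields: List[FieldValue]) -> List[str]:
    """Collect the "a" subfield of every field that has one."""
    found: List[str] = []
    for field in fields:
        title = get_subfield(field, "a")
        if title is not None:
            found.append(title)
    return found
```

Saved as something.py, this can be checked with `mypy something.py`. If silently skipping unannotated functions is a concern, mypy's `--disallow-untyped-defs` option should turn missing annotations into errors.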

dbs commented 4 years ago

I only have a handful of commits in the project's history, and I'm not listed in the contributors list, but I would prefer to maintain the current well-understood permissive license.

I don't think moving to a public domain-type license would remove the burden of maintaining a list of code contributors. See the CC0 FAQ:

CC and the Free Software Foundation suggest that if you choose to apply CC0 to software, you include the following notice at the top of each file:

<PROGRAM NAME> - <DESCRIPTION>

Written in <YEAR> by <AUTHOR NAME> <AUTHOR E-MAIL ADDRESS>

[other author/contributor lines as appropriate]

To the extent possible under law, the author(s) have dedicated all copyright and related and neighboring rights to this software to the public domain worldwide. This software is distributed without any warranty.

You should have received a copy of the CC0 Public Domain Dedication along with this software. If not, see <http://creativecommons.org/publicdomain/zero/1.0/>.

They suggest this because, under the copyright law of many nations, anything you write automatically falls under copyright by you, and you somewhat paradoxically have to assert your copyright to be able to license it or dedicate it to the public domain. Given that licensing is the focus of their organizations, it's probably best to take it seriously.

If the real problem is that it's a drag to list all of the contributors, could we solve this problem with code? e.g. something like using "git blame" to generate a list of all of the contributors to a given release and use that to create an accurate first line of the LICENSE file?

dbs commented 4 years ago

On that last point, building on this StackOverflow suggestion, here's what the current master branch list of contributors looks like:

$ git ls-tree -r -z --name-only HEAD -- */*.py | xargs -0 -n1 git blame --line-porcelain HEAD |grep  "^author "|sort|uniq -c|sort -nr
  17992 author Ed Summers
    856 author Jim Nicholls
    185 author Geoffrey Spear
    170 author Dan Michael O. Heggø
    135 author Dan Scott
    118 author Martin Czygan
    114 author Mark A. Matienzo
     54 author Gabriel Farrell
     51 author Victor Seva
     41 author Adam Constabaris
     38 author Mikhail Terekhov
     36 author cyperus-papyrus
     34 author Will Earp
     34 author David Chouinard
     25 author wrCisco
     19 author Sean Chen
     12 author Godmar Back
     12 author gitgovdoc
     11 author Simon Hohl
     10 author André Nesse
      9 author Karol Sikora
      7 author Robert Marchman
      5 author mmh
      4 author jmtaysom
      4 author Helga
      3 author eshellman
      3 author Edward Betts
      2 author Ted Lawless
      2 author Radim Řehůřek
      2 author Michael B. Klein
      2 author Dan Chudnov
      1 author nemobis
      1 author Michael J. Giarlo
      1 author Ed Hill

It's not perfect but maybe not so bad to run once per tag/release?
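For anyone who would rather run this from a release script than a shell pipeline, the counting step (grep, sort, uniq -c, sort -nr) can be sketched in Python. This is only an illustration: `count_blame_authors` is a hypothetical helper, and it assumes it is fed the lines printed by `git blame --line-porcelain`.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_blame_authors(porcelain_lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count blamed lines per author from `git blame --line-porcelain` output.

    Only lines starting with "author " are counted, which skips the
    "author-mail" and "author-time" headers. Source lines are prefixed
    with a tab in porcelain output, so they cannot match by accident.
    """
    counts: Counter = Counter()
    for line in porcelain_lines:
        if line.startswith("author "):
            counts[line[len("author "):].strip()] += 1
    # Largest count first; ties are broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The input could come from running the blame command once per tracked file (for example via `subprocess.run`) and passing `stdout.splitlines()` to the function.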

anarchivist commented 4 years ago

Incidentally, I switched a work project last fall to use all-contributors. I happen to like it, but I won't push for it too hard since it adds another set of dependencies (Node + associated modules) to the toolchain. (There's a bot too that uses GitHub Actions, but I haven't used it.)

edsu commented 4 years ago

@anarchivist thanks for sharing that! How did you end up expressing the copyright statement--was it an institution?

edsu commented 4 years ago

@dbs that's a nice hack. Would it help with the copyright statement though?

anarchivist commented 4 years ago

@edsu yes, it was. (unfortunately 😄 )

dbs commented 4 years ago

> @dbs that's a nice hack. Would it help with the copyright statement though?

It would help if you just concatenate the list of contributors to the copyright statement. So add in a |cut -c16-50 to the end of the bash command and then have a simple Python script concatenate the strings and update the README.md file for each tag/release:

Copyright (c) 2005-2020 Ed Summers, Jim Nicholls, Geoffrey Spear, Dan Michael O. Heggø, Dan Scott, Martin Czygan, Mark A. Matienzo, Gabriel Farrell, Victor Seva, Adam Constabaris, Mikhail Terekhov, cyperus-papyrus, Will Earp, David Chouinard, wrCisco, Sean Chen, Godmar Back, gitgovdoc, Simon Hohl, André Nesse, Karol Sikora, Robert Marchman, mmh, jmtaysom, Helga, eshellman, Edward Betts, Ted Lawless, Radim Řehůřek, Michael B. Klein, Dan Chudnov, nemobis, Michael J. Giarlo, Ed Hill

This could be adjusted to include email addresses, too, as those are more precise identifiers than just names. There are a few names in there that I suspect are just github nicks; if desired, a .mailmap file for the project could be used to expand those into full names & email addresses (or consolidate those who use multiple email addresses).
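A rough sketch of what that "simple Python script" might look like, under the assumption that it reads the `uniq -c` style lines shown earlier in the thread. The helper names `names_from_counts` and `copyright_statement` are invented for this example, and writing the result into README.md is left out.

```python
import re
from typing import Iterable, List

# Matches lines such as "  17992 author Ed Summers" and captures the name.
# Unlike a fixed-width `cut`, this does not truncate long names.
_COUNT_LINE = re.compile(r"^\s*\d+\s+author\s+(?P<name>.+?)\s*$")


def names_from_counts(lines: Iterable[str]) -> List[str]:
    """Extract author names, in input order, from `uniq -c` style lines."""
    names: List[str] = []
    for line in lines:
        match = _COUNT_LINE.match(line)
        if match:
            names.append(match.group("name"))
    return names


def copyright_statement(names: Iterable[str], first_year: int, last_year: int) -> str:
    """Join the names into a single project-wide copyright line."""
    return "Copyright (c) {}-{} {}".format(first_year, last_year, ", ".join(names))
```

Because the names are kept in input order, the statement lists contributors by descending line count when fed the sorted output above.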

edsu commented 4 years ago

@dbs I don't think I've ever seen a copyright statement like that before. Have you?

dbs commented 4 years ago

> @dbs I don't think I've ever seen a copyright statement like that before. Have you?

Nope. But it's the truth (or something close to it) if we extend the current practice of trying to use just a single copyright statement for the entire project, instead of applying license headers to each file in the project that state both the license of that file and the accurate list of copyright holders for / contributors to each file.

The latter approach is recommended by the GPL and Apache license, as well as the Open Source Initiative and the Producing Open Source Software book.

Note that the copyright & licensing best practices include specifying in which years a given file was actually modified by each contributor, instead of just giving the blanket range of years for the entire project.

See https://github.com/stp/stp/issues/199 for an example of a project that went through that exercise a few years ago.

My guess is nobody really wants to go through that level of granularity, however.

As an alternative to file-level copyright and licensing notices, the Software Freedom Law Center documents a centralized license notice. You still end up putting a header in each file, but that header simply states something like:

This file is part of pymarc. It is subject to the license terms in the LICENSE file found in the top-level directory of this distribution and at https://opensource.org/licenses/BSD-2-Clause. No part of pymarc, including this file, may be copied, modified, propagated, or distributed except according to the terms contained in the LICENSE file.

Given that the Software Freedom Law Center gravitated towards copyleft rather than permissive licenses, maybe that last sentence could be more encouraging for a BSD-2 project like pymarc, e.g. "pymarc may be copied, modified, propagated, or distributed according to the terms contained in the LICENSE file."

And then the LICENSE file contains the license and list of copyright statements. Something like:

Copyright 2005-2008, 2009, 2014 Ed Summers <ehs@pobox.org>
Copyright 2008 Gabriel Farrell <gsf747@gmail.com>
...

It would be easy to apply the headers, and a centralized LICENSE file with contributions broken out by year should be straightforward enough to generate periodically from the git logs.

I would be willing to work on a branch to implement the centralized license notice approach.
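As an illustration of the "broken out by year" idea, here is a small sketch that turns (author, year) pairs into LICENSE-style copyright lines, collapsing consecutive years into ranges. The function names are hypothetical, and the pairs are assumed to have been extracted from the git log beforehand (something like `git log --format='%an <%ae>|%ad' --date=format:%Y` should produce suitable input).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def collapse_years(years: Iterable[int]) -> str:
    """Render years compactly, e.g. [2005, 2006, 2007, 2014] -> "2005-2007, 2014"."""
    ordered = sorted(set(years))
    if not ordered:
        return ""
    ranges: List[Tuple[int, int]] = []
    start = prev = ordered[0]
    for year in ordered[1:]:
        if year == prev + 1:
            # Still inside a run of consecutive years.
            prev = year
            continue
        ranges.append((start, prev))
        start = prev = year
    ranges.append((start, prev))
    return ", ".join(
        str(first) if first == last else "{}-{}".format(first, last)
        for first, last in ranges
    )


def copyright_lines(commits: Iterable[Tuple[str, int]]) -> List[str]:
    """Build one "Copyright <years> <author>" line per author, sorted by author."""
    years_by_author: Dict[str, Set[int]] = defaultdict(set)
    for author, year in commits:
        years_by_author[author].add(year)
    return [
        "Copyright {} {}".format(collapse_years(years), author)
        for author, years in sorted(years_by_author.items())
    ]
```

Sorting by author string is just one choice; sorting by first contribution year or by commit count would only change the final `sorted` call.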

herrboyer commented 4 years ago

And what about switching the ownership of the project to a « pymarc » organization, transferring the license to it and maintaining an AUTHORS.md file to credit all the contributors ?

Copyright (c) 2005-2020 Pymarc organization

edsu commented 4 years ago

I've wanted to move edsu/pymarc to code4lib/pymarc for some time. But I wonder if moving it to a new pymarc organization might be a bit cleaner. According to this discussion, assigning copyright to a GitHub org should be ok as long as the existing copyright holders sign off on it?

So the proposal would be to:

@dbs & others what do you think?

dbs commented 4 years ago

The accepted response to that discussion states that the GitHub org would not be a legal entity and cannot hold copyright (1st and 4th paragraph). And the accepted response still recommends "a full list of named copyright responses" (4th paragraph).

So this doesn't really change much. You still need a centralized license header for each file. And you still need a list of the contributors, whether it's in a LICENSE file, a README, or a separate CONTRIBUTORS file.

I'm -1 to the idea of assigning copyright, especially to a non-legal entity. (I don't want to do it for journal articles, either...)

The bonus of the centralized license header approach suggested by the Software Freedom Law Center is that it doesn't include any dates, so we wouldn't have to go through and touch each file in 2021, 2022, etc. Just an automated process to update the copyright notices in the LICENSE file on a regular basis.

edsu commented 4 years ago

@dbs fair point, I guess I didn't really read that very closely :-) Let me merge in @herrboyer's Linting PR #150 and then it would be great if you would be willing to put together a PR that addresses how we might best deal with the licensing, if you are still up for that?

dbs commented 4 years ago

@edsu, I would be happy to do so!

dbs commented 4 years ago

See https://github.com/dbs/pymarc/tree/147_apply_license_headers for a rough first pass. It generates the LICENSE file with the list of contributors below the license, and adds headers to all Python files with the exception of docs/source/conf.py (which appears to be largely stock?)

The contributors are currently sorted by first name, but could pretty easily be sorted by number of commits or some other way if so desired.
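For readers who want a feel for the header step without opening the branch, below is a minimal sketch of how a centralized license notice could be prepended to a Python source file. It is not the code from the branch above. `HEADER` uses shortened placeholder wording and `add_header` is a hypothetical helper; the sketch assumes a shebang or PEP 263 encoding line should stay at the top and that re-running should change nothing.

```python
import re

# Placeholder wording; the real notice text would come from the project.
HEADER = (
    "# This file is part of pymarc. It is subject to the license terms in the\n"
    "# LICENSE file found in the top-level directory of this distribution.\n"
)

# Shebang line, or a PEP 263 style encoding declaration in a comment.
_KEEP_ON_TOP = re.compile(r"^(#!|[ \t\f]*#.*?coding[:=])")


def add_header(source: str, header: str = HEADER) -> str:
    """Return `source` with `header` inserted, leaving it unchanged if present."""
    if header in source:
        return source
    lines = source.splitlines(keepends=True)
    keep = 0
    # Only the first two lines may hold a shebang / encoding declaration.
    while keep < min(2, len(lines)) and _KEEP_ON_TOP.match(lines[keep]):
        keep += 1
    prefix = "".join(lines[:keep])
    if prefix and not prefix.endswith("\n"):
        prefix += "\n"
    return prefix + header + "\n" + "".join(lines[keep:])
```

A driver script would read each .py file, call `add_header`, and write the result back, skipping any files that should stay untouched (docs/source/conf.py in the branch described above).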