IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Spike: Add more licenses using multiple license feature #193

Closed jggautier closed 1 year ago

jggautier commented 1 year ago

The Harvard Dataverse Repository's license list includes only CC0. We'd like to add more licenses.

During a curation team meeting we decided to add the Creative Commons licenses listed at https://creativecommons.org/about/cclicenses.

To figure out what other licenses to add, we'd like to learn:

Steps for adding and editing licenses are at https://guides.dataverse.org/en/latest/installation/config.html#configuring-licenses

jggautier commented 1 year ago

Research and findings

Questions

Recommendations and next steps

jggautier commented 1 year ago

Stefano recommended reviewing the licenses supported by repositories listed in the comparison chart at https://zenodo.org/record/7189481#.Y1GfJOzMI8Z.

In the research I've described in my earlier comments in this issue, I've looked into many of these licenses, but there are some that I haven't seen before, such as Artistic 2.0 from OSF. So I'll look at the nature of those licenses and whether datasets in the Harvard repo have used them.

jggautier commented 1 year ago

The licenses listed on page 2 of the comparison chart at https://zenodo.org/record/7189481#.Y1KmfezMI8a that I haven't already reviewed are the CERN OHL and TAPR OHL licenses, Artistic 2.0, and the LGPL licenses. All of these licenses are meant to be applied to software.

I've added this review to this issue's second comment so everything's in one place.

mreekie commented 1 year ago

@jggautier in reviewing this with Stefano it sounds like this is a policy issue.

Is what's needed here a discussion of which licenses to implement? The right people are you, Sonia, and Stefano. We can put this in as a spike with those folks being assigned during the sprint.

To do this:

Context:

mreekie commented 1 year ago

Technically this is likely easy. The work here is deciding which licenses to offer. This is not a dev thing. Stefano, Sonia, and Julian need to decide.

There will be some cost here from a dev who will need to add this to production. Leonid or Julian may be able to do this work. The API is a super-user API, but not /admin.

Sizing: No dev effort allocated.

mreekie commented 1 year ago

Confirmation:

mreekie commented 1 year ago

Julian

mreekie commented 1 year ago

Touched base with Julian.

sbarbosadataverse commented 1 year ago

@mreekie yes apparently into January at this point. Julian starts vacation next week, and I start the following week.

mreekie commented 1 year ago

added to sprint Dec 15, 2022

mreekie commented 1 year ago

Sonia is out this week. Will definitely slip into next week, possibly into next sprint.

mreekie commented 1 year ago

Julian/Phil discussed PR #9262

pdurbin commented 1 year ago

Right, I think @qqmyers and I agree that as long as @jggautier sticks to licenses that are in the guides (instead of cooking up new JSON files) he shouldn't be blocked on waiting for this PR:

The reason is that we'll need to migrate licenses for other installations to the new database schema anyway. So there's no need to wait.

jggautier commented 1 year ago

Thanks. Just want to clarify that "as long as @jggautier sticks to licenses that are in guides" should really be "as long as the Harvard Dataverse Repository sticks to licenses that are in the guides...". I don't want to give the impression that I'm making the final decision about which licenses to use or even that my opinion weighs more.

Lastly, I am recommending in this issue that Harvard's repo consider using some widely-used licenses that are not in the list at #8512 as of this writing. But it looks to me like that should be okay as long as we follow the principles of:

pdurbin commented 1 year ago

@jggautier can we add JSON files to https://github.com/IQSS/dataverse/pull/9262 to keep Harvard Dataverse in sync with the licenses that will be offered in the guides? Sounds like they're widely used and not Harvard-specific.

jggautier commented 1 year ago

Wouldn't that make https://github.com/IQSS/dataverse/pull/9262 dependent on this issue? If https://github.com/IQSS/dataverse/pull/9262 is merged before Harvard's repo adds more licenses, can't we just later add JSON files for the licenses that aren't added in https://github.com/IQSS/dataverse/pull/9262?

And isn't https://github.com/IQSS/dataverse/pull/9262 more about standardizing how licenses are represented in the metadata exports, and less about standardizing which licenses are used?

jggautier commented 1 year ago

An update about the status of this. The licenses to choose will be discussed during the next curation team meeting on Feb. 9.

I'm not clear enough about all of the changes I see in https://github.com/IQSS/dataverse/pull/9262. But I think @pdurbin suggested in a chat we had last week that once the curation team chooses the licenses:

So we won't follow the two principles I wrote about last week, and if or when changes to the JSON files are made (that is, the way that each license is represented), the Harvard repo will need to make those changes.

During a meeting last year the curation team talked about how we can always add more licenses as needed, and I'd like to plan for how to figure out what's needed. Curators and depositors might ask us to add a license to the dropdown menu, but I also think we could plan to actively monitor the use of those licenses and what people are still typing in the Terms fields. For example, if we see that depositors add well-known licenses in the Terms fields that aren't in the dropdown menu, we could consider adding them to the dropdown menu.

jggautier commented 1 year ago

This wasn't discussed on Feb. 9. @sbarbosadataverse's going to schedule a time for the curation team to meet about this.

jggautier commented 1 year ago

@sbarbosadataverse and I met today to discuss this and Sonia is scheduling a meeting with others on the curation team to discuss more.

Here's what we discussed today:

qqmyers commented 1 year ago

FWIW: The license mechanism distinguishes between which are installed and which of those are active/show up in the edit menu as options. Hopefully that handles the versioning bullet above.

jggautier commented 1 year ago

Thanks, yeah that seems helpful if installation admins want to update datasets that have previous versions of licenses and then "deactivate" those previous versions so that they can't be applied to other datasets (unless we "activate" them again).

What I wasn't sure about is whether we should keep those older versions of the licenses active, so depositors can continue to apply them, either when updating their datasets to take advantage of the dropdown, when adding new datasets, or when migrating datasets from someplace else. But looking at the count of CC licenses used, most datasets use 4.0 versions and relatively few use older, pre-4.0 versions. And the Digital Curation Centre actually discourages the use of pre-4.0 CC licenses for research data.
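To make the installed/active distinction concrete, here's a minimal Python sketch. It assumes each license record carries an active flag and a sort order. The License class, its field names, and the CC BY 3.0 entry are hypothetical and only illustrate the idea, not how Dataverse actually stores licenses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    """Hypothetical license record; field names are assumptions for illustration."""
    name: str
    uri: str
    active: bool
    sort_order: int


def dropdown_options(installed):
    """Return the names of active licenses, in the order a depositor would see them."""
    visible = [lic for lic in installed if lic.active]
    return [lic.name for lic in sorted(visible, key=lambda lic: lic.sort_order)]


installed = [
    License("CC BY 4.0", "https://creativecommons.org/licenses/by/4.0", True, 2),
    License("CC0 1.0", "https://creativecommons.org/publicdomain/zero/1.0", True, 1),
    # Hypothetical: a pre-4.0 license that stays installed for existing datasets
    # but is deactivated, so it is hidden from the dropdown.
    License("CC BY 3.0", "https://creativecommons.org/licenses/by/3.0", False, 99),
]

options = dropdown_options(installed)
```

In this sketch the deactivated license is still in the installed list, so datasets that already use it keep a valid license, but it no longer shows up as an option for new deposits.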

Estimated count of CC licenses used

I wrote yesterday that I'd get a count of datasets using each CC license, so here's an estimate*:

[Screenshot: estimated count of datasets using each CC license]

Recommended list of licenses to use

So if we don't include pre-4.0 versions of CC licenses (which includes the "ported" versions of licenses like https://creativecommons.org/licenses/by/2.5/it), that leaves 12 options in the dropdown menu:

6 CC licenses and the 1 waiver:

https://creativecommons.org/publicdomain/zero/1.0 (CC0)
https://creativecommons.org/licenses/by/4.0
https://creativecommons.org/licenses/by-nc/4.0
https://creativecommons.org/licenses/by-nc-nd/4.0
https://creativecommons.org/licenses/by-nc-sa/4.0
https://creativecommons.org/licenses/by-nd/4.0
https://creativecommons.org/licenses/by-sa/4.0

The 4 non-CC licenses that the Digital Curation Centre recommends:

Open Data Commons Public Domain Dedication and License (PDDL): “Public Domain for data/databases”
Open Data Commons Attribution License: “Attribution for data/databases”
Open Data Commons Open Database License (ODbL): “Attribution Share-Alike for data/databases”
Open Government License (OGL): https://www.nationalarchives.gov.uk/doc/open-government-licence

The "Custom Terms" option

Adding the licenses

This zip file has JSON files for the 10 new licenses: json_for_adding_licenses_to_hdv.zip

I took the 7 JSON files for the CC waiver and licenses in the Dataverse installation guide, changing the sort order for some, and I created 4 JSON files for the 3 Open Data Commons licenses and the 1 Open Government License. (The draft pull request for https://github.com/IQSS/dataverse/issues/8512 includes JSON files for two of the three Open Data Commons licenses, so I reviewed those, too.)

When the API is used to add them, they'll appear with

Here's the order (in the UI the license "name" is shown):

  1. https://creativecommons.org/publicdomain/zero/1.0
  2. https://creativecommons.org/licenses/by/4.0
  3. https://creativecommons.org/licenses/by-nc/4.0
  4. https://creativecommons.org/licenses/by-nc-nd/4.0
  5. https://creativecommons.org/licenses/by-nc-sa/4.0
  6. https://creativecommons.org/licenses/by-nd/4.0
  7. https://creativecommons.org/licenses/by-sa/4.0
  8. Open Data Commons Public Domain Dedication and License (PDDL) — “Public Domain for data/databases”
  9. Open Data Commons Attribution License — “Attribution for data/databases”
  10. Open Data Commons Open Database License (ODbL) — “Attribution Share-Alike for data/databases”
  11. Open Government License (OGL): https://www.nationalarchives.gov.uk/doc/open-government-licence
  12. "Custom Terms" option
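As a rough sketch of how that order could be turned into API calls, the Python below builds one JSON payload per license, numbering sortOrder by position, and prepares (without sending) a POST request for each. The endpoint path /api/licenses, the X-Dataverse-key header, and the name, uri, active, and sortOrder fields are assumptions based on the installation guide and should be checked against it; the JSON files in the guide may carry additional fields. The license names, server URL, and token are placeholders, and only the seven CC entries are included because their URIs are listed above.

```python
import json
import urllib.request

# (name, uri) pairs in the order listed above. The display names are assumptions.
CC_ENTRIES = [
    ("CC0 1.0", "https://creativecommons.org/publicdomain/zero/1.0"),
    ("CC BY 4.0", "https://creativecommons.org/licenses/by/4.0"),
    ("CC BY-NC 4.0", "https://creativecommons.org/licenses/by-nc/4.0"),
    ("CC BY-NC-ND 4.0", "https://creativecommons.org/licenses/by-nc-nd/4.0"),
    ("CC BY-NC-SA 4.0", "https://creativecommons.org/licenses/by-nc-sa/4.0"),
    ("CC BY-ND 4.0", "https://creativecommons.org/licenses/by-nd/4.0"),
    ("CC BY-SA 4.0", "https://creativecommons.org/licenses/by-sa/4.0"),
]


def build_license_payloads(entries):
    """Turn (name, uri) pairs into license payloads, numbering sortOrder from 1."""
    return [
        {"name": name, "uri": uri, "active": True, "sortOrder": position}
        for position, (name, uri) in enumerate(entries, start=1)
    ]


def build_request(server_url, api_token, payload):
    """Prepare (but do not send) a POST request for one license payload."""
    return urllib.request.Request(
        url=f"{server_url}/api/licenses",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Dataverse-key": api_token,  # assumed header name; needs a superuser token
        },
        method="POST",
    )


payloads = build_license_payloads(CC_ENTRIES)
# Placeholder server and token; nothing is sent over the network here.
requests_to_send = [
    build_request("https://dataverse.example.edu", "PLACEHOLDER-TOKEN", p)
    for p in payloads
]
```

Sending each prepared request would be a separate, deliberate step (for example with urllib.request.urlopen), which keeps it easy to review the payloads before anything changes on a production installation.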

We could also add the JSON files for those two non-CC licenses to the list in the installation guide.

Improving the list

After the licenses are added to the dropdown menu, we'll keep an ear out for what other licenses depositors request and an eye out for what licenses depositors enter as "custom terms".

If the list of licenses that depositors see grows, we should consider how to let depositors more easily find the licenses they need. This might already be a concern for installations with many licenses listed, so it could be a discussion for the Dataverse community.

*This is from metadata collected in October 2022 but the number of datasets published since then hasn't increased that much, so I think it's still a good enough estimate. And to simplify things, I looked only for URLs of CC licenses in the Terms fields. So for example these counts don't include cases where the depositor entered the name of a CC license but not the URL, such as the dataset at https://doi.org/10.7910/DVN/0BEZK5. But of the roughly 32,000 datasets with "custom terms" in October 2022, about 27,000 datasets had URLs of CC licenses in one of the Terms fields.
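The URL-only matching described in this footnote could look roughly like the Python sketch below. The regular expression, function names, and sample records are illustrative assumptions, not the script used for the counts above.

```python
import re
from collections import Counter

# Matches CC license and public-domain URLs and captures the license code and version.
# Anything after the version (such as a ported jurisdiction like /it) is ignored.
CC_URL = re.compile(
    r"https?://creativecommons\.org/"
    r"(?:licenses|publicdomain)/(?P<code>[a-z\-]+)/(?P<version>\d+\.\d+)",
    re.IGNORECASE,
)


def cc_licenses_in(terms_text):
    """Return the set of (code, version) pairs for CC URLs found in one dataset's terms."""
    return {
        (match.group("code").lower(), match.group("version"))
        for match in CC_URL.finditer(terms_text)
    }


def count_cc_usage(terms_by_dataset):
    """Count datasets per CC license; a dataset counts once per distinct license URL."""
    counts = Counter()
    for text in terms_by_dataset.values():
        counts.update(cc_licenses_in(text))
    return counts


# Hypothetical sample records keyed by made-up identifiers.
sample = {
    "dataset-1": "Licensed under https://creativecommons.org/licenses/by/4.0/",
    "dataset-2": "See http://creativecommons.org/licenses/by-nc/3.0/us/ for terms.",
    "dataset-3": "This dataset is released under CC BY 4.0.",  # name only, so not counted
    "dataset-4": "CC0: https://creativecommons.org/publicdomain/zero/1.0/",
}

counts = count_cc_usage(sample)
```

As with the estimate above, a dataset that names a license without giving its URL (dataset-3 in the sample) is missed by this approach, so the counts are a lower bound.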

jggautier commented 1 year ago

I added the 10 new licenses to the Harvard Dataverse. Also added them to the Demo Dataverse.

I'll open a new issue in the Dataverse GitHub repo about adding to the Dataverse Guides the JSON files for the four non-Creative Commons licenses that aren't there as of this writing.

jggautier commented 1 year ago

In three months I'll look at how people have been applying licenses and terms metadata since the new licenses were added in the Harvard Dataverse. That way we can see how often each license is being used, which licenses people continue to use that aren't in the dropdown list, and learn about other ways we might help people add machine-readable licenses and terms to their data.

jggautier commented 5 months ago

Related to https://github.com/IQSS/dataverse.harvard.edu/issues/248

jggautier commented 5 months ago

I wrote that in May 2023 I'd look at how people were applying licenses and terms metadata since the new licenses were added in Harvard Dataverse. I'm very late, but here's a Google Sheet with the 8378 datasets that have been published since the new licenses were added in late February 2023:

https://docs.google.com/spreadsheets/d/1keYomqTbJ6Yu8EkEtlrezYixR2BG4WbWKlfGsUeWe-8

The spreadsheet's first tab shows the counts of datasets that use each license or use no license.

I haven't looked closely at this yet, but last year I do remember seeing these cases:

I also haven't looked closely at cases where depositors have used the Terms of Use fields to enter standard licenses that we might consider adding to the License/Data Use Agreement dropdown menu.

Lastly, after the MIT license is added to the dropdown list as part of https://github.com/IQSS/dataverse.harvard.edu/issues/248, I can update the spreadsheet again (hopefully sooner this time!) so that we can see how that license has been used.