Closed jggautier closed 1 year ago
Research and findings
What are other Dataverse installations using?
At least 36 Dataverse installations have the multiple license update (running v5.10 or later) as of October 2022. A list of those installations' licenses is in the Google Sheet at https://docs.google.com/spreadsheets/d/1v5a1M3sQdVyNS0aeabm34MFHJmmaQ8JANIaxpNMlGrg.
In the Harvard repository, I took a look at what standard licenses people have entered into the Terms fields when they don't select the CC0 waiver. It's mostly different types of Creative Commons licenses.
(A Jupyter notebook includes results from a search of licenses used in datasets published in the Harvard repository.)
Generally recommended licenses
The Digital Curation Centre maintains a page at https://www.dcc.ac.uk/guidance/how-guides/license-research-data that recommends Creative Commons licenses and two other types of data licenses:
Some datasets in the Harvard repository have used an Open Data Commons license. None have used the UK National Archives Open Government License, but some datasets have used Canada's Open Government Licence (https://open.canada.ca/en/open-government-licence-canada).
What are other repository platforms, e.g. Figshare and Zenodo, using?
Representatives from several generalist repositories collect information about their repositories into a table at https://doi.org/10.5281/zenodo.3946719. The table lists the licenses that each repository makes available to depositors to apply to their deposits.
Most of these are licenses I've already reviewed. The licenses I haven't looked at - the CERN OHL and TAPR OHL licenses, Artistic 2.0, and the LGPL licenses - are all meant to be applied to software. So I haven't considered them further.
Questions
Should licenses meant only for software be included in the list?
Most of the non-Creative Commons licenses made available by other Dataverse repositories, and some licenses that other repository platforms make available, are licenses used only for software and code, such as Apache, BSD and GNU licenses. For now, I think we should not include licenses meant only for software and code. The Harvard Dataverse Repository requires that depositors always include data, and the recommended way to apply different licenses to different parts of a dataset, such as applying one license for the data files and a different one for software files, is to create a custom license (e.g. by entering text in the Terms fields or by the installation creating custom licenses that depositors can choose from the list of licenses).
What license names should we use to improve discovery across repositories?
The Dataverse community is discussing how licenses should be named (see https://github.com/IQSS/dataverse/issues/8512), so that when dataset metadata from one repository can be searched in other repositories (for example, see https://github.com/IQSS/dataverse/issues/9060), all of the records are using the same license name and can be filtered by those names. The goal is to avoid the issues we see when the same author name is spelled several different ways and the different spellings are displayed in the search facets.
Particularly for the Creative Commons licenses, there's discussion about how many dashes the license names should have. Right now, Harvard's repository and most other repositories use "CC0 1.0" as the name for the CC0 waiver. Some in the community recommend normalizing the license names by always using dashes instead of spaces, e.g. "CC0-1.0".
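To make the dash-versus-space proposal concrete, here is a minimal sketch of one possible normalization. The function name is hypothetical and the rule (replace whitespace with dashes) is only an assumption about what "always using dashes instead of spaces" would mean in practice; the community has not settled on a format.

```python
import re


def normalize_license_name(name: str) -> str:
    """Replace each run of whitespace with a single dash.

    Hypothetical helper: it only illustrates the "dashes instead of
    spaces" proposal and is not part of Dataverse.
    """
    return re.sub(r"\s+", "-", name.strip())


print(normalize_license_name("CC0 1.0"))
print(normalize_license_name("CC BY 4.0"))
```

With a rule like this, "CC0 1.0" and "CC0-1.0" would end up as the same facet value, which is the kind of consistency the discussion in #8512 is after.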
We might want to hold off on adding more licenses to the list that depositors can choose from in the Harvard repository until the community agrees on a solution, depending on how soon that happens. Otherwise, if we add licenses that depositors start using, it might take some development work (e.g. database edits) to rename the licenses if the Dataverse community decides to format the names differently later on.
On the other hand, we might not want to wait if the community takes too long to find consensus. How long is too long to wait?
Should we include multiple versions of each license?
The Harvard repository's curation team agrees that we should include Creative Commons licenses in the list of licenses that depositors can choose, but should the older versions of those licenses be included?
I think so, if only so that the depositors of published datasets with older license versions can update those datasets with the same license version from the deposit form's dropdown list, making those datasets' license metadata more machine readable.
Should depositors apply the newest version of a license to their datasets?
Should depositors use newer versions of licenses, such as CC-BY 4.0 instead of CC-BY 3.0? If depositors should use the newest versions of licenses, and if the older versions of licenses should also be in the list of licenses they can choose, how can we encourage depositors to use the newest version?
Recommendations and next steps
Stefano recommended reviewing the licenses supported by repositories listed in the comparison chart at https://zenodo.org/record/7189481#.Y1GfJOzMI8Z.
In the research I've described in my earlier comments in this issue, I've looked into many of these licenses, but there are some that I haven't seen before, such as Artistic 2.0 from OSF. So I'll look at the nature of those licenses and if datasets in the Harvard repo have used them.
The licenses listed on page 2 of the comparison chart at https://zenodo.org/record/7189481#.Y1KmfezMI8a that I haven't already reviewed are the CERN OHL and TAPR OHL licenses, Artistic 2.0, and the LGPL licenses. All of these licenses are meant to be applied to software.
I've added this review to this issue's second comment so everything's in one place.
@jggautier in reviewing this with Stefano it sounds like this is a policy issue.
Is what's needed here a discussion of which licenses to implement? The right people are you, Sonia, and Stefano. We can put this in as a spike with those folks being assigned during the sprint.
To do this:
Context:
Technically this is likely easy. The work here is deciding which licenses to offer. This is not a dev thing. Stefano, Sonia, and Julian need to decide.
There will be some cost here from a dev who will need to add this to production. Leonid or Julian may be able to do this work. The API is a super-user API, but not /admin.
Sizing: No dev effort allocated.
Confirmation:
Julian
Touched base with Julian.
@mreekie yes apparently into January at this point. Julian starts vacation next week, and I start the following week.
added to sprint Dec 15, 2022
Sonia is out this week. Will definitely slip into next week, possibly into next sprint.
Julian/Phil discussed PR #9262
Right, I think @qqmyers and I agree that as long as @jggautier sticks to licenses that are in the guides (instead of cooking up new JSON files) he shouldn't be blocked on waiting for this PR:
The reason is that we'll need to migrate licenses for other installations to the new database schema anyway. So there's no need to wait.
Thanks. Just want to clarify that "as long as @jggautier sticks to licenses that are in guides" should really be "as long as the Harvard Dataverse Repository sticks to licenses that are in the guides...". I don't want to give the impression that I'm making the final decision about which licenses to use or even that my opinion weighs more.
Lastly, I am recommending in this issue that Harvard's repo consider using some widely-used licenses that are not in the list at #8512 as of this writing. But it looks to me like that should be okay as long as we follow the principles of:
CC0-1.0
instead of CC0 1.0
@jggautier can we add JSON files to https://github.com/IQSS/dataverse/pull/9262 to keep Harvard Dataverse in sync with the licenses that will be offered in the guides? Sounds like they're widely used and not Harvard-specific.
Wouldn't that make https://github.com/IQSS/dataverse/pull/9262 dependent on this issue? If https://github.com/IQSS/dataverse/pull/9262 is merged before Harvard's repo adds more licenses, can't we just later add JSON files for the licenses that aren't added in https://github.com/IQSS/dataverse/pull/9262?
And isn't https://github.com/IQSS/dataverse/pull/9262 more about standardizing how licenses are represented in the metadata exports, and less about standardizing which licenses are used?
An update about the status of this. The licenses to choose will be discussed during the next curation team meeting on Feb. 9.
I'm not clear enough about all of the changes I see in https://github.com/IQSS/dataverse/pull/9262. But I think @pdurbin suggested in a chat we had last week that once the curation team chooses the licenses:
So we won't follow the two principles I wrote about last week, and if or when changes to the JSON files are made (that is, the way that each license is represented), the Harvard repo will need to make those changes.
During a meeting last year the curation team talked about how we can always add more licenses as needed, and I'd like to plan for how to figure out what's needed. Curators and depositors might ask us to add a license to the dropdown menu, but I also think we could plan to actively monitor the use of those licenses and what people are still typing in the Terms fields. For example, if we see that depositors add well-known licenses in the Terms fields that aren't in the dropdown menu, we could consider adding them to the dropdown menu.
This wasn't discussed on Feb. 9. @sbarbosadataverse's going to schedule a time for the curation team to meet about this.
@sbarbosadataverse and I met today to discuss this and Sonia is scheduling a meeting with others on the curation team to discuss more.
Here's what we discussed today:
We talked about license versions. Sonia would like the dropdown menu to list only the most recent versions of each license. And we would consider cases where many datasets have used an older version of a license. We could ask the depositor if they can update the datasets to use the newest version of the license. If that's not possible, we could add that older version of the license to the dropdown menu, then the depositor can update their datasets.
I just noticed that the Digital Curation Centre's data license guide talks about the differences between the versions of the CC licenses. Looks like each version has been more permissive and less restrictive than the previous version.
FWIW: The license mechanism distinguishes between which are installed and which of those are active/show up in the edit menu as options. Hopefully that handles the versioning bullet above.
Thanks, yeah that seems helpful if installation admins want to update datasets that have previous versions of licenses and then "deactivate" those previous versions so that they can't be applied to other datasets (unless we "activate" them again).
What I wasn't sure about is if we should keep those older versions of the licenses active, so depositors can continue to apply them, either when updating their datasets to take advantage of the dropdown, when adding new datasets, or when migrating datasets from someplace else. But looking at the count of CC licenses used, most datasets use 4.0 versions and relatively few use older, pre-4.0 versions. And the Digital Curation Centre actually discourages the use of pre-4.0 CC licenses for research data.
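For anyone who wants to try the activate/deactivate approach described above, here is a rough sketch that only builds the HTTP request and does not send it. The endpoint path and the X-Dataverse-key header follow my reading of the Native API guide's section on managing licenses, so treat them as assumptions to check against your installation's version. The server URL, token, and license ID are placeholders.

```python
from urllib.request import Request


def build_set_active_request(server_url: str, api_token: str,
                             license_id: int, active: bool) -> Request:
    """Build (but do not send) a superuser request that toggles a license.

    Assumed endpoint: PUT /api/licenses/{id}/:active/{true|false}.
    """
    state = "true" if active else "false"
    url = f"{server_url.rstrip('/')}/api/licenses/{license_id}/:active/{state}"
    return Request(url, method="PUT", headers={"X-Dataverse-key": api_token})


# Placeholder values: license ID 7 and the token are hypothetical.
req = build_set_active_request(
    "https://demo.dataverse.org", "xxxx-hypothetical-token", 7, False
)
print(req.get_method(), req.full_url)
```

Sending the request (for example with urllib.request.urlopen) would need a real superuser API token.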
Estimated count of CC licenses used
I wrote yesterday that I'd get a count of datasets using each CC license, so here's an estimate*:
Recommended list of licenses to use
So if we don't include pre-4.0 versions of CC licenses (which includes the "ported" versions of licenses like https://creativecommons.org/licenses/by/2.5/it), that leaves 12 options in the dropdown menu:
6 CC licenses and the 1 waiver:
https://creativecommons.org/publicdomain/zero/1.0 (CC0)
https://creativecommons.org/licenses/by/4.0
https://creativecommons.org/licenses/by-nc/4.0
https://creativecommons.org/licenses/by-nc-nd/4.0
https://creativecommons.org/licenses/by-nc-sa/4.0
https://creativecommons.org/licenses/by-nd/4.0
https://creativecommons.org/licenses/by-sa/4.0
The 4 non-CC licenses that the Digital Curation Centre recommends:
Open Data Commons Public Domain Dedication and License (PDDL): “Public Domain for data/databases”
Open Data Commons Attribution License: “Attribution for data/databases”
Open Data Commons Open Database License (ODbL): “Attribution Share-Alike for data/databases”
Open Government License (OGL): https://www.nationalarchives.gov.uk/doc/open-government-licence
The "Custom Terms" option
Adding the licenses
This zip file has JSON files for the 10 new licenses: json_for_adding_licenses_to_hdv.zip
I took the 7 JSON files for the CC waiver and licenses in the Dataverse installation guide, changing the sort order for some, and I created 4 JSON files for the 3 Open Data Commons licenses and the 1 Open Government License. (The draft pull request for https://github.com/IQSS/dataverse/issues/8512 includes JSON files for two of the three Open Data Commons licenses, so I reviewed those, too.)
When the API is used to add them, they'll appear with
Here's the order (in the UI the license "name" is shown):
We could also add the JSON files for those two non-CC licenses to the list in the installation guide.
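As a rough illustration of what one of these JSON files and the corresponding API call might look like, here is a sketch that builds a license record and an unsent POST request. The field names (name, uri, shortDescription, iconUrl, active, sortOrder) and the /api/licenses endpoint are assumptions based on my recollection of the example files and the API guide. The values for the Open Data Commons Attribution License are illustrative only and are not copied from the zip file.

```python
import json
from typing import Optional
from urllib.request import Request


def build_license_json(name: str, uri: str, short_description: str,
                       sort_order: int, icon_url: Optional[str] = None,
                       active: bool = True) -> dict:
    """Assemble a license record using the assumed field names."""
    record = {
        "name": name,
        "uri": uri,
        "shortDescription": short_description,
        "active": active,
        "sortOrder": sort_order,
    }
    if icon_url is not None:
        record["iconUrl"] = icon_url
    return record


def build_add_license_request(server_url: str, api_token: str,
                              record: dict) -> Request:
    """Build (but do not send) a POST to the assumed /api/licenses endpoint."""
    return Request(
        f"{server_url.rstrip('/')}/api/licenses",
        data=json.dumps(record).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json",
                 "X-Dataverse-key": api_token},
    )


# Illustrative values only; the name and sort order are hypothetical.
odc_by = build_license_json(
    name="ODC-By 1.0",
    uri="https://opendatacommons.org/licenses/by/1-0/",
    short_description="Open Data Commons Attribution License v1.0",
    sort_order=9,
)
request = build_add_license_request(
    "https://demo.dataverse.org", "xxxx-hypothetical-token", odc_by
)
print(json.dumps(odc_by, indent=2))
```

Keeping the record builder separate from the request builder makes it easy to write the same dictionaries out as files for the guides.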
Improving the list
After the licenses are added to the dropdown menu, we'll keep an ear out for what other licenses depositors request and an eye out for what licenses depositors enter as "custom terms".
If the list of licenses that depositors see grows, we should consider how to let depositors more easily find the licenses they need. This might already be a concern for installations with many licenses listed, so it could be a discussion for the Dataverse community.
*This is from metadata collected in October 2022 but the number of datasets published since then hasn't increased that much, so I think it's still a good enough estimate. And to simplify things, I looked only for URLs of CC licenses in the Terms fields. So for example these counts don't include cases where the depositor entered the name of a CC license but not the URL, such as the dataset at https://doi.org/10.7910/DVN/0BEZK5. But of the roughly 32,000 datasets with "custom terms" in October 2022, about 27,000 datasets had URLs of CC licenses in one of the Terms fields.
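The URL-matching approach described in this footnote can be sketched roughly as follows. This is not the code from the notebook. The regular expression, the function name, and the sample Terms strings are all assumptions made for illustration, and jurisdiction ports such as /it are folded into the same type/version key as a simplification.

```python
import re
from collections import Counter

# Matches CC license URLs (type + version) and the CC0 1.0 waiver URL.
CC_URL = re.compile(
    r"https?://(?:www\.)?creativecommons\.org/"
    r"(?:licenses/([a-z-]+)/(\d\.\d)|publicdomain/zero/(1\.0))",
    re.IGNORECASE,
)


def count_cc_licenses(terms_texts):
    """Count datasets per CC license, based only on URLs in the Terms text.

    Each dataset counts at most once per distinct license. Datasets that
    name a license without giving its URL are missed, as in the footnote.
    """
    counts = Counter()
    for text in terms_texts:
        found = set()
        for match in CC_URL.finditer(text):
            if match.group(1):
                found.add(f"{match.group(1).lower()}/{match.group(2)}")
            else:
                found.add(f"cc0/{match.group(3)}")
        counts.update(found)
    return counts


# Hypothetical Terms field values, one string per dataset.
sample_terms = [
    "Licensed under https://creativecommons.org/licenses/by/4.0/",
    "See http://creativecommons.org/licenses/by-nc/3.0 and "
    "https://creativecommons.org/licenses/by/4.0",
    "CC BY 4.0",  # license name only, no URL, so it is not counted
    "https://creativecommons.org/publicdomain/zero/1.0/",
]
counts = count_cc_licenses(sample_terms)
print(counts.most_common())
```

The third sample string shows the undercount the footnote mentions: a license named without its URL is not picked up.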
I added the 10 new licenses to the Harvard Dataverse. Also added them to the Demo Dataverse.
I'll open a new issue in the Dataverse GitHub repo about adding to the Dataverse Guides the JSON files for the four non-Creative Commons licenses that aren't there as of this writing.
In three months I'll look at how people have been applying licenses and terms metadata since the new licenses were added in the Harvard Dataverse, so we can see how often each license is being used, which licenses people continue to use that aren't in the dropdown list, and learn about other ways we might help people add machine-readable licenses and terms to their data.
I wrote that in May 2023 I'd look at how people were applying licenses and terms metadata since the new licenses were added in Harvard Dataverse. I'm very late, but here's a Google Sheet with the 8378 datasets that have been published since the new licenses were added in late February 2023:
https://docs.google.com/spreadsheets/d/1keYomqTbJ6Yu8EkEtlrezYixR2BG4WbWKlfGsUeWe-8
The spreadsheet's first tab shows the counts of datasets that use each license or use no license.
I haven't looked closely at this yet, but last year I do remember seeing these cases:
I also haven't looked closely at cases where depositors have used the Terms of Use fields to enter standard licenses that we might consider adding to the License/Data Use Agreement dropdown menu.
Lastly, after the MIT license is added to the dropdown list as part of https://github.com/IQSS/dataverse.harvard.edu/issues/248, I can update the spreadsheet again (hopefully sooner this time!) so that we can see how that license has been used.
The Harvard Dataverse Repository's license list includes only CC0. We'd like to add more licenses.
During a curation team meeting we decided to add the Creative Commons licenses listed at https://creativecommons.org/about/cclicenses.
To figure out what other licenses to add, we'd like to learn:
Steps for adding and editing licenses are at https://guides.dataverse.org/en/latest/installation/config.html#configuring-licenses