IQSS / dataverse

Open source research data repository software
http://dataverse.org

Feature Request/Idea: Standardize standard license configuration #8512

Open philippconzett opened 2 years ago

philippconzett commented 2 years ago

Overview of the Feature Request With version 5.10, the long-awaited multiple-license support was released (see release notes). Thanks to all contributors! To better support interoperability between Dataverse installations and beyond Dataverse installations, I'd like to suggest standardizing the way standard license configuration is managed using multiple-license support as follows:

  1. Standardized licenses are provided as authoritative JSON files stored in the IQSS/dataverse GitHub repository or the GDCC GitHub repository.
  2. The section Adding Licenses in the Dataverse Installation Guide links to the GitHub folder containing the JSON files.
  3. We agree on the source and content of the elements in the JSON files. Here are some suggestions and possible issues to discuss:
    • Source: Whenever possible, we use the information provided in the SPDX License List and the license webpage provided by the license issuer.
    JSON elements:
    • name: Field Identifier in the SPDX License List, but without hyphens, e.g., Artistic-2.0 > Artistic 2.0
    • uri: 1) License URI provided by the license issuer; 2) if (1) is not available, SPDX URI for the license
    • shortDescription: Field Full name in the SPDX License List. Question 1: Do we need to add "License" or "Dedication", as is currently done in the JSON files provided in the Dataverse Installation Guide? Question 2: Do we need to add a full stop at the end of the shortDescription element, as is currently done in the JSON files provided in the Dataverse Installation Guide?
    • iconUrl: As provided by license issuer
  4. We agree on which standard licenses to provide as JSON files in the GitHub repository. To start with, I suggest we concentrate on the following ones:
    • Creative Commons Zero 1.0
    • All Creative Commons Attribution (BY) licenses 4.0 and later
    • Open Data Commons Open Database License v1.0
    • Open Data Commons Attribution License v1.0
    • All licenses in the SPDX License List that are FSF Free/Libre and OSI Approved, starting with licenses included in Open Science Framework (OSF):

Content:

Code - Permissive:

Code - Copyleft:

Code - Other:

Following the suggested guidelines above, I have created a Google spreadsheet containing the necessary information to create JSON files, and I created those files by running a bash file. All these documents are available in this Google folder (you might need to log in to access it).

At a later stage, this could of course be automated by retrieving information directly from SPDX and license issuers, possibly via a controlled vocabulary hosted on SKOSMOS.
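The element rules suggested in point 3 can be sketched in Python. This is only an illustration under assumptions: the function name `build_license_json` is made up here, and the SPDX fallback URI pattern (`https://spdx.org/licenses/<id>.html`) is an assumption about how SPDX license pages are addressed, not something fixed in this issue.

```python
def build_license_json(spdx_id, spdx_full_name, issuer_uri=None, icon_url=None):
    """Build a license record following the guidelines suggested in this issue.

    Hypothetical helper: the name and signature are illustrative only.
    """
    record = {
        # name: SPDX "Identifier" with hyphens replaced by spaces
        "name": spdx_id.replace("-", " "),
        # uri: issuer URI if available, otherwise an (assumed) SPDX page URI
        "uri": issuer_uri or f"https://spdx.org/licenses/{spdx_id}.html",
        # shortDescription: SPDX "Full name"
        "shortDescription": spdx_full_name,
    }
    # iconUrl: only when the license issuer provides one
    if icon_url:
        record["iconUrl"] = icon_url
    return record


artistic = build_license_json("Artistic-2.0", "Artistic License 2.0")
cc_by = build_license_json(
    "CC-BY-4.0",
    "Creative Commons Attribution 4.0 International",
    issuer_uri="https://creativecommons.org/licenses/by/4.0/",
    icon_url="https://licensebuttons.net/l/by/4.0/88x31.png",
)
```

Note that applying the "without hyphens" rule literally turns `CC-BY-NC-4.0` into `CC BY NC 4.0`, which is not the spelling Creative Commons itself uses (`CC BY-NC 4.0`), so the name rule may need per-issuer exceptions.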

What kind of user is the feature intended for? The suggested feature is primarily intended for Sysadmins who need to install licenses on their Dataverse installation.

What inspired the request? The implementation of multiple license support released in v5.10.

What existing behavior do you want changed? The different Dataverse installations adding the same standard license with (slightly) different license information.

Any brand new behavior do you want to add to Dataverse? No, thanks.

Any related open or closed issues to this feature request? Multiple licences feature proposal #7440

qqmyers commented 2 years ago

FWIW:

philippconzett commented 2 years ago

Thanks for your comments, @qqmyers. Just a short reply to your first bullet point. I think the way you have done this in v5.10 is already in line with my suggestion; cf.

uri: 1) License URI provided by the license issuer; 2) if (1) is not available, SPDX URI for the license

I think we only should use the SPDX URI when there is no (authoritative) URI provided by the license issuer.

poikilotherm commented 2 years ago

Thank you @philippconzett for starting this discussion. This is related to proper future software support (important for our project HERMES), so I'm taking the liberty to join it.

As 5.10 included the first iteration of multi license support, I think we should be very careful when taking the next steps.

Some context about interoperability: RO-Crate 1.1 uses this JSON-LD schema.org based representation of a license:

{
  "@id": "https://creativecommons.org/licenses/by/4.0/",
  "@type": "CreativeWork",
  "name": "CC BY 4.0",
  "description": "Creative Commons Attribution 4.0 International License"
}

@qqmyers removed the generation of this JSON-LD part from the code and replaced it with the URL only (which is perfectly valid schema.org syntax). The RO-Crate description field is our shortDescription, RO-Crate name stays name, etc. (We might even use our iconUrl for schema.org/RO-Crate thumbnailUrl)
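The field mapping described above could look roughly like this in Python. It is a sketch, not the code that was removed from Dataverse: the function name `to_ro_crate_license` is hypothetical, and using `thumbnailUrl` for the icon is only the possibility floated above.

```python
def to_ro_crate_license(license_record):
    """Map a Dataverse-style license record to an RO-Crate 1.1 license entity.

    Hypothetical helper for illustration only.
    """
    entity = {
        "@id": license_record["uri"],
        "@type": "CreativeWork",
        "name": license_record["name"],
        # RO-Crate "description" corresponds to our shortDescription
        "description": license_record["shortDescription"],
    }
    # Optional: reuse iconUrl as schema.org thumbnailUrl (an idea, not a decision)
    icon = license_record.get("iconUrl")
    if icon:
        entity["thumbnailUrl"] = icon
    return entity


entity = to_ro_crate_license({
    "name": "CC BY 4.0",
    "uri": "https://creativecommons.org/licenses/by/4.0/",
    "shortDescription": "Creative Commons Attribution 4.0 International License",
})
```

Without an `iconUrl`, the result has the same four keys as the RO-Crate example above.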

@qqmyers: looking at https://github.com/spdx/license-list-data/blob/master/json/licenses.json, there are licenses marked as deprecated - maybe we need to open an issue at https://github.com/spdx/license-list-XML and talk to them about PDDC being deprecated by upstream (there isn't an issue for this yet).
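As a rough illustration of how deprecated entries could be skipped when reading that file: the sketch below assumes the `licenses.json` layout with a top-level `licenses` array whose entries carry `licenseId` and `isDeprecatedLicenseId`. The sample is a hand-written excerpt in that shape, not a copy of the real file, and `active_license_ids` is a made-up name.

```python
import json

# Hand-written sample in the shape of SPDX licenses.json (illustrative values).
SAMPLE = json.loads("""
{
  "licenses": [
    {"licenseId": "CC-BY-4.0", "isDeprecatedLicenseId": false},
    {"licenseId": "GPL-2.0", "isDeprecatedLicenseId": true},
    {"licenseId": "GPL-2.0-only", "isDeprecatedLicenseId": false}
  ]
}
""")


def active_license_ids(spdx_data):
    """Return the sorted identifiers of all entries not flagged as deprecated."""
    return sorted(
        entry["licenseId"]
        for entry in spdx_data["licenses"]
        if not entry.get("isDeprecatedLicenseId", False)
    )


ids = active_license_ids(SAMPLE)
```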

(Future: IMHO it would be great to have a summary in our UI, so people don't need to look at license texts. Maybe grabbing the quick summaries from https://tldrlegal.com helps?)

philippconzett commented 2 years ago

Thanks for your feedback, @poikilotherm! I wasn't aware that RO-Crate had already addressed this issue. My main concern was just to make sure that standard licenses are described in the same way across Dataverse installations.

philippconzett commented 2 years ago

When I mentioned that my suggestion was meant to improve interoperability between Dataverse installations and beyond Dataverse installations, I first of all had in mind that license information from Dataverse installations should be made harvestable in a way that complies with recommendations. I'm not sure about the status of RO-Crate, but a standard that is already implemented and widely used is the DataCite Metadata Schema. The current version of this schema, v.4.4 (cf. https://schema.datacite.org/meta/kernel-4.4/), says the following about license information:

| ID | DataCite-Property | Occ | Definition | Allowed values, examples, other constraints |
|---|---|---|---|---|
| 16 | Rights | 0-n | Any rights information for this resource. The property may be repeated to record complex rights characteristics. | Free text. Provide a rights management statement for the resource or reference a service providing such information. Include embargo information if applicable. Use the complete title of a license and include version information if applicable. May be used for software licenses. Examples: Creative Commons Attribution 3.0 Germany License; Apache License, Version 2.0 |
| 16.a | rightsURI | 0-1 | The URI of the license. | Example: https://creativecommons.org/licenses/by/3.0/de/ |
| 16.b | rightsIdentifier | 0-1 | A short, standardized version of the license name. | Example: CC-BY-3.0. A list of identifiers for commonly-used licenses may be found here: (https://spdx.org/licenses/). |
| 16.c | rightsIdentifierScheme | 0-1 | The name of the scheme. | Example: SPDX |
| 16.d | schemeURI | 0-1 | The URI of the rightsIdentifierScheme. | Example: https://spdx.org/licenses/ |

As the license identifier, DataCite requires "a short, standardized version of the license name", and they suggest using the SPDX identifier.

Based on the DataCite recommendations, I've updated the Google spreadsheet (see tab "English v.0.2") and the JSON files for the standard licenses I suggest we should provide on GitHub; see this Google folder.

As far as I can see, none of the standard licenses I suggest we should provide on GitHub are obsolete, so this shouldn't be a show stopper. Also pinging @janvanmansum for feedback.

philippconzett commented 2 years ago

Here are two JSON examples created following the suggested workflow above:

{
  "rightsName": "CC0 1.0",
  "rightsURI": "https://creativecommons.org/publicdomain/zero/1.0/",
  "rightsIdentifier": "CC0-1.0",
  "rightsIdentifierScheme": "SPDX",
  "schemeURI": "https://spdx.org/licenses/",
  "rightsShortDescription": "Creative Commons Zero v1.0 Universal.",
  "rightsIconUrl": "https://licensebuttons.net/p/zero/1.0/88x31.png",
  "rightsActive": true
}
{
  "rightsName": "CC BY 4.0",
  "rightsURI": "https://creativecommons.org/licenses/by/4.0/",
  "rightsIdentifier": "CC-BY-4.0",
  "rightsIdentifierScheme": "SPDX",
  "schemeURI": "https://spdx.org/licenses/",
  "rightsShortDescription": "Creative Commons Attribution 4.0 International.",
  "rightsIconUrl": "https://licensebuttons.net/l/by/4.0/88x31.png",
  "rightsActive": true
}

@qqmyers @pdurbin I guess we might have to change back some of the field names so that this doesn't mess up your current setup, e.g., rightsName >> name?

I don't know what needs to be done to discuss this further, but I'd be happy to contribute as suggested above. For example, if you create a suitable place on GitHub, I could create and upload the JSON files once we've agreed on how they should look. Thanks!

pdurbin commented 2 years ago

I don't know what needs to be done to discuss this further

@philippconzett I'm not sure either. Perhaps we can try to make the problem more concrete with a scenario and a screenshot.

Imagine a future where you're harvesting datasets from another Dataverse installation with slightly different names. Also imagine that there's a search facet called "License" that makes these differences obvious at a glance:

Screen Shot 2022-03-28 at 2 02 21 PM

Once the data is in a facet like this, it's obvious that there's a problem, that counts of the same license should be combined.

philippconzett commented 2 years ago

Thanks, @pdurbin, and sorry for my late reply.

The scenario you described above is definitely an example of what might be an undesired result of the current way of configuring standard licenses. A similar situation could arise in search engines supporting search/filtering based on license information, e.g., in the advanced search of BASE (https://www.base-search.net/Search/Advanced); cf. this mock-up screenshot:

image

In general, I think we should aim at providing license information along the recommendations of DataCite.

I'd be happy to create a pull request, but I need some help:

I suggest we make this a prioritized PR because the longer we wait, the more likely it becomes that installations configure multi-license support with the current set-up, which means that they would have to do some clean up to change the license information to be aligned with the standardized way suggested in this issue.

pdurbin commented 2 years ago

@philippconzett thanks. If the goal is to keep the Dataverse community together perhaps the best place for the JSON files is where they already are, in the main repo. That way, they seem more official, they can be part of the guides, and if the JSON structure needs to evolve (new fields/columns like you say), it can happen in the same pull request as the code and database changes.

I guess what I'm saying is, what if we consider the licenses in the main repo official already? And if we don't like something about them (they need more or different fields), what if we let them evolve in the main repo, at least for a while?

There are currently 453 licenses in your spreadsheet. If we were to start adding more licenses to the main repo, would you want all of them at once? (Do you plan to present all 453 to your users?) A subset? How many? Thanks. For others, here's a link to your spreadsheet: https://docs.google.com/spreadsheets/d/1f_-z6vWijOvIc0tI1ezWeDEgM3U9w5qynllfyNqWYU8/edit?usp=sharing

philippconzett commented 2 years ago

Thanks, Phil!

Keeping the JSON files in the main repo sounds reasonable.

As for the number of licenses/JSON files, I suggest starting with only a small selection, as described above; see point 4 in the first posting. These 28 licenses are all marked with "true" in column M (=active) in the spreadsheet. I have now sorted the spreadsheet to make them appear on top. The JSON files of these licenses are in the folder "JSON files v.0.2" in the shared Google folder: https://drive.google.com/drive/folders/11BF5tZ9K_S0rxrWErFQYgSCX_geQtHtq?usp=sharing.

jggautier commented 2 years ago

Thanks for pinging me @philippconzett. This issue reminds me of that "things, not strings" saying, which I think is usually used when talking about knowledge graphs, but it makes sense here. I think your idea in this issue will improve the chances that most Dataverse installations will use the same strings to describe the same things.

I'm less sure it would improve interoperability "beyond Dataverse installations". What if, when a Dataverse repository that prefers displaying a "CC-0" license as "CC 0" harvests metadata from a source that uses "CC0", the Dataverse software could figure out that "CC0" is the same thing as "CC-0" and use that when displaying search results (like as facets)? Since the Dataverse software doesn't have facets for the Terms metadata, this problem isn't as noticeable now, so maybe we can cross that bridge when we get to it.

djbrooke commented 2 years ago

Hi all! I hope everyone is doing well.

I noted a similar problem in a different community, and just as a point of information it may be interesting to follow how they solve it: https://github.com/huggingface/datasets/issues/4298

philippconzett commented 2 years ago

Thanks, @jggautier + @djbrooke!

@jggautier I'm not sure I agree with you on interoperability beyond Dataverse installations. In my understanding, the main point with the DataCite Metadata Schema recommendations is to make harvested metadata interoperable. Of course, Dataverse, Dataverse installations or DataCite could create crosswalks/scripts to transform the exposed metadata into the desired DataCite format, but why not make the metadata available in a DataCite-aligned way to start with?

I now realize that starting a discussion like this on GitHub is not a good idea, as only a few people in the community systematically review GitHub issues. I'll raise the issue in the Dataverse Google group, because I think DataCite-aligned metadata is important for many Dataverse installations. Thanks!

poikilotherm commented 2 years ago

Please note, as I recently learned, that the DataCite Metadata Export exposed via OAI-PMH is not valid XML. The export also uses an outdated schema and a subset of the schema's possibilities (example is #7077).

I agree with you we should discuss this somewhere else to include more people's views.

philippconzett commented 2 years ago

I've raised the issue in the Dataverse Google group: https://groups.google.com/u/1/g/dataverse-community/c/4qSr0mkcyOw.

philippconzett commented 2 years ago

I'm adding another illustration of why this feature request should be prioritized: Metadata from Dataverse-based repositories are currently not correctly harvested by DataCite. This includes the license information. If you look at a DataCite metadata record from, say, Pangaea, e.g., https://search.datacite.org/works/10.1594/pangaea.940188, you can download the metadata in different formats, and you'll find correct license information:

"rightsList": [
    {
      "rights": "Creative Commons Attribution 4.0 International",
      "rightsUri": "https://creativecommons.org/licenses/by/4.0/legalcode",
      "schemeUri": "https://spdx.org/licenses/",
      "rightsIdentifier": "cc-by-4.0",
      "rightsIdentifierScheme": "SPDX"
    }
]

Based on this license information, the metadata are then harvested and indexed in other discovery services, e.g., Primo (see this discussion thread in the Dataverse Google group).

On the other hand, Dataverse-based repositories do not expose license information in the way DataCite expects, and thus the DataCite metadata records from Dataverse-based repositories are lacking license information. Here's an example from DataverseNO, and here's one from DataverseNL (@janvanmansum @4tikhonov), here one from the Australian Data Archive (@stevenmce), here one from Harvard Dataverse (@pdurbin @jggautier), here one from Jülich DATA (@poikilotherm), here one from Odum (@donsizemore), and here one from Scholars Portal (@amberleahey @kaitlinnewson @meghangoodchild). As you see (cf. the DataCite JSON file), the rightsList is empty:

"rightsList": [],

As a result, if you search for data in Dataverse-based repositories in discovery services like Primo, you'll be told that you cannot access these datasets. The reason is that these services don't have access to the license information of these datasets and assume they are not Open Access.

qqmyers commented 2 years ago

Dataverse does not send any rights information to DataCite - I believe it is the same as the datacite.xml metadata export. If we sent what we have now, it would be an improvement.
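To make "what we have now" concrete, here is a small Python sketch of a DataCite `rightsList` entry built from the existing `name` and `uri` fields, with the SPDX fields added only when an identifier is known. The key names follow the DataCite JSON shown earlier in this thread; the function name `to_datacite_rights` is hypothetical, and this is not how Dataverse currently registers metadata.

```python
SPDX_SCHEME_URI = "https://spdx.org/licenses/"


def to_datacite_rights(name, uri, spdx_id=None):
    """Build one rightsList entry; SPDX fields are optional.

    Hypothetical helper for illustration only.
    """
    entry = {"rights": name, "rightsUri": uri}
    if spdx_id:
        entry.update({
            "rightsIdentifier": spdx_id,
            "rightsIdentifierScheme": "SPDX",
            "schemeUri": SPDX_SCHEME_URI,
        })
    return entry


# With only the fields Dataverse stores today:
minimal = to_datacite_rights("CC0 1.0", "https://creativecommons.org/publicdomain/zero/1.0/")
# With the SPDX identifier proposed in this issue:
full = to_datacite_rights("CC0 1.0", "https://creativecommons.org/publicdomain/zero/1.0/", "CC0-1.0")
rights_list = [full]
```

Even the minimal entry would replace the empty `rightsList` with a license name and URI.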

philippconzett commented 1 year ago

The part about delivering rights metadata to DataCite is related to issue #5889.

jggautier commented 1 year ago

This was a topic on the agenda for the Dataverse Metadata Interest Group meeting the community had on Oct 6, 2022. But we didn't have time to discuss. I'm hoping we can discuss in follow up meetings.

The curation team at Harvard's repo is looking into adding more licenses to the Harvard repo (https://github.com/IQSS/dataverse.harvard.edu/issues/193). Some of the team's research involves exploring what licenses are already being used by other repositories that have the Dataverse software's v5.10 multiple license update, and I think it would be helpful to share that data here.

For as many of the repositories on the Dataverse map where I could get the API endpoint for returning license information to work, I collected and organized the information about those installations' licenses into a publicly viewable Google Sheet.

pdurbin commented 1 year ago

information about those installations' licenses into a publicly viewable Google Sheet.

Great work, Julian! It looks like the duplicates so far (from the "Data" sheet) have to do with having hyphens (-) or spaces ( ) in the "CC" licenses:

num  name
  16 CC BY 4.0
   5 CC-BY-4.0
   8 CC BY-ND 4.0
   3 CC-BY-ND-4.0
  10 CC BY-NC 4.0
   4 CC-BY-NC-4.0
  12 CC BY-SA 4.0
   4 CC-BY-SA-4.0
   9 CC BY-NC-SA 4.0
   4 CC-BY-NC-SA-4.0
   8 CC BY-NC-ND 4.0
   4 CC-BY-NC-ND-4.0
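
The duplicates above differ only in hyphens versus spaces, so they can be merged with a simple normalization. The Python sketch below uses the counts from the list above; the function names are made up, and the normalization only covers this one kind of variation.

```python
from collections import Counter

# (count, name) pairs taken from the list above
OBSERVED = [
    (16, "CC BY 4.0"), (5, "CC-BY-4.0"),
    (8, "CC BY-ND 4.0"), (3, "CC-BY-ND-4.0"),
    (10, "CC BY-NC 4.0"), (4, "CC-BY-NC-4.0"),
    (12, "CC BY-SA 4.0"), (4, "CC-BY-SA-4.0"),
    (9, "CC BY-NC-SA 4.0"), (4, "CC-BY-NC-SA-4.0"),
    (8, "CC BY-NC-ND 4.0"), (4, "CC-BY-NC-ND-4.0"),
]


def normalize(name):
    """Treat spaces and hyphens as equivalent and emit a hyphenated key."""
    return "-".join(name.replace("-", " ").split()).upper()


def merge_counts(observed):
    """Sum the counts of all names that normalize to the same key."""
    merged = Counter()
    for count, name in observed:
        merged[normalize(name)] += count
    return merged


merged = merge_counts(OBSERVED)
```

After merging, the twelve names collapse into six keys, for example `CC-BY-4.0` with 21 installations.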

DieuwertjeBloemen commented 1 year ago

I believe the version with hyphens is the standardized version using the SPDX list. This is also the list DataCite recommends in its metadata model for the rightsIdentifier:

A short, standardized version of the license name Example: CC-BY-3.0 A list of identifiers for commonly-used licenses may be found here: (https://spdx.org/licenses/). (https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf)

philippconzett commented 1 year ago

Thanks for this overview, @jggautier! @DieuwertjeBloemen: I also think we should stick to the DataCite/SPDX recommendation, i.e., use hyphens. Or if we prefer not having hyphens or having fewer hyphens, we could have two fields: an identifier field with hyphens according to SPDX, and a name field with as many or as few hyphens as we want.

pdurbin commented 1 year ago

I also think we should stick to the DataCite/SPDX recommendation, i.e., use hyphens.

Uh oh. We ship with the "no hyphens" version (which explains the higher counts for that version in the wild):

$ ack "CC.*BY" scripts/api/data/licenses
scripts/api/data/licenses/licenseCC-BY-NC-ND-4.0.json
2:  "name": "CC BY-NC-ND 4.0",

scripts/api/data/licenses/licenseCC-BY-NC-SA-4.0.json
2:  "name": "CC BY-NC-SA 4.0",

scripts/api/data/licenses/licenseCC-BY-4.0.json
2:  "name": "CC BY 4.0",

scripts/api/data/licenses/licenseCC-BY-NC-4.0.json
2:  "name": "CC BY-NC 4.0",

scripts/api/data/licenses/licenseCC-BY-SA-4.0.json
2:  "name": "CC BY-SA 4.0",

scripts/api/data/licenses/licenseCC-BY-ND-4.0.json
2:  "name": "CC BY-ND 4.0",

qqmyers commented 1 year ago

We use that version because that's what Creative Commons uses.

aeonSolutions commented 1 year ago

@philippconzett personally, for science and also for the early stages of entrepreneurial development, I find the Creative Commons licenses more useful, in particular the share-alike non-commercial one:

"Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)"

https://creativecommons.org/licenses/by-nc-sa/3.0/

For data, for software, and also for content writing.

In short, it promotes an author's work publicly faster than any other license while ensuring author traceability.

DieuwertjeBloemen commented 1 year ago

It would make sense to use the Creative Commons convention for CC licenses if other licenses couldn't also be used in Dataverse and if DataCite didn't ask for another standard. Especially now that computational workflows and other code/software type data will be published, I think we should use the standard that applies to both CC licenses and the other types of licenses available. And that would be the SPDX standard as stated by DataCite. So, in my opinion, it should be the all-hyphenated version for all types of licenses.

@aeonSolutions: we're not discussing the inclusion or exclusion of CC licenses, just how their "rightsIdentifier" a.k.a. their "name" should be provided in Dataverse and the metadata exports.

philippconzett commented 1 year ago

On November 10, we had a Dataverse IG/WG meeting about this feature request. I understood that the main takeaway/agreement from that meeting was as follows:

a) The metadata delivered from Dataverse to DataCite should be in line with their recommendation, i.e. the rightsIdentifier field in the JSON file should be identical with the corresponding field on the SPDX license list. In addition, the JSON file should contain the other recommended fields, i.e. rightsURI, rightsIdentifierScheme, and schemeURI.

b) The JSON files provided with new versions of Dataverse will be adapted accordingly.

c) For those installations that already have been using multiple license support, SQL scripts will be provided to update existing license information to be aligned with the new setup.
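
As a rough idea of what the update in (c) might involve, the Python sketch below adds an SPDX identifier to existing license rows, matching on the license URI rather than on the name, since names vary between installations. Everything about the schema is an assumption for illustration: the table `license`, its columns, and the new column `rightsidentifier` are hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

# URI -> SPDX identifier, using the two examples given earlier in this thread
SPDX_BY_URI = {
    "https://creativecommons.org/licenses/by/4.0/": "CC-BY-4.0",
    "https://creativecommons.org/publicdomain/zero/1.0/": "CC0-1.0",
}


def add_spdx_identifiers(conn, spdx_by_uri):
    """Add a (hypothetical) rightsidentifier column and fill it by URI."""
    with conn:  # commit on success, roll back on error
        conn.execute("ALTER TABLE license ADD COLUMN rightsidentifier TEXT")
        for uri, spdx_id in spdx_by_uri.items():
            conn.execute(
                "UPDATE license SET rightsidentifier = ? WHERE uri = ?",
                (spdx_id, uri),
            )


# Hypothetical minimal schema and data for the demonstration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE license (id INTEGER PRIMARY KEY, name TEXT, uri TEXT)")
conn.executemany(
    "INSERT INTO license (name, uri) VALUES (?, ?)",
    [
        ("CC BY 4.0", "https://creativecommons.org/licenses/by/4.0/"),
        ("CC0 1.0", "https://creativecommons.org/publicdomain/zero/1.0/"),
        ("Custom terms", "https://example.org/custom-license"),
    ],
)
add_spdx_identifiers(conn, SPDX_BY_URI)
rows = conn.execute("SELECT name, rightsidentifier FROM license ORDER BY id").fetchall()
```

Licenses without a known SPDX identifier, such as the custom one here, keep an empty identifier and are left for manual review.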

I have updated the Google spreadsheet (see tab "English v.0.3" in file "Dataverse_Standard_Licenses") and the JSON files for the standard licenses I suggest we should provide; see this Google folder. The update is based on Julian's list of licenses that are currently used at Dataverse installations (see file "Licenses_used_by_Dataverse_installations_2022-11-15"). All licenses included on Julian's list (+ 2 licenses that DataverseNO is going to use) that are also found on the SPDX list are marked with "Y" in column H ("Dataverse").

Here are some issues we may want to discuss before we run steps a-c above:

  1. Some of the licenses on Julian's list are not on the SPDX list, e.g., the Spanish versions of CC licenses. How do we handle these?
  2. What do we want to be displayed on the dataset landing page, the string in rightsIdentifier field or the string in the rightsName field?
  3. Where do we want to provide the JSON files; as part of the release note (current solution), or in the GDCC GitHub repo, as discussed earlier?
  4. In addition to rightsIdentifier, what information do we need/want to be stored in the Dataverse database?

jggautier commented 1 year ago

I was asked in the Nov. 10 metadata IG/WG meeting to provide info about the use of licenses in Dataverse installations (as opposed to info about the standard licenses that each installation makes available).

It's too much info to provide in a Google Sheet (too many rows and columns), so I'm adding it in a zip file here: licenseAndTermsMetdaataInDataverseInstallations.zip. This metadata was collected in early October 2022.

It might be helpful if we want to talk with depositors who've used specific licenses that installations have made available.

philippconzett commented 1 year ago

As for my question 1 above:

  1. Some of the licenses on Julian's list are not on the SPDX list, e.g., the Spanish versions of CC licenses. How do we handle these?

for a start, I suggest we stick to the SPDX list.

As for my question 2 above:

  2. What do we want to be displayed on the dataset landing page, the string in rightsIdentifier field or the string in the rightsName field?

I guess this is something that can be easily changed, so for a start, I suggest we go for rightsName. Also, according to @pdurbin on a recent Dataverse community call, the core developer team will figure out how to implement this in the code.

As for my question 3 above:

  3. Where do we want to provide the JSON files; as part of the release note (current solution), or in the GDCC GitHub repo, as discussed earlier?

@pdurbin earlier in this thread suggested that the best place for the JSON files is where they already are, in the main repo. I agree. Sorry I didn't remember this suggestion.

As for my question 4 above:

  4. In addition to rightsIdentifier, what information do we need/want to be stored in the Dataverse database?

according to @pdurbin on a recent Dataverse community call, this is something the core developer team will figure out.

In sum, I suggest we start creating PRs:

mreekie commented 1 year ago

Grooming:

This issue was in the New or No Status columns on the Dataverse Global Backlog Board. Those columns have been removed. This issue has been removed from the board.

This is NOT a reflection of the priority of these issues. However, as we have worked on the grooming process, it became clear that these issues which are in columns that are not stewarded will likely not get prioritized or sized.

To get this item onto the Dataverse Global Backlog Board, please reach out to one of the Stewards on the board. You can also reach out to @mreekie and I can connect you with the steward of the appropriate queue.

pdurbin commented 1 year ago

@philippconzett don't worry, your PR is still on the global backlog board:

pdurbin commented 5 months ago

JP and I just wrote some guidance on adding licenses in the future: https://github.com/IQSS/dataverse/pull/10426#issuecomment-2050108042

Please take a look and let us know what you think!