Open philippconzett opened 2 years ago
FWIW:
Thanks for your comments, @qqmyers. Just a short reply to your first bullet point. I think the way you have done this in v5.10 is already in line with my suggestion; cf.
uri: 1) License URI provided by the license issuer; 2) if (1) is not available, SPDX URI for the license
I think we only should use the SPDX URI when there is no (authoritative) URI provided by the license issuer.
Thank you, @philippconzett, for starting this discussion. This is related to proper future software support (important for our project HERMES), so I'm taking the liberty of joining it.
As 5.10 included the first iteration of multi license support, I think we should be very careful when taking the next steps.
Some context about interoperability: RO-Crate 1.1 uses this JSON-LD schema.org based representation of a license:
{
"@id": "https://creativecommons.org/licenses/by/4.0/",
"@type": "CreativeWork",
"name": "CC BY 4.0",
"description": "Creative Commons Attribution 4.0 International License"
}
@qqmyers removed the generation of this JSON-LD part from the code and replaced it with the URL only (which is perfectly valid schema.org syntax). The RO-Crate `description` field is our `shortDescription`, RO-Crate `name` stays `name`, etc. (We might even use our `iconUrl` for schema.org/RO-Crate `thumbnailUrl`.)
@qqmyers: looking at https://github.com/spdx/license-list-data/blob/master/json/licenses.json, there are licenses marked as deprecated - maybe we need to open an issue at https://github.com/spdx/license-list-XML and talk to them about PDDC being deprecated by upstream (there isn't an issue for this yet).
- `licenseId` field: removing the hyphens to create our/schema.org `name` field looks fine.
- `name` field: using it to create our `shortDescription` / schema.org `description` sounds fine, too. I would not customize this within the data model, to be more consistent in UI and exporting places.
- The `seeAlso` field for a license may provide a good candidate for our `uri`. Using the SPDX `reference` without the `.html` as fallback sounds good. I see this SPDX ID URL is used in RO-Crate, too. (See this examples dataset)

(Future: IMHO it would be great to have a summary in our UI, so people don't need to look at license texts. Maybe grabbing the quick summaries from https://tldrlegal.com helps?)
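The field mapping discussed in the bullets above could be sketched roughly as follows. This is only an illustration, not the Dataverse implementation: the function name `spdx_to_dataverse_license` and the abridged sample entry are made up for this example, and replacing every hyphen with a space is a naive reading of "removing the hyphens" (it yields `CC BY NC 4.0` rather than the Creative Commons spelling `CC BY-NC 4.0`).

```python
def spdx_to_dataverse_license(entry: dict) -> dict:
    """Map one entry of the SPDX licenses.json list to Dataverse-style fields.

    Hypothetical helper for illustration; the target field names follow the
    discussion above (name, shortDescription, uri).
    """
    see_also = entry.get("seeAlso") or []
    reference = entry.get("reference", "")
    # Fallback URI: the SPDX reference without its ".html" suffix.
    if reference.endswith(".html"):
        fallback_uri = reference[: -len(".html")]
    else:
        fallback_uri = reference
    return {
        # Naive: every hyphen in the SPDX identifier becomes a space.
        "name": entry["licenseId"].replace("-", " "),
        "shortDescription": entry["name"],
        # Prefer the first seeAlso URL, otherwise fall back to the SPDX URL.
        "uri": see_also[0] if see_also else fallback_uri,
        # Surfaced unchanged so callers can decide how to treat deprecated IDs.
        "deprecated": entry.get("isDeprecatedLicenseId", False),
    }


# Abridged, illustrative entry in the shape used by licenses.json.
sample = {
    "reference": "https://spdx.org/licenses/CC-BY-4.0.html",
    "isDeprecatedLicenseId": False,
    "name": "Creative Commons Attribution 4.0 International",
    "licenseId": "CC-BY-4.0",
    "seeAlso": ["https://creativecommons.org/licenses/by/4.0/legalcode"],
}

mapped = spdx_to_dataverse_license(sample)
no_see_also = spdx_to_dataverse_license({**sample, "seeAlso": []})
print(mapped["name"], mapped["uri"])
print(no_see_also["uri"])
```

Note that the first `seeAlso` URL is not necessarily the URI the license issuer considers authoritative, so a curated override may still be needed for some licenses.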
Thanks for your feedback, @poikilotherm! I wasn't aware that RO-Crate had already addressed this issue. My main concern was just to make sure that standard licenses are described in the same way across Dataverse installations.
When I mentioned that my suggestion was meant to improve interoperability between Dataverse installations and beyond Dataverse installations, I first of all had in mind that license information from Dataverse installations should be made harvestable in a way that complies with recommendations. I'm not sure about the status of RO-Crate, but a standard that is already implemented and widely used is the DataCite Metadata Schema. The current version of this schema, v.4.4 (cf. https://schema.datacite.org/meta/kernel-4.4/), says the following about license information:
ID | DataCite-Property | Occ | Definition | Allowed values, examples, other constraints |
---|---|---|---|---|
16 | Rights | 0-n | Any rights information for this resource. The property may be repeated to record complex rights characteristics. | Free text *** Provide a rights management statement for the resource or reference a service providing such information. Include embargo information if applicable. Use the complete title of a license and include version information if applicable. May be used for software licenses. Examples: Creative Commons Attribution 3.0 Germany License; Apache License, Version 2.0 |
16.a | rightsURI | 0-1 | The URI of the license. | Example: https://creativecommons.org/licenses/by/3.0/de/ |
16.b | rightsIdentifier | 0-1 | A short, standardized version of the license name. | Example: CC-BY-3.0. A list of identifiers for commonly-used licenses may be found here: (https://spdx.org/licenses/). |
16.c | rightsIdentifierScheme | 0-1 | The name of the scheme. | Example: SPDX |
16.d | schemeURI | 0-1 | The URI of the rightsIdentifierScheme. | Example: https://spdx.org/licenses/ |
As the license identifier, DataCite asks for "a short, standardized version of the license name", and they suggest using the SPDX identifier.
Based on the DataCite recommendations, I've updated the Google spreadsheet (see tab "English v.0.2") and the JSON files for the standard licenses I suggest we should provide on GitHub; see this Google folder.
As far as I can see, none of the standard licenses I suggest we should provide on GitHub are obsolete, so this shouldn't be a show stopper. Also pinging @janvanmansum for feedback.
Here are two JSON examples created following the suggested workflow above:
{
"rightsName": "CC0 1.0",
"rightsURI": "https://creativecommons.org/publicdomain/zero/1.0/",
"rightsIdentifier": "CC0-1.0",
"rightsIdentifierScheme": "SPDX",
"schemeURI": "https://spdx.org/licenses/",
"rightsShortDescription": "Creative Commons Zero v1.0 Universal.",
"rightsIconUrl": "https://licensebuttons.net/p/zero/1.0/88x31.png",
"rightsActive": true
}
{
"rightsName": "CC BY 4.0",
"rightsURI": "https://creativecommons.org/licenses/by/4.0/",
"rightsIdentifier": "CC-BY-4.0",
"rightsIdentifierScheme": "SPDX",
"schemeURI": "https://spdx.org/licenses/",
"rightsShortDescription": "Creative Commons Attribution 4.0 International.",
"rightsIconUrl": "https://licensebuttons.net/l/by/4.0/88x31.png",
"rightsActive": true
}
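As a rough sketch of how such files could be generated and checked consistently, the snippet below assembles a record with the field names used in the two examples and reports which DataCite-aligned fields are missing. The helper names `make_license_record` and `missing_fields`, and the choice of which fields count as required, are assumptions made for this illustration only.

```python
import json

SPDX_SCHEME_URI = "https://spdx.org/licenses/"

# Fields treated as mandatory in this sketch (an assumption, not a spec).
REQUIRED_FIELDS = (
    "rightsName",
    "rightsURI",
    "rightsIdentifier",
    "rightsIdentifierScheme",
    "schemeURI",
)


def make_license_record(name, uri, spdx_id, short_description,
                        icon_url=None, active=True):
    """Build a license record using the field names from the examples above."""
    record = {
        "rightsName": name,
        "rightsURI": uri,
        "rightsIdentifier": spdx_id,
        "rightsIdentifierScheme": "SPDX",
        "schemeURI": SPDX_SCHEME_URI,
        "rightsShortDescription": short_description,
    }
    if icon_url:
        record["rightsIconUrl"] = icon_url
    record["rightsActive"] = active
    return record


def missing_fields(record):
    """Return the required field names that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


cc0 = make_license_record(
    name="CC0 1.0",
    uri="https://creativecommons.org/publicdomain/zero/1.0/",
    spdx_id="CC0-1.0",
    short_description="Creative Commons Zero v1.0 Universal.",
    icon_url="https://licensebuttons.net/p/zero/1.0/88x31.png",
)
print(json.dumps(cc0, indent=2))
print(missing_fields(cc0))
```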
@qqmyers @pdurbin I guess we might have to change back some of the field names so that this doesn't mess up your current setup, e.g., rightsName >> name?
I don't know what needs to be done to discuss this further, but I'd be happy to contribute as suggested above. For example, if you create a suitable place on GitHub, I could create and upload the JSON files once we've agreed on what they should look like. Thanks!
> I don't know what needs to be done to discuss this further
@philippconzett I'm not sure either. Perhaps we can try to make the problem more concrete with a scenario and a screenshot.
Imagine a future where you're harvesting datasets from another Dataverse installation with slightly different names. Also imagine that there's a search facet called "License" that makes these differences obvious at a glance:
Once the data is in a facet like this, it's obvious that there's a problem, that counts of the same license should be combined.
Thanks, @pdurbin, and sorry for my late reply.
The scenario you described above is definitely an example of what might be an undesired result of the current way of configuring standard licenses. A similar situation could arise in search engines supporting search/filtering based on license information, e.g., in the advanced search of BASE (https://www.base-search.net/Search/Advanced); cf. this mock-up screenshot:
In general, I think we should aim at providing license information along the recommendations of DataCite.
I'd be happy to create a pull request, but I need some help:
I suggest we make this a prioritized PR because the longer we wait, the more likely it becomes that installations configure multi-license support with the current set-up, which means that they would have to do some clean up to change the license information to be aligned with the standardized way suggested in this issue.
@philippconzett thanks. If the goal is to keep the Dataverse community together perhaps the best place for the JSON files is where they already are, in the main repo. That way, they seem more official, they can be part of the guides, and if the JSON structure needs to evolve (new fields/columns like you say), it can happen in the same pull request as the code and database changes.
I guess what I'm saying is, what if we consider the licenses in the main repo official already? And if we don't like something about them (they need more or different fields), what if we let them evolve in the main repo, at least for a while?
There are currently 453 licenses in your spreadsheet. If we were to start adding more licenses to the main repo, would you want all of them at once? (Do you plan to present all 453 to your users?) A subset? How many? Thanks. For others, here's a link to your spreadsheet: https://docs.google.com/spreadsheets/d/1f_-z6vWijOvIc0tI1ezWeDEgM3U9w5qynllfyNqWYU8/edit?usp=sharing
Thanks, Phil!
Keeping the JSON files in the main repo sounds reasonable.
As for the number of licenses/JSON files, I suggest starting with only a small selection, as described above; see point 4 in the first posting. These 28 licenses are all marked with "true" in column M (=active) in the spreadsheet. I have now sorted the spreadsheet to make them appear on top. The JSON files of these licenses are in the folder "JSON files v.0.2" in the shared Google folder: https://drive.google.com/drive/folders/11BF5tZ9K_S0rxrWErFQYgSCX_geQtHtq?usp=sharing.
Thanks for pinging me @philippconzett. This issue reminds me of that "things, not strings" saying, which I think is usually used when talking about knowledge graphs, but it makes sense here. I think your idea in this issue will improve the chances that most Dataverse installations will use the same strings to describe the same things.
I'm less sure it would improve interoperability "beyond Dataverse installations". What if, when a Dataverse repository that prefers displaying a "CC-0" license as "CC 0" harvests metadata from a source that uses "CC0", the Dataverse software could figure out that "CC0" is the same thing as "CC-0" and use that when displaying search results (like as facets)? Since the Dataverse software doesn't have facets for the Terms metadata, this problem isn't as noticeable now, so maybe we can cross that bridge when we get to it.
Hi all! I hope everyone is doing well.
I noted a similar problem in a different community, and just as a point of information it may be interesting to follow how they solve it: https://github.com/huggingface/datasets/issues/4298
Thanks, @jggautier + @djbrooke!
@jggautier I'm not sure I agree with you on interoperability beyond Dataverse installations. In my understanding, the main point with the DataCite Metadata Schema recommendations is to make harvested metadata interoperable. Of course, Dataverse, Dataverse installations or DataCite could create crosswalks/scripts to transform the exposed metadata into the desired DataCite format, but why not make the metadata available in a DataCite-aligned way to start with?
I now realize that starting a discussion like this on GitHub is not a good idea, as only a few people in the community systematically review GitHub issues. I'll raise the issue in the Dataverse Google group, because I think DataCite-aligned metadata is important for many Dataverse installations. Thanks!
Please note, as I recently learned, that the DataCite metadata export exposed via OAI-PMH is not valid XML. The export also uses an outdated schema and only a subset of the schema's possibilities (an example is #7077).
I agree with you we should discuss this somewhere else to include more people's views.
I've raised the issue in the Dataverse Google group: https://groups.google.com/u/1/g/dataverse-community/c/4qSr0mkcyOw.
I'm adding another illustration of why this feature request should be prioritized: Metadata from Dataverse-based repositories are currently not correctly harvested by DataCite. This includes the license information. If you look at a DataCite metadata record from, let's say, Pangaea, e.g., https://search.datacite.org/works/10.1594/pangaea.940188, you can download the metadata in different formats, and you'll find correct license information:
"rightsList": [
{
"rights": "Creative Commons Attribution 4.0 International",
"rightsUri": "https://creativecommons.org/licenses/by/4.0/legalcode",
"schemeUri": "https://spdx.org/licenses/",
"rightsIdentifier": "cc-by-4.0",
"rightsIdentifierScheme": "SPDX"
}
]
Based on this license information, the metadata are then harvested and indexed in other discovery services, e.g., Primo (see this discussion thread in the Dataverse Google group).
On the other hand, Dataverse-based repositories do not expose license information in the way DataCite expects, and thus the DataCite metadata records from Dataverse-based repositories are lacking license information. Here's an example from DataverseNO, and here's one from DataverseNL (@janvanmansum @4tikhonov), here one from the Australian Data Archive (@stevenmce), here one from Harvard Dataverse (@pdurbin @jggautier), here one from Jülich DATA (@poikilotherm), here one from Odum (@donsizemore), and here one from Scholars Portal (@amberleahey @kaitlinnewson @meghangoodchild). As you see (cf. the DataCite JSON file), the `rightsList` is empty:
"rightsList": [],
As a result, if you search for data in Dataverse-based repositories in discovery services like Primo, you'll be told that you cannot access these datasets. The reason is that these services don't have access to the license information of these datasets and assume they are not Open Access.
Dataverse does not send any rights information to DataCite - I believe it is the same as the datacite.xml metadata export. If we sent what we have now, it would be an improvement.
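To make the gap concrete, here is a minimal sketch of turning a license record with the field names suggested earlier in this thread into one `rightsList` entry shaped like the Pangaea example. The function `to_datacite_rights` is hypothetical, and using the short description (minus its trailing period) as the `rights` text is an assumption made for illustration; it does not describe how Dataverse currently builds its DataCite export.

```python
def to_datacite_rights(record: dict) -> dict:
    """Convert a suggested license record into a DataCite-style rights entry.

    Hypothetical helper; the key names on the right-hand side follow the
    JSON shown in the Pangaea example above.
    """
    return {
        # Assumption: the short description carries the full license title.
        "rights": record["rightsShortDescription"].rstrip("."),
        "rightsUri": record["rightsURI"],
        "schemeUri": record["schemeURI"],
        "rightsIdentifier": record["rightsIdentifier"],
        "rightsIdentifierScheme": record["rightsIdentifierScheme"],
    }


cc_by = {
    "rightsName": "CC BY 4.0",
    "rightsURI": "https://creativecommons.org/licenses/by/4.0/",
    "rightsIdentifier": "CC-BY-4.0",
    "rightsIdentifierScheme": "SPDX",
    "schemeURI": "https://spdx.org/licenses/",
    "rightsShortDescription": "Creative Commons Attribution 4.0 International.",
}

rights_list = [to_datacite_rights(cc_by)]
print(rights_list)
```

The Pangaea record shows the identifier in lower case (`cc-by-4.0`); this sketch leaves the SPDX casing untouched.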
The part about delivering rights metadata to DataCite is related to issue #5889.
This was a topic on the agenda for the Dataverse Metadata Interest Group meeting the community had on Oct 6, 2022, but we didn't have time to discuss it. I'm hoping we can discuss it in follow-up meetings.
The curation team at Harvard's repo is looking into adding more licenses to the Harvard repo (https://github.com/IQSS/dataverse.harvard.edu/issues/193). Some of the team's research involves exploring what licenses are already being used by other repositories that have the Dataverse software's v5.10 multiple license update, and I think it would be helpful to share that data here.
For as many of the repositories on the Dataverse map where I could get the API endpoint for returning license information to work, I collected and organized the information about those installations' licenses into a publicly viewable Google Sheet.
Great work, Julian! It looks like the duplicates so far (from the "Data" sheet) have to do with having hyphens (`-`) or spaces (` `) in the "CC" licenses:
num name
16 CC BY 4.0
5 CC-BY-4.0
8 CC BY-ND 4.0
3 CC-BY-ND-4.0
10 CC BY-NC 4.0
4 CC-BY-NC-4.0
12 CC BY-SA 4.0
4 CC-BY-SA-4.0
9 CC BY-NC-SA 4.0
4 CC-BY-NC-SA-4.0
8 CC BY-NC-ND 4.0
4 CC-BY-NC-ND-4.0
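As a rough illustration of how these duplicates could be merged, for example before feeding a "License" facet, the sketch below folds spaces and hyphens into a single hyphenated key and sums the counts from the table above. The function name `canonical_id` and the normalization rule are assumptions for this example; the rule only covers the spelling differences seen here.

```python
import re
from collections import Counter

# (count, name) pairs copied from the table above.
observed = [
    (16, "CC BY 4.0"), (5, "CC-BY-4.0"),
    (8, "CC BY-ND 4.0"), (3, "CC-BY-ND-4.0"),
    (10, "CC BY-NC 4.0"), (4, "CC-BY-NC-4.0"),
    (12, "CC BY-SA 4.0"), (4, "CC-BY-SA-4.0"),
    (9, "CC BY-NC-SA 4.0"), (4, "CC-BY-NC-SA-4.0"),
    (8, "CC BY-NC-ND 4.0"), (4, "CC-BY-NC-ND-4.0"),
]


def canonical_id(name: str) -> str:
    """Collapse any run of spaces and hyphens into one hyphen."""
    return re.sub(r"[\s-]+", "-", name.strip())


combined = Counter()
for count, name in observed:
    combined[canonical_id(name)] += count

for license_id, total in combined.items():
    print(total, license_id)
```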
I believe the version with hyphens is the standardized version using the SPDX list. This is also the list DataCite recommends in its metadata model for the rightsIdentifier:
A short, standardized version of the license name Example: CC-BY-3.0 A list of identifiers for commonly-used licenses may be found here: (https://spdx.org/licenses/). (https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf)
Thanks for this overview, @jggautier! @DieuwertjeBloemen: I also think we should stick to the DataCite/SPDX recommendation, i.e., use hyphens. Or, if we prefer having no hyphens or fewer hyphens, we could have two fields: an identifier field with hyphens according to SPDX, and a name field with as many or as few hyphens as we want.
> I also think we should stick to the DataCite/SPDX recommendation, i.e., use hyphens.

Uh oh. We ship with the "no hyphens" version (which explains the higher counts for that version in the wild):
$ ack "CC.*BY" scripts/api/data/licenses
scripts/api/data/licenses/licenseCC-BY-NC-ND-4.0.json
2: "name": "CC BY-NC-ND 4.0",
scripts/api/data/licenses/licenseCC-BY-NC-SA-4.0.json
2: "name": "CC BY-NC-SA 4.0",
scripts/api/data/licenses/licenseCC-BY-4.0.json
2: "name": "CC BY 4.0",
scripts/api/data/licenses/licenseCC-BY-NC-4.0.json
2: "name": "CC BY-NC 4.0",
scripts/api/data/licenses/licenseCC-BY-SA-4.0.json
2: "name": "CC BY-SA 4.0",
scripts/api/data/licenses/licenseCC-BY-ND-4.0.json
2: "name": "CC BY-ND 4.0",
We use that version because that's what Creative Commons uses.
@philippconzett personally, I find the Creative Commons licenses more useful in science and in the early stages of entrepreneurial development, in particular the share-alike non-commercial one:
" Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)"
https://creativecommons.org/licenses/by-nc-sa/3.0/
For data, for software, and also for content writing.
In short, it promotes an author's work publicly faster than any other license while ensuring author traceability.
It makes sense for the Creative Commons licenses to use their standard if other licenses couldn't also be used in Dataverse and if DataCite didn't ask for another standard. Especially now that computational workflows and other code/software type data will be published, I think the standard should be used that is applicable for both CC license as well as the other types of licenses available. And that would be the SPDX standard as stated by DataCite. So, in my opinion, it should be the all-hyphenated version for all types of licenses.
@aeonSolutions: we're not discussing the inclusion or exclusion of CC licenses, just how their "rightsIdentifier", a.k.a. their "name", should be provided in Dataverse and the metadata exports.
On November 10, we had a Dataverse IG/WG meeting about this feature request. I understood that the main takeaway/agreement from that meeting was as follows:
a) The metadata delivered from Dataverse to DataCite should be in line with their recommendation, i.e. the rightsIdentifier field in the JSON file should be identical with the corresponding field on the SPDX license list. In addition, the JSON file should contain the other recommended fields, i.e. rightsURI, rightsIdentifierScheme, and schemeURI.
b) The JSON files provided with new versions of Dataverse will be adapted accordingly.
c) For those installations that already have been using multiple license support, SQL scripts will be provided to update existing license information to be aligned with the new setup.
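Step (c) might look roughly like the sketch below. It is only a sketch under loud assumptions: it runs against an in-memory SQLite database so that it is self-contained (Dataverse itself uses PostgreSQL), and the table `license`, the column `rights_identifier`, and the `NAME_TO_SPDX` mapping are hypothetical stand-ins, not the real Dataverse schema or an agreed migration.

```python
import sqlite3

# Hypothetical mapping from currently shipped names to SPDX identifiers.
NAME_TO_SPDX = {
    "CC0 1.0": "CC0-1.0",
    "CC BY 4.0": "CC-BY-4.0",
    "CC BY-NC 4.0": "CC-BY-NC-4.0",
}

conn = sqlite3.connect(":memory:")
# Stand-in table; the real Dataverse schema may differ.
conn.execute(
    "CREATE TABLE license ("
    "id INTEGER PRIMARY KEY, name TEXT, rights_identifier TEXT)"
)
conn.executemany(
    "INSERT INTO license (name) VALUES (?)",
    [("CC BY 4.0",), ("CC BY-NC 4.0",), ("Custom Terms",)],
)

# Fill in the SPDX identifier only where none has been set yet.
for name, spdx_id in NAME_TO_SPDX.items():
    conn.execute(
        "UPDATE license SET rights_identifier = ? "
        "WHERE name = ? AND rights_identifier IS NULL",
        (spdx_id, name),
    )
conn.commit()

rows = conn.execute(
    "SELECT name, rights_identifier FROM license ORDER BY id"
).fetchall()
print(rows)
```

Licenses without a known SPDX counterpart, such as the custom entry here, are left untouched so an administrator can review them by hand.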
I have updated the Google spreadsheet (see tab "English v.0.3" in file "Dataverse_Standard_Licenses") and the JSON files for the standard licenses I suggest we should provide; see this Google folder. The update is based on Julian's list of licenses that are currently used at Dataverse installations (see file "Licenses_used_by_Dataverse_installations_2022-11-15"). All licenses included on Julian's list (+ 2 licenses that DataverseNO is going to use) which are also found on the SPDX list are marked with "Y" in column H ("Dataverse").
Here are some issues we may want to discuss before we run steps a-c above:
I was asked in the Nov. 10 metadata IG/WG meeting to provide info about the use of licenses in Dataverse installations (as opposed to info about the standard licenses that each installation makes available).
It's too much info to provide in a Google Sheet (too many rows and columns), so I'm adding it in a zip file here: licenseAndTermsMetdaataInDataverseInstallations.zip. This metadata was collected in early October 2022.
It might be helpful if we want to talk with depositors who've used specific licenses that installations have made available.
As for my question 1 above:
- Some of the licenses on Julian's list are not on the SPDX list, e.g., the Spanish versions of CC licenses. How do we handle these?
For a start, I suggest we stick to the SPDX list.
As for my question 2 above:
- What do we want to be displayed on the dataset landing page, the string in rightsIdentifier field or the string in the rightsName field?
I guess this is something that can be easily changed, so for a start, I suggest we go for rightsName. Also, according to @pdurbin on a recent Dataverse community call, the core developer team will figure out how to implement this in the code.
As for my question 3 above:
- Where do we want to provide the JSON files; as part of the release note (current solution), or in the GDCC GitHub repo, as discussed earlier?
@pdurbin earlier in this thread suggested that the best place for the JSON files is where they already are, in the main repo. I agree. Sorry I didn't remember this suggestion.
As for my question 4 above:
According to @pdurbin on a recent Dataverse community call, this is something the core developer team will figure out.
In sum, I suggest we start creating PRs:
Grooming:
This issue was in the New or No Status columns on the Dataverse Global Backlog Board. Those columns have been removed. This issue has been removed from the board.
This is NOT a reflection of the priority of these issues. However, as we have worked on the grooming process, it became clear that these issues which are in columns that are not stewarded will likely not get prioritized or sized.
To get this item onto the Dataverse Global Backlog Board, please reach out to one of the Stewards on the board. You can also reach out to @mreekie and I can connect you with the steward of the appropriate queue
@philippconzett don't worry, your PR is still on the global backlog board:
JP and I just wrote some guidance on adding licenses in the future: https://github.com/IQSS/dataverse/pull/10426#issuecomment-2050108042
Please take a look and let us know what you think!
Overview of the Feature Request With version 5.10, the long-awaited multiple-license support was released (see release notes). Thanks to all contributors! To better support interoperability between Dataverse installations and beyond Dataverse installations, I'd like to suggest standardizing the way standard license configuration is managed using multiple-license support as follows:
Content:
Code - Permissive:
Code - Copyleft:
Code - Other:
Following the suggested guidelines above, I have created a Google spreadsheet containing the necessary information to create JSON files, and I created those files by running a bash file. All these documents are available in this Google folder (you might need to log in to access it).
At a later stage, this could of course be automated by retrieving information directly from SPDX and license issuers, possibly via a controlled vocabulary hosted on SKOSMOS.
What kind of user is the feature intended for? The suggested feature is primarily intended for Sysadmins who need to install licenses on their Dataverse installation.
What inspired the request? The implementation of multiple license support released in v5.10.
What existing behavior do you want changed? The different Dataverse installations adding the same standard license with (slightly) different license information.
Any brand new behavior do you want to add to Dataverse? No, thanks.
Any related open or closed issues to this feature request? Multiple licences feature proposal #7440