Updates API to support version 1.1 of GBIF Metadata Profile

kbraak commented 8 years ago

Implements the changes outlined in POR-2562.

Worthy of special attention is the addition of a License enumeration, used to add a license property to the Dataset object. The License enumeration is restricted to the three CC licenses that GBIF currently supports. GBIF will have to support newer versions of these licenses when they become available. It is important that GBIF differentiate between license/waiver versions because there are significant changes between them. To start, however, GBIF's recommendation (based on CC's recommendation) is to support only the latest version of each license: CC0 v1.0, CC-BY v4.0, and CC-BY-NC v4.0.
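For readers following along, here is a minimal sketch of what such an enumeration could look like. The constant names, fields, and getters are assumptions made for illustration and are not taken from the pull request itself; the URLs are the canonical Creative Commons legal-code addresses.

```java
/** Hypothetical sketch of a License enumeration limited to the three licenses named above. */
enum License {
  CC0_1_0("CC0", "1.0", "http://creativecommons.org/publicdomain/zero/1.0/legalcode"),
  CC_BY_4_0("CC-BY", "4.0", "http://creativecommons.org/licenses/by/4.0/legalcode"),
  CC_BY_NC_4_0("CC-BY-NC", "4.0", "http://creativecommons.org/licenses/by-nc/4.0/legalcode");

  private final String acronym;
  private final String version;
  private final String url;

  License(String acronym, String version, String url) {
    this.acronym = acronym;
    this.version = version;
    this.url = url;
  }

  String getAcronym() {
    return acronym;
  }

  // Keeping the version as its own field makes it explicit that,
  // for example, CC-BY 3.0 and CC-BY 4.0 are different licenses.
  String getVersion() {
    return version;
  }

  String getUrl() {
    return url;
  }
}
```

Carrying the version on each constant means a later version of a license would be added as a new constant rather than silently replacing an existing one.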

A free-text rights statement (Dataset.rights property) can still be provided by publishers not wanting to adopt a GBIF-supported license. Datasets without a GBIF-supported license will not be indexed; however, they may remain discoverable on http://www.gbif.org as metadata-only datasets.

Not included in this pull request, but something we might like to add, is a new property named projectIdentifier on the Dataset model object. This would enable searching datasets by project identifier; see POR-3129.

Thank you, everyone, for reviewing the proposed changes and the recommendation above.

timrobertson100 commented 8 years ago

I suggest being really sure we need to have the GBIF governance roles (DELEGATE etc) in here.

The Directory API is the authority for that kind of information, so we should consider whether we could just use that (the new portal will). If we do need it, I suggest implementing it as a single vocabulary (enum or whatever) for consistency.

Otherwise, looks good with minor comments.

kbraak commented 8 years ago

@timrobertson100 @cgendreau

Thank you both for your review of this pull request. A few additional notes:

Speaking with Christian, we agreed that separating out the Directory roles from our ContactType enumeration would be quite invasive, requiring a lot of changes in the current portal and registry. It is proposed to postpone this change for now.

I added POR-3133 to try and scope out registry changes related to ensuring all datasets have a GBIF-supported license applied to them.

One unanswered question is how to handle the registration of datasets without a GBIF-supported license. Assuming Dataset.license is a mandatory field, I propose adding UNKNOWN to the License enumeration. We won't index datasets with an UNKNOWN license; however, it will ensure that they remain discoverable via gbif.org in metadata-only format with their verbatim rights statement shown.

Another unanswered question is how to handle datasets that have been firmly assigned a license GBIF does not support. Should we add NON_GBIF_SUPPORTED to the License enumeration? We won't index datasets with a NON_GBIF_SUPPORTED license; however, it will ensure that the GBIF Data Manager knows these datasets do not require any follow-up.

Lastly, anticipating that POR-3138 may also require an API change, I would like to try and implement this enhancement at the same time as POR-2562. Depending on the deadline for licensing work (end of July?), I will kindly ask for some assistance with implementation since I'm on vacation for the next couple of weeks. Thanks.

timrobertson100 commented 7 years ago

Thanks Kyle.

I'd suggest using UNSPECIFIED and UNSUPPORTED as names, and neither should be indexed as you say.
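With the names suggested here, the indexing rule could be sketched as follows. The isIndexable() helper is hypothetical and only illustrates the "neither should be indexed" rule; it is not claimed to exist in gbif-api.

```java
/** Hypothetical sketch: the three supported licenses plus the two placeholder values. */
enum License {
  CC0_1_0,
  CC_BY_4_0,
  CC_BY_NC_4_0,
  UNSPECIFIED, // no license has been determined yet, so the dataset needs follow-up
  UNSUPPORTED; // a license was firmly assigned, but it is not one GBIF supports

  /** Only datasets carrying a concrete, supported license are indexed. */
  boolean isIndexable() {
    return this != UNSPECIFIED && this != UNSUPPORTED;
  }
}
```

Datasets with either placeholder value would remain registered and discoverable as metadata-only records; the two values differ only in whether anyone still needs to chase up a license.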

kbraak commented 7 years ago

@timrobertson100 @cgendreau

I have now addressed your feedback on this pull request by pushing a new set of changes to this branch. The branch has been rebased onto the latest changes in master and is ready to merge.

You can see the latest changes I made by clicking on "view changes" above.

Also, please note that POR-3133 has been updated following our recent discussions. This Jira issue outlines the registry changes needed to ensure that all datasets are assigned a license. Your review of these specifications would also be greatly appreciated. Thanks.

timrobertson100 commented 7 years ago

It looks good to me.

I think it is highly unlikely that people will use only the License URL, even with the best documentation in the world. I expect we will need a more lenient Optional<License> infer(String text) method which attempts to parse whatever formats are given. We will surely see CC0, CC-BY etc., which we should handle correctly. A CC-BY v3.0 of course would not be assumed to be v4, but a bare CC-BY, for example, should be handled, shouldn't it? Would it not be best to have that in the License class so that it is in one place only?

Similarly, I think the IPT puts out the field in ...<ulink>...</ulink> format. Would it make sense to parse that in the proposed infer() method too? If not, we'll end up doing that in various places (registry code and crawler code and likely other places too).
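A rough sketch of what such a lenient infer() could look like is given below. Everything in it is an assumption made for illustration: the class name, the use of java.util.Optional rather than any other Optional type, the regular expressions, and in particular the guess that the ulink markup carries the license address in a url attribute.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

enum License { CC0_1_0, CC_BY_4_0, CC_BY_NC_4_0 }

/** Hypothetical helper; not the actual gbif-api or parsers implementation. */
final class LicenseInference {

  // Assumption: the license address sits in a url attribute, e.g. <ulink url="...">.
  private static final Pattern ULINK_URL =
      Pattern.compile("<ulink\\s+url=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

  // Upper-cased acronym with an optional version, e.g. "CC-BY", "CC BY-NC 4.0", "CC0 V1.0".
  private static final Pattern ACRONYM =
      Pattern.compile("^CC[ -]?(0|BY[ -]NC|BY)(?:[ -]?V?(\\d+(?:\\.\\d+)?))?$");

  static Optional<License> infer(String text) {
    if (text == null) {
      return Optional.empty();
    }
    String value = text.trim();
    Matcher ulink = ULINK_URL.matcher(value);
    if (ulink.find()) {
      value = ulink.group(1).trim(); // keep only the linked address
    }
    if (value.toLowerCase(Locale.ROOT).startsWith("http")) {
      return fromUrl(value);
    }
    Matcher m = ACRONYM.matcher(value.toUpperCase(Locale.ROOT));
    if (!m.matches()) {
      return Optional.empty();
    }
    String kind = m.group(1).replace(' ', '-');
    License candidate;
    String supported;
    if (kind.equals("0")) {
      candidate = License.CC0_1_0;
      supported = "1.0";
    } else if (kind.equals("BY")) {
      candidate = License.CC_BY_4_0;
      supported = "4.0";
    } else {
      candidate = License.CC_BY_NC_4_0;
      supported = "4.0";
    }
    String version = m.group(2);
    if (version != null && version.indexOf('.') < 0) {
      version = version + ".0"; // treat "4" as "4.0"
    }
    // A bare acronym maps to the supported version; an explicit other version does not.
    return (version == null || version.equals(supported))
        ? Optional.of(candidate)
        : Optional.<License>empty();
  }

  private static Optional<License> fromUrl(String url) {
    String key = url.toLowerCase(Locale.ROOT)
        .replaceFirst("^https?://(www\\.)?", "")
        .replaceFirst("/legalcode/?$", "")
        .replaceFirst("/$", "");
    switch (key) {
      case "creativecommons.org/publicdomain/zero/1.0":
        return Optional.of(License.CC0_1_0);
      case "creativecommons.org/licenses/by/4.0":
        return Optional.of(License.CC_BY_4_0);
      case "creativecommons.org/licenses/by-nc/4.0":
        return Optional.of(License.CC_BY_NC_4_0);
      default:
        return Optional.empty();
    }
  }
}
```

In this sketch, infer("CC-BY") resolves to CC_BY_4_0 while infer("CC-BY v3.0") returns an empty Optional, which matches the distinction drawn above between a bare acronym and an explicitly older version.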

cgendreau commented 7 years ago

Wouldn't it make sense to create a license parser in the parsers project?

mdoering commented 7 years ago

I would also prefer a new parser as we do for all other enumerations.

kbraak commented 7 years ago

Thanks everybody for your feedback.

Machine readable licenses can only be specified in:

Metadata document using GBIF Metadata Profile v1.1 (as part of DwC-A, or by itself)
ABCD 2.06 (e.g. in BioCASE/TAPIR responses)

In both cases, the machine readable license is parsed from a field that must be a URI. Therefore we don't need to worry about parsing simple license acronyms such as CC-BY.

Crawler uses the registry to perform dataset metadata updates, so I believe we can restrict all parsing of machine readable licenses to the registry. For example:

When a DwC-A Dataset is crawled, the EML file is pushed onto the Dataset. The registry dataset service is called, responsible for parsing the EML document and updating the Dataset.
Conversely for DiGIR/BioCASE/TAPIR datasets, the metadata is not updated when the dataset is crawled. The registry metasync service is called whenever the installation is synchronised, responsible for updating the metadata for all datasets served from the installation.

Adding a new parser for licenses sounds good. The registry metadata project already uses parsers for Rank, Country, etc. therefore this is nice and consistent.

As part of the new license parser, its dictionary file will naturally enforce the mappings from license URI to License enum and can thus manage a set of exceptions. An exception is a mapping between a non-CC license and the GBIF-supported CC license we consider it equivalent to.

Note that GBIF has currently promised to maintain the following set of exceptions:

GBIF considers PDDL v1.0 equal to CC0 v1.0
GBIF considers ODC-By v1.0 equal to CC-BY v4.0

Note GBIF does NOT consider ODC-ODbL v1.0 equal to CC-BY-NC v4.0. Therefore this mapping will be set to License.UNSUPPORTED.
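The mappings listed above could be sketched as a simple lookup table. This is only an illustration: an in-memory map stands in for the parser's dictionary file, the class and method names are made up, the Open Data Commons URIs used as keys are my own choice rather than something stated in this thread, and returning UNSPECIFIED for an unrecognised URI is an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

enum License { CC0_1_0, CC_BY_4_0, CC_BY_NC_4_0, UNSPECIFIED, UNSUPPORTED }

/** Hypothetical stand-in for the license parser's dictionary file. */
final class LicenseDictionary {

  private static final Map<String, License> BY_URI = new HashMap<>();

  static {
    // The three supported Creative Commons licenses.
    BY_URI.put("creativecommons.org/publicdomain/zero/1.0", License.CC0_1_0);
    BY_URI.put("creativecommons.org/licenses/by/4.0", License.CC_BY_4_0);
    BY_URI.put("creativecommons.org/licenses/by-nc/4.0", License.CC_BY_NC_4_0);
    // Exceptions: non-CC licenses treated as equivalent to a supported one.
    BY_URI.put("opendatacommons.org/licenses/pddl/1.0", License.CC0_1_0);
    BY_URI.put("opendatacommons.org/licenses/by/1.0", License.CC_BY_4_0);
    // Explicitly not treated as equivalent to CC-BY-NC 4.0.
    BY_URI.put("opendatacommons.org/licenses/odbl/1.0", License.UNSUPPORTED);
  }

  static License lookup(String uri) {
    if (uri == null || uri.trim().isEmpty()) {
      return License.UNSPECIFIED;
    }
    // Ignore scheme, "www.", a trailing "/legalcode" and a trailing slash.
    String key = uri.trim().toLowerCase(Locale.ROOT)
        .replaceFirst("^https?://(www\\.)?", "")
        .replaceFirst("/legalcode/?$", "")
        .replaceFirst("/$", "");
    License license = BY_URI.get(key);
    // Assumption: an unrecognised URI is left for follow-up rather than rejected outright.
    return license == null ? License.UNSPECIFIED : license;
  }
}
```

Keeping the exceptions in one table means a change to the policy (for example, a newly accepted equivalence) would be a one-line change in one place.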

timrobertson100 commented 7 years ago

"In both cases, the machine readable license is parsed from a field that must be a URI. Therefore we don't need to worry about parsing simple license acronyms such as CC-BY."

This is incorrect @kbraak. ABCD 2.06 requires a TEXT format of the license statement, but the URI version is optional. If the URI is missing, which will happen, we need to parse the text.

If someone issues POST /api/v1/dataset without a license, will it default to CC-BY? I would suggest this is sensible, given that the applications would have been written at a time when our sharing agreement was effectively CC-BY. Alternatively, we would have to set it to missing and then not crawl, but this would mean folks like the Ireland group would need to make coding changes.
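The default suggested here could be sketched roughly as below. DatasetStub is a made-up stand-in for the real Dataset model, applyDefaultLicense is a hypothetical helper, and picking version 4.0 for "CC-BY" is an assumption based on the earlier recommendation to support only the latest version of each license.

```java
enum License { CC0_1_0, CC_BY_4_0, CC_BY_NC_4_0, UNSPECIFIED, UNSUPPORTED }

/** Hypothetical minimal stand-in for the Dataset model object. */
final class DatasetStub {
  private License license;

  License getLicense() {
    return license;
  }

  void setLicense(License license) {
    this.license = license;
  }
}

final class DatasetDefaults {

  /** Assigns CC-BY 4.0 only when the incoming dataset carries no license at all. */
  static DatasetStub applyDefaultLicense(DatasetStub dataset) {
    if (dataset.getLicense() == null) {
      dataset.setLicense(License.CC_BY_4_0);
    }
    return dataset;
  }
}
```

A license that was explicitly supplied, including UNSUPPORTED, is left untouched in this sketch, so the default never overrides a publisher's stated choice.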

kbraak commented 7 years ago

Thanks @timrobertson100 - the license parser will then be implemented to accommodate simple license acronyms. Fortunately the BioCASe wiki has just been updated to recommend supplying both the license Text and URI, so in the long run hopefully we always get a license URI.

For new dataset registrations, I think it is sensible to default to CC-BY whenever a license hasn't been supplied yet.

In the case we choose to use this default, I think we should make this a clear provision in the Data Sharing Agreement and make sure all users registering datasets agree to abide by this agreement (whether via the IPT or our API). Information about our default license should also be included in GBIF's Licensing Policy and probably even in the GBIF API documentation.

Furthermore, regarding the Data Sharing Agreement, I also think we should revise the following contradictory provision that allows publishers to circumvent GBIF's licensing policy by supplying a free-text rights statement in their metadata:

Biodiversity data accessible via the GBIF network are openly and universally available to all users ... with the terms and conditions that the Data Publisher has identified in its metadata.

Indeed our Data Use Agreement has a provision that reinforces users' need to respect such free-text rights in the metadata:

Users must comply with additional terms and conditions of use set by the Data Publisher. Where these exist they will be available through the metadata associated with the data.

IPT publishers have always been asked to read and accept the Data Sharing Agreement before proceeding with registering their dataset. I wonder if users of our API (the ones granted write privileges) have been asked to abide by the Data Sharing Agreement? At least when creating a new account with GBIF, users only have to read and accept the Data Use Agreement.

kcopas commented 7 years ago

Not sure whether it will affect the above discussion, but note that both the data sharing and data use agreements will be updated as we implement machine-readable licences. Note, too, that both are also slightly retitled, as data publisher and data user agreements.

Was planning on getting text prepped tomorrow and leaving pages in draft state, but open to suggestions on whether that should wait, based on the various aspects of this thread.



kbraak commented 7 years ago

Thanks for keeping an eye on this discussion @kcopas. Ultimately we want to be able to legitimately assign CC-BY to new datasets without any license, backed by our updated agreements and documentation.

@timrobertson100 @cgendreau @mdoering please note a new license parser has been added in this pull request, open for review.

gbif / gbif-api

Updates API to support version 1.1 of GBIF Metadata Profile #1