cessda / cessda.cdc.versions

Issue tracker and wiki for the CESSDA Data Catalogue
https://datacatalogue.cessda.eu/
Apache License 2.0

PID check #320

Closed cessda-bitbucket-importer closed 1 year ago

cessda-bitbucket-importer commented 3 years ago

Original report on BitBucket by Taina Jääskeläinen.


PID is mandatory. The validator cannot check it because the IDNo is used for two purposes (study number and PID).

So CDC would need to check that:

  1. There is one or more IDNo elements in the record
  2. The agency attribute is used in at least one IDNo element
  3. At least one of the agency attributes has one of the four allowed code values (=ARK, DOI, Handle, URN) from the vocabulary: https://vocabularies.cessda.eu/vocabulary/CessdaPersistentIdentifierTypes?lang=en
  4. IDNo is a valid PID
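The four checks above could be sketched roughly as follows. This is a minimal illustration and not the CDC implementation: it assumes records use the DDI Codebook 2.5 namespace (`ddi:codebook:2_5`), and the `PID_PATTERNS` shape checks and the `check_pid` function are hypothetical and deliberately loose (they do not try to resolve the identifier).

```python
import re
import xml.etree.ElementTree as ET

# Assumption: records use the DDI Codebook 2.5 namespace.
NS = {"ddi": "ddi:codebook:2_5"}
ALLOWED_AGENCIES = {"ARK", "DOI", "Handle", "URN"}

# Hypothetical, loose shape checks per agency; nothing is resolved online.
PID_PATTERNS = {
    "DOI": re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)?10\.\d{4,9}/\S+$", re.I),
    "Handle": re.compile(r"^(https?://hdl\.handle\.net/|hdl:)?[\w.]+/\S+$", re.I),
    "URN": re.compile(r"^urn:[a-z0-9][a-z0-9-]{0,31}:\S+$", re.I),
    "ARK": re.compile(r"^(https?://\S+/)?ark:/?\d{5,}/\S+$", re.I),
}

def check_pid(record_xml: str) -> dict:
    """Report which of the four criteria a single record meets."""
    root = ET.fromstring(record_xml)
    idnos = root.findall(".//ddi:IDNo", NS)                       # criterion 1
    with_agency = [e for e in idnos if e.get("agency")]           # criterion 2
    allowed = [e for e in with_agency
               if e.get("agency") in ALLOWED_AGENCIES]            # criterion 3
    valid = [e for e in allowed
             if PID_PATTERNS[e.get("agency")].match((e.text or "").strip())]  # criterion 4
    return {
        "has_idno": bool(idnos),
        "has_agency": bool(with_agency),
        "has_allowed_agency": bool(allowed),
        "has_valid_pid": bool(valid),
    }
```

A record would meet the requirement only when all four values in the returned dictionary are true.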

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Need to do bulk checks and provide feedback to SPs, as per CMV

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


According to the bulk validation run from 10 October 2021 (results in this GDrive folder), SND, UniData, SODHA and SoDaNet have no validation errors against any of their records.

Example of valid SoDaNet titlStmt:

https://datacatalogue.sodanet.gr/oai?verb=GetRecord&identifier=doi:10.17903/FK2/BAJJV1&metadataPrefix=oai_ddi

Example of a valid SND titlStmt:

https://snd.gu.se/en/oai-pmh?verb=GetRecord&identifier=snd0137&metadataPrefix=oai_ddi25&set=subject:social-sciences

Example of a valid SODHA titlStmt:

https://www.sodha.be/oai?verb=GetRecord&identifier=doi:10.34934/DVN/0RKEGK&metadataPrefix=oai_ddi

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


However, here is a 'valid' Unidata record:

http://oai.unidata.unimib.it/v0/oai?verb=GetRecord&identifier=SN068&metadataPrefix=oai_ddi25

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


In the SoDaNet case, the agency attribute contains one of the allowed values and the IDNo is a valid PID.

In the UniData case, the agency attribute does not contain one of the allowed values and the IDNo is not a valid PID.

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


So CMV is capable of indicating false positives, in regard to PIDs.

It appears that a 2-pass approach is required.

  1. Bulk validation with records being flagged compliant or non-compliant
  2. Check all valid records, and mark as non-compliant any that do not meet all 4 of the criteria in the description of this issue.
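The two-pass approach could look something like the sketch below. The `two_pass` function and the two callables it takes are hypothetical stand-ins: `schema_valid` represents the existing bulk validation and `pid_ok` represents a check of the four criteria; neither is an actual CMV or CDC API.

```python
from typing import Callable, Dict, Iterable

def two_pass(records: Iterable[dict],
             schema_valid: Callable[[dict], bool],
             pid_ok: Callable[[dict], bool]) -> Dict[str, bool]:
    """Map each record id to True (compliant) or False (non-compliant)."""
    records = list(records)
    # Pass 1: bulk validation flags every record compliant or non-compliant.
    status = {r["id"]: schema_valid(r) for r in records}
    # Pass 2: re-check only the records that passed, and downgrade any
    # that fail the PID criteria.
    for r in records:
        if status[r["id"]] and not pid_ok(r):
            status[r["id"]] = False
    return status
```

Records that fail the first pass are never examined by the second, so the PID check can only turn a compliant record into a non-compliant one, never the reverse.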

cessda-bitbucket-importer commented 3 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


A filter could be added to the validator or the indexer to enforce these criteria. A basic check would be to see whether an IDNo element contains a valid URN, which should satisfy criteria 1, 2 and 4.

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Push back to next release

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Re DDI 3.x PID checks

From: Bell, Darren S
Date: Tue, 30 Nov 2021 at 10:37
Subject: RE: [CESSDA] Bulk metadata validation of your OAI-PMH endpoint
To: John Shepherdson
Cc: Beeken, Jeannine C T, Bolton, Sharon E , Taina Jääskeläinen (TAU), Hassan, Steve

Depends which object you’re trying to capture PIDs for, but the dataset will be /ddi:DDIInstance/r:ResourcePackage/pi:PhysicalInstance/r:Citation/r:InternationalIdentifier/r:IdentifierContent
and the study will be /ddi:DDIInstance/s:StudyUnit/r:Citation/r:InternationalIdentifier/r:IdentifierContent. However, as I understand it, EQB is going to be harvesting into Colectica, which will be using 3.3 and the profile for that is not due until Dec 22 under MDO D16, although it will likely be done sooner.

Best, Darren

==========================

Hi both,

I’m guessing that the PID check will be done against the elements that will be used for PID in the final EQB and CDC DDI 3.2 profiles.

Timetable for these was in March 2022, if I remember correctly.

In the draft profile, these elements seem to be

/ddi:DDIInstance/s:StudyUnit/r:Citation/r:InternationalIdentifier/r:IdentifierContent

/ddi:DDIInstance/s:StudyUnit/r:Citation/r:InternationalIdentifier/r:ManagingAgency

Darren, if DDI 3.3 has another element that is particularly for PID and not just an international identifier, then MDO has things to discuss in the DDI 3.2 XPaths to DDI 3.3 XPaths mapping.

Hope I did not misunderstand something (always possible).

All the best,

Taina

==========================

Hi Taina – happily, the 3.3 XPaths are the same as 3.2 in these particular instances.  Best, Darren
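For illustration, the study-level XPaths discussed in this exchange could be read with Python's standard library roughly as below. This is a sketch only: the namespace URIs are assumed to be the DDI 3.2 ones (`ddi:instance:3_2`, `ddi:studyunit:3_2`, `ddi:reusable:3_2`), and `study_identifiers` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Assumption: DDI Lifecycle 3.2 namespace URIs.
NS = {
    "ddi": "ddi:instance:3_2",
    "s": "ddi:studyunit:3_2",
    "r": "ddi:reusable:3_2",
}

def study_identifiers(xml_text: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (ManagingAgency, IdentifierContent) pairs for the study."""
    # The root is expected to be ddi:DDIInstance, so the path is relative to it.
    root = ET.fromstring(xml_text)
    pairs = []
    for ident in root.findall(
            "s:StudyUnit/r:Citation/r:InternationalIdentifier", NS):
        agency = ident.findtext("r:ManagingAgency", namespaces=NS)
        content = ident.findtext("r:IdentifierContent", namespaces=NS)
        pairs.append((agency, content))
    return pairs
```

Either value in a pair is `None` when the corresponding child element is missing, which lets a later check distinguish an absent agency from a disallowed one.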

cessda-bitbucket-importer commented 2 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


I've started working on this.

cessda-bitbucket-importer commented 2 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


[link to pull request removed](link to pull request removed)

A PID check was added to the validator and was merged in this PR.

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Recent PID checking of Progedo records has thrown up a number of false positives, as the rules do not handle the situation where there are 2 IDNo fields being used for different purposes in the same record.

Example records include doi:10.48756/ined-IE0245-2170, doi:10.48756/ined-IE0244-4961, doi:10.48756/ined-IE0237-1486

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


If an IDNo element contains a valid PID (i.e. valid agency plus valid URI), then the record should pass the PID check, regardless of the contents of any other IDNo elements that may be present.
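The rule stated here amounts to an "any" test over the IDNo elements rather than an "all" test. A minimal sketch, assuming each IDNo has already been reduced to an `(agency, value)` pair and that `is_valid_uri` is whatever URI check the validator applies (both names are hypothetical):

```python
from typing import Callable, Iterable, Tuple

ALLOWED_AGENCIES = {"ARK", "DOI", "Handle", "URN"}

def record_passes(idnos: Iterable[Tuple[str, str]],
                  is_valid_uri: Callable[[str, str], bool]) -> bool:
    """Pass if at least one IDNo has an allowed agency and a valid URI.

    Other IDNo elements (for example a local study number) are ignored
    rather than counted as failures.
    """
    return any(
        agency in ALLOWED_AGENCIES and is_valid_uri(agency, value)
        for agency, value in idnos
    )
```

With this rule, a record carrying both a local study number and a DOI passes on the strength of the DOI alone.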

cessda-bitbucket-importer commented 2 years ago

Original comment by Taina Jääskeläinen.


Yes, exactly. Trying to clarify what needs to be checked somewhere under the hood at CDC end. The check focuses on the @agency attribute.

This is the only thing that needs to be checked. For instance, the IDNo provided by UniData copied above does not correspond to requirements since the @agency value is ‘UniData’. And SND has a DOI but uses ‘DataCite’ as the @agency value when it should be ‘DOI’.

The validity of the PID itself is not checked.

An organisation may produce more than one IDNo element, and more than one of them may be a PID, but it is enough to check that they have at least one IDNo element with the correct @agency value.

Hope this helps.

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Not on critical path for v3.0.0 release

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


An example of a false positive from an AUSSDA record:

<IDNo xml:lang="en" agency="DOI">10.11587/AROIHY</IDNo>

The PID error is:
{"valid":false,"invalidPIDs":[{"agency":"DOI","uri":"10.11587/AROIHY","state":["AGENCY_PRESENT","AGENCY_ALLOWED_VALUE"]}]}

According to https://dx.doi.org/ the DOI is valid (resolves to https://data.aussda.at/dataset.xhtml?persistentId=doi:10.11587/AROIHY)

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Note on policy used by Aggregator:

One thing that comes to mind is how you will process input files with multiple IDNo. How do you decide which to keep? Would that be DOI>Handle>URN>ARK>other with first matching if multiple of the same?

Yes, this is the current implementation. The agency attribute's value is used to get the preferred ID.

Preference by priority:

  1. DOI
  2. Handle
  3. URN
  4. ARK
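The preference order above could be expressed as in the following sketch. It is not the Aggregator's actual code: `preferred_idno` is a hypothetical helper, IDNo elements are assumed to be `(agency, value)` pairs in document order, and the fallback to the first remaining IDNo reflects the "other" case mentioned in the question.

```python
from typing import List, Optional, Tuple

PRIORITY = ["DOI", "Handle", "URN", "ARK"]

def preferred_idno(idnos: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Pick one IDNo by agency priority, first match wins within an agency."""
    for wanted in PRIORITY:
        for agency, value in idnos:
            if agency == wanted:
                return (agency, value)
    # No IDNo with a preferred agency: fall back to the first one, if any.
    return idnos[0] if idnos else None
```

For a record with both a local number and a DOI, such as `[("UKDA", "999"), ("DOI", "10.5255/UKDA-SN-999-1")]`, the DOI pair is the one selected.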

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


All UKDS records are marked as having invalid PIDs, but the error message shows a false positive:

{"valid":false,"invalidPIDs":[{"agency":"UKDA","uri":"999","state":["AGENCY_PRESENT"]},{"agency":"DOI","uri":"10.5255/UKDA-SN-999-1","state":["AGENCY_PRESENT","AGENCY_ALLOWED_VALUE"]}]}

cessda-bitbucket-importer commented 1 year ago

Original comment by Taina Jääskeläinen.


This would be good to fix if the false positives mean that the record is not validated and not included in the aggregator. I do not know whether such records are dropped at present.

cessda-bitbucket-importer commented 1 year ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


I’ve added a check for DOIs that are not HTTP(S) links so that they are marked as valid.
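The change itself is not shown in this thread, so the following is only a sketch of how bare DOIs such as `10.11587/AROIHY` might be accepted alongside HTTP(S) links. The regular expression and the `normalise_doi` helper are assumptions, and the pattern is a loose shape check rather than a full DOI syntax validator.

```python
import re
from typing import Optional

# Loose DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# An optional resolver URL or "doi:" prefix is tolerated and stripped.
DOI_RE = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:)?(10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)

def normalise_doi(value: str) -> Optional[str]:
    """Return the DOI as an https://doi.org/ link, or None if it is not DOI-shaped."""
    match = DOI_RE.match(value.strip())
    return f"https://doi.org/{match.group(1)}" if match else None
```

Treating any non-`None` result as valid would let both `10.11587/AROIHY` and `10.5255/UKDA-SN-999-1` pass, while a plain study number such as `999` would still be rejected.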