Closed cessda-bitbucket-importer closed 1 year ago
Original comment by John Shepherdson (GitHub: john-shepherdson).
Need to do bulk checks and provide feedback to SPs, as per CMV
Original comment by John Shepherdson (GitHub: john-shepherdson).
According to the bulk validation run from 10 October 2021 (results in this GDrive folder), SND, UniData, SODHA and SoDaNet have no validation errors against any of their records.
Example of valid SoDaNet titlStmt:
Example of a valid SND titlStmt:
Example of a valid SODHA titlStmt:
https://www.sodha.be/oai?verb=GetRecord&identifier=doi:10.34934/DVN/0RKEGK&metadataPrefix=oai_ddi
Original comment by John Shepherdson (GitHub: john-shepherdson).
However, here is a 'valid' Unidata record:
http://oai.unidata.unimib.it/v0/oai?verb=GetRecord&identifier=SN068&metadataPrefix=oai_ddi25
Original comment by John Shepherdson (GitHub: john-shepherdson).
In the SoDaNet case, the agency attribute contains one of the allowed values and the IDNo is a valid PID.
In the UniData case, the agency attribute does not contain one of the allowed values and the IDNo is not a valid PID.
Original comment by John Shepherdson (GitHub: john-shepherdson).
So CMV is capable of indicating false positives, in regard to PIDs.
It appears that a 2-pass approach is required.
Original comment by Matthew Morris (GitHub: matthew-morris-cessda).
A filter could be added to the validator or the indexer to enforce these criteria. A basic check would be to see whether an IDNo element contains a valid URN, which should satisfy 1, 2 and 4.
Original comment by John Shepherdson (GitHub: john-shepherdson).
Re DDI 3.x PID checks
From: Bell, Darren S
Date: Tue, 30 Nov 2021 at 10:37
Subject: RE: [CESSDA] Bulk metadata validation of your OAI-PMH endpoint
To: John Shepherdson
Cc: Beeken, Jeannine C T, Bolton, Sharon E , Taina Jääskeläinen (TAU), Hassan, Steve
Depends which object you’re trying to capture PIDs for but the dataset will be ddi:DDIInstance/r:ResourcePackage/pi:PhysicalInstance/r:Citation/r:InternationalIdentifier/r:IdentifierContent
and the study will be /ddi:DDIInstance/s:StudyUnit/r:Citation/r:InternationalIdentifier/r:IdentifierContent. However, as I understand it, EQB is going to be harvesting into Colectica, which will be using 3.3 and the profile for that is not due until Dec 22 under MDO D16, although it will likely be done sooner.
Best, Darren
==========================
Hi both,
I’m guessing that the PID check will be done against the elements that will be used for PID in the final EQB and CDC DDI 3.2 profiles.
Timetable for these was in March 2022, if I remember correctly.
In the draft profile, these elements seem to be
/ddi:DDIInstance/s:StudyUnit/r:Citation/r:InternationalIdentifier/r:IdentifierContent
/ddi:DDIInstance/s:StudyUnit/r:Citation/r:InternationalIdentifier/r:ManagingAgency
Darren, if DDI 3.3 has another element that is specifically for PIDs and not just an international identifier, then MDO has things to discuss in the mapping from DDI 3.2 XPaths to DDI 3.3 XPaths.
Hope I did not misunderstand something (always possible).
All the best,
Taina
==========================
Hi Taina – happily, the 3.3 XPaths are the same as 3.2 in these particular instances. Best, Darren
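For illustration, the study-level XPaths quoted above could be read with Python's standard library roughly as follows. This is a minimal sketch, not the validator's code: the namespace URIs are assumed to be the DDI-Lifecycle 3.2 ones, the function name `study_identifiers` is hypothetical, and the sample document and its identifier are made up and far from a complete DDI instance.

```python
import xml.etree.ElementTree as ET

# Assumption: DDI-Lifecycle 3.2 namespace URIs. A 3.3 document would need
# the corresponding 3.3 URIs, even though the XPaths themselves are the same.
NS = {
    "ddi": "ddi:instance:3_2",
    "s": "ddi:studyunit:3_2",
    "r": "ddi:reusable:3_2",
}

# Path relative to the ddi:DDIInstance root element.
STUDY_ID_PATH = "s:StudyUnit/r:Citation/r:InternationalIdentifier"


def study_identifiers(xml_text):
    """Return (ManagingAgency, IdentifierContent) pairs for the study."""
    root = ET.fromstring(xml_text)
    pairs = []
    for ident in root.findall(STUDY_ID_PATH, NS):
        agency = ident.findtext("r:ManagingAgency", default="", namespaces=NS)
        content = ident.findtext("r:IdentifierContent", default="", namespaces=NS)
        pairs.append((agency.strip(), content.strip()))
    return pairs


# Hypothetical, heavily trimmed sample document.
sample = """<ddi:DDIInstance xmlns:ddi="ddi:instance:3_2"
    xmlns:s="ddi:studyunit:3_2" xmlns:r="ddi:reusable:3_2">
  <s:StudyUnit>
    <r:Citation>
      <r:InternationalIdentifier>
        <r:IdentifierContent>10.1234/example</r:IdentifierContent>
        <r:ManagingAgency>DOI</r:ManagingAgency>
      </r:InternationalIdentifier>
    </r:Citation>
  </s:StudyUnit>
</ddi:DDIInstance>"""

print(study_identifiers(sample))
```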
Original comment by Matthew Morris (GitHub: matthew-morris-cessda).
[link to pull request removed]
A PID check was added to the validator and was merged in this PR.
Original comment by John Shepherdson (GitHub: john-shepherdson).
Recent PID checking of Progedo records has thrown up a number of false positives, as the rules do not handle the situation where there are 2 IDNo fields being used for different purposes in the same record.
Example records include doi:10.48756/ined-IE0245-2170, doi:10.48756/ined-IE0244-4961, doi:10.48756/ined-IE0237-1486
Original comment by John Shepherdson (GitHub: john-shepherdson).
If an IDNo element contains a valid PID (i.e. valid agency plus valid URI), then the record should pass the PID check, regardless of the contents of any other IDNo elements that may be present.
Original comment by Taina Jääskeläinen.
Yes, exactly. Trying to clarify what needs to be checked somewhere under the hood at CDC end. The check focuses on the @agency attribute.
This is the only thing that needs to be checked. For instance, the IDNo provided by UniData copied above does not meet the requirements since the @agency value is ‘UniData’. And SND has a DOI but uses ‘DataCite’ as the @agency value when it should be ‘DOI’.
The validity of the PID itself is not checked.
An organisation may provide more than one IDNo element, and more than one of them may be a PID, but it is enough to check that at least one IDNo element has the correct @agency value.
Hope this helps.
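The rule described in this comment (pass if at least one IDNo has an accepted @agency value, ignoring the other IDNo elements) could be sketched like this. It is an illustration only: the set of accepted agency values is an assumption based on the agencies named elsewhere in this thread, the comparison is assumed to be case-sensitive, and `has_pid_agency` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Assumption: accepted @agency values, taken from the agencies mentioned in
# this thread. The real controlled list may differ.
ALLOWED_AGENCIES = {"DOI", "Handle", "URN", "ARK"}


def has_pid_agency(xml_text):
    """Return True if at least one IDNo element has an accepted @agency."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Drop any "{namespace}" prefix so namespaced documents also work.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "IDNo" and elem.get("agency") in ALLOWED_AGENCIES:
            return True
    return False


# Two IDNo elements used for different purposes: study number and PID.
sample = """<titlStmt>
  <IDNo agency="UKDA">999</IDNo>
  <IDNo agency="DOI">10.5255/UKDA-SN-999-1</IDNo>
</titlStmt>"""

print(has_pid_agency(sample))
```

With this rule, a record whose only IDNo carries agency="UniData" or agency="DataCite" would fail, which matches the two cases described above.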
Original comment by John Shepherdson (GitHub: john-shepherdson).
Not on critical path for v3.0.0 release
Original comment by John Shepherdson (GitHub: john-shepherdson).
An example of a false positive from an AUSSDA record:
<IDNo xml:lang="en" agency="DOI">10.11587/AROIHY</IDNo>
The PID error is:
{"valid":false,"invalidPIDs":[{"agency":"DOI","uri":"10.11587/AROIHY","state":["AGENCY_PRESENT","AGENCY_ALLOWED_VALUE"]}]}
According to https://dx.doi.org/ the DOI is valid (resolves to https://data.aussda.at/dataset.xhtml?persistentId=doi:10.11587/AROIHY)
Original comment by John Shepherdson (GitHub: john-shepherdson).
Note on policy used by Aggregator:
One thing that comes to mind is how you will process input files with multiple IDNo. How do you decide which to keep? Would that be DOI>Handle>URN>ARK>other with first matching if multiple of the same?
Yes, this is the current implementation. The agency attribute's value is used to get the preferred ID.
Preference by priority:
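A rough sketch of such a selection, assuming the order quoted in the question above (DOI, then Handle, URN, ARK, then anything else) and keeping the first match when several IDNo elements share an agency. This is not the Aggregator's code; `PRIORITY` and `preferred_id` are hypothetical names.

```python
from typing import List, Optional, Tuple

# Assumption: the order quoted in the question above. Any other agency
# (or a missing agency) ranks last.
PRIORITY = ["DOI", "Handle", "URN", "ARK"]

IdNo = Tuple[Optional[str], str]  # (agency attribute, element text)


def preferred_id(id_nos: List[IdNo]) -> Optional[IdNo]:
    """Pick the IDNo whose agency ranks highest; None if there are none."""

    def rank(item: IdNo) -> int:
        agency = item[0]
        return PRIORITY.index(agency) if agency in PRIORITY else len(PRIORITY)

    # min() returns the first item with the lowest rank, so ties are
    # resolved in document order.
    return min(id_nos, key=rank, default=None)


print(preferred_id([("UKDA", "999"), ("DOI", "10.5255/UKDA-SN-999-1")]))
```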
Original comment by John Shepherdson (GitHub: john-shepherdson).
All UKDS records are marked as having invalid PIDs, but the error message shows a false positive:
{"valid":false,"invalidPIDs":[{"agency":"UKDA","uri":"999","state":["AGENCY_PRESENT"]},{"agency":"DOI","uri":"10.5255/UKDA-SN-999-1","state":["AGENCY_PRESENT","AGENCY_ALLOWED_VALUE"]}]}
Original comment by Taina Jääskeläinen.
This would be good to fix if the false positives mean that the record is not validated and not included in the aggregator. I do not know whether such records are dropped at present.
Original comment by Matthew Morris (GitHub: matthew-morris-cessda).
I’ve added a check for DOIs that are not HTTP(S) links so that they are marked as valid.
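As an illustration of what accepting bare DOIs might look like (not the code that was merged), the sketch below treats a value as a DOI if it matches a loose pattern, with or without a resolver or "doi:" prefix. The pattern and the prefix list are assumptions, and `looks_like_doi` is a hypothetical name.

```python
import re

# Assumption: a loose pattern for the DOI name itself, i.e. "10.", a
# registrant code, "/", and a non-empty suffix without whitespace.
DOI_NAME = re.compile(r"10\.\d{4,9}/\S+")

# Assumption: prefixes that may precede the DOI name.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def looks_like_doi(value):
    """Return True for bare DOIs as well as resolver links and doi: forms."""
    value = value.strip()
    for prefix in DOI_PREFIXES:
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    return DOI_NAME.fullmatch(value) is not None


for candidate in ["10.11587/AROIHY", "10.5255/UKDA-SN-999-1", "999"]:
    print(candidate, looks_like_doi(candidate))
```

This only checks the shape of the value; it does not confirm that the DOI actually resolves.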
Original report on BitBucket by Taina Jääskeläinen.
PID is mandatory. The validator cannot check it because the IDNo is used for two purposes (study number and PID).
So CDC would need to check that: