Open robyngit opened 1 week ago
Metacat only validates selected data formats based on their formatId. As far as I know, only XML metadata documents are validated, and then only if they have an XML schema registered with Metacat for that document format. We've talked about adding a SHACL validator for RDF resource maps, but haven't done so to date. As RDF is an open-world model, and any triples you want can be added, it's hard to say what the right schema to enforce would be. I suppose enforcing the bare minimum structure would make sense -- e.g., that there is an ore:ResourceMap with an ore:Aggregation, and that each member of the aggregation has a dc:identifier. DataONE lists its resource requirements here: https://dataoneorg.github.io/api-documentation/design/DataPackage.html#generating-resource-maps
So, from the DataONE rules linked above, the items to validate might include:
- an ore:ResourceMap and an ore:Aggregation
- an ore:describes/ore:isDescribedBy relationship between the resource map and the aggregation
- a dcterms:identifier field containing the DataONE identifier

Here's what a minimal resource map might contain if the package has one metadata object and one data object and follows these rules:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix cito: <http://purl.org/spar/cito/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ore: <http://www.openarchives.org/ore/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix provone: <http://purl.dataone.org/provone/2015/01/15/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dataone: <https://cn.dataone.org/cn/v2/resolve/> .
<dataone:METADATA_ID>
    dcterms:identifier "METADATA_ID"^^xsd:string ;
    cito:documents <dataone:METADATA_ID>, <dataone:DATAOBJ_ID> ;
    cito:isDocumentedBy <dataone:METADATA_ID> ;
    ore:isAggregatedBy <dataone:RESOURCE_MAP_ID#aggregation> .

<dataone:RESOURCE_MAP_ID>
    dcterms:creator [
        a dcterms:Agent ;
        foaf:name "DataONE R Client"^^xsd:string
    ] ;
    dcterms:identifier "RESOURCE_MAP_ID"^^xsd:string ;
    dcterms:modified "2024-10-08T20:24:47Z"^^xsd:dateTime ;
    ore:describes <dataone:RESOURCE_MAP_ID#aggregation> ;
    a ore:ResourceMap .

<dataone:RESOURCE_MAP_ID#aggregation>
    dc:title "DataONE Aggregation" ;
    ore:aggregates <dataone:METADATA_ID>, <dataone:DATAOBJ_ID> ;
    a ore:Aggregation .

<dataone:DATAOBJ_ID>
    dcterms:identifier "DATAOBJ_ID"^^xsd:string ;
    cito:isDocumentedBy <dataone:METADATA_ID> ;
    ore:isAggregatedBy <dataone:RESOURCE_MAP_ID#aggregation> .
flowchart TD
A(ore:ResourceMap RESOURCE_MAP_ID) -->|ore:describes| B(ore:Aggregation)
B --> |ore:aggregates| C(METADATA_ID)
B --> |ore:aggregates| D(DATAOBJ_ID)
C --> |cito:documents| C
C --> |cito:documents| D
So, we'd need SHACL rules for the conditions listed above. Would that be sufficient? Also, how would we deal with resource maps that are currently in the system but are not valid according to those rules?
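To make the question concrete, here is a rough sketch of SHACL shapes for the conditions listed above. It is only a starting point, not an agreed DataONE schema: the `ex:` namespace and the shape names are hypothetical, and the choice to require `dcterms:identifier` on the resource map as well as on each aggregated object is an assumption based on the minimal example.

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ore: <http://www.openarchives.org/ore/terms/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex: <https://example.org/shapes#> .  # hypothetical namespace for these shapes

# Every ore:ResourceMap must describe at least one ore:Aggregation
# and carry an identifier literal.
ex:ResourceMapShape
    a sh:NodeShape ;
    sh:targetClass ore:ResourceMap ;
    sh:property [
        sh:path ore:describes ;
        sh:class ore:Aggregation ;
        sh:minCount 1
    ] ;
    sh:property [
        sh:path dcterms:identifier ;
        sh:nodeKind sh:Literal ;
        sh:minCount 1
    ] .

# Every ore:Aggregation must aggregate at least one object.
ex:AggregationShape
    a sh:NodeShape ;
    sh:targetClass ore:Aggregation ;
    sh:property [
        sh:path ore:aggregates ;
        sh:minCount 1
    ] .

# Every aggregated object must carry an identifier literal.
ex:AggregatedObjectShape
    a sh:NodeShape ;
    sh:targetObjectsOf ore:aggregates ;
    sh:property [
        sh:path dcterms:identifier ;
        sh:nodeKind sh:Literal ;
        sh:minCount 1
    ] .
```

One caveat: shapes that use `sh:targetClass` only check nodes that exist, so a graph with no ore:ResourceMap at all would still pass. Requiring that a resource map is present would need a separate check outside these shapes.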
While investigating our collection of submission errors in MetacatUI, I discovered that it's possible to submit an invalid resource map to Metacat and receive a 200 status without any error. I would expect resource maps to be validated in the same way that sysmeta and EML objects are.
Here's a reproducible example:
Create an invalid resource map. In my case, I saved a resource_map.xml file with the following text:
<?xml version="1.0" encod
Create sysmeta for the object. I used this sysmeta_template.rdf.xml. You'll want to set the submitter to your ORCID.
Generate a PID, update the sysmeta template, then upload the resource map + sysmeta to a test node:
# 2. Generate the pid
PID="resource_map_urn:uuid:$(uuidgen)"

# 3. Make a copy of the sysmeta with the new PID
cp sysmeta_template.rdf.xml sysmeta.rdf.xml
# BSD/macOS sed syntax; GNU sed takes -i without the empty '' argument
sed -i '' "s/RESOURCE MAP ID HERE/$PID/" sysmeta.rdf.xml

# printf is used instead of echo so that \n is interpreted in any shell
printf '\nUploading bad resource map with PID: %s\n' "$PID"
printf '\nResource Map:\n\n'
cat resource_map.xml
printf '\n\nSysmeta:\n\n'
cat sysmeta.rdf.xml
printf '\n\n\n OUTPUT FROM CURL COMMAND: \n\n'

/opt/homebrew/opt/curl/bin/curl -i \
  -X POST \
  -H "Accept: */*" \
  -H "Authorization: Bearer $TOKEN" \
  -F "pid=$PID" \
  -F "sysmeta=@sysmeta.rdf.xml;type=application/xml" \
  -F "object=@resource_map.xml;type=application/xml" \
  "https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object"

printf '\n\n Done\n'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>