NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

Invalid resource maps can be submitted to Metacat without error #1981

Open robyngit opened 1 week ago

robyngit commented 1 week ago

While investigating our collection of submission errors in MetacatUI, I discovered that it's possible to submit an invalid resource map to Metacat and receive a 200 status without any error. I would expect that resource maps would be validated in the same way that sysmeta and EML objects are.

Here's a reproducible example:

  1. Create an invalid resource map. In my case, I saved a resource_map.xml file whose entire contents were the truncated text: `<?xml version="1.0" encod`

  2. Create sysmeta for the object. I used this sysmeta_template.rdf.xml:

    <d1_v2.0:systemMetadata xmlns:d1_v2.0="http://ns.dataone.org/service/types/v2.0"
    xmlns:d1="http://ns.dataone.org/service/types/v1">
    <serialVersion>0</serialVersion>
    <identifier>RESOURCE MAP ID HERE</identifier>
    <formatId>http://www.openarchives.org/ore/terms</formatId>
    <size>25</size>
    <checksum algorithm="MD5">9614dd15192a58ae2a91a6243e70a992</checksum>
    <submitter>http://orcid.org/0000-0002-1615-3963</submitter>
    <rightsHolder>http://orcid.org/0000-0002-1615-3963</rightsHolder>
    <accessPolicy>
    <allow>
      <subject>public</subject>
      <permission>read</permission>
    </allow>
    <allow>
      <subject>CN=arctic-data-admins,DC=dataone,DC=org</subject>
      <permission>read</permission>
      <permission>write</permission>
      <permission>changePermission</permission>
    </allow>
    </accessPolicy>
    <fileName>resource_map.xml</fileName>
    </d1_v2.0:systemMetadata>

    You'll want to set the submitter to your ORCID

  3. Generate a PID, update the sysmeta template, then upload the resource map + sysmeta to a test node:

    
    # 1. Set your token
    TOKEN="your-token-here"

    # 2. Generate the PID
    PID="resource_map_urn:uuid:$(uuidgen)"

    # 3. Make a copy of the sysmeta with the new PID
    cp sysmeta_template.rdf.xml sysmeta.rdf.xml
    sed -i '' "s/RESOURCE MAP ID HERE/$PID/" sysmeta.rdf.xml

    echo "\nUploading bad resource map with PID: $PID"

    echo "\nResource Map:\n"
    cat resource_map.xml

    echo "\n\nSysmeta:\n"
    cat sysmeta.rdf.xml

    echo "\n\n\n OUTPUT FROM CURL COMMAND: \n"

    /opt/homebrew/opt/curl/bin/curl -i \
        -X POST \
        -H "Accept: */*" \
        -H "Authorization: Bearer $TOKEN" \
        -F "pid=$PID" \
        -F "sysmeta=@sysmeta.rdf.xml;type=application/xml" \
        -F "object=@resource_map.xml;type=application/xml" \
        "https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object"

    echo "\n\n Done"


  4. See that the server returns an `HTTP/1.1 200 200` status along with the PID for the resource map:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>

    resource_map_urn:uuid:7286F53A-D29B-4087-9CE8-DEE244EEE5F6


The file then exists on the server but, of course, is not _really_ a resource map, e.g.
- [meta query](https://dev.nceas.ucsb.edu/knb/d1/mn/v2/meta/resource_map_urn:uuid:7286F53A-D29B-4087-9CE8-DEE244EEE5F6)
- [object query](https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/resource_map_urn:uuid:7286F53A-D29B-4087-9CE8-DEE244EEE5F6)
- [solr query](https://dev.nceas.ucsb.edu/knb/d1/mn/v2/query/solr/?q=id:%22resource_map_urn:uuid:7286F53A-D29B-4087-9CE8-DEE244EEE5F6%22)

---

Here's the code above as downloadable files (just remember to remove the `.txt`).
[sysmeta_template.rdf.xml.txt](https://github.com/user-attachments/files/17298578/sysmeta_template.rdf.xml.txt)
[create_res_map.sh.txt](https://github.com/user-attachments/files/17298788/create_res_map.sh.txt)
[resource_map.xml.txt](https://github.com/user-attachments/files/17298580/resource_map.xml.txt)
mbjones commented 1 week ago

Metacat only validates selected data formats based on their formatId. As far as I know, only XML metadata documents are validated, and then only if they have an XML schema registered with Metacat for that document format. We've talked about adding a SHACL validator for RDF resource maps, but haven't done so to date.

As RDF is an open-world model, and any triples you want can be added, it's hard to say what the right schema to enforce would be. I suppose enforcing the bare minimum structure would make sense -- e.g., that there is an ore:ResourceMap with an ore:Aggregation, and that each member of the aggregation has a dcterms:identifier. DataONE lists its resource map requirements here: https://dataoneorg.github.io/api-documentation/design/DataPackage.html#generating-resource-maps

So from those DataONE rules linked above, the items to validate might include:

  1. Document is well-formed RDF
  2. All DataONE objects in the map MUST be expressed as a URI using DataONE's resolving service
  3. The graph MUST contain an ore:ResourceMap and an ore:Aggregation
  4. The resource map MUST assert a triple with the ore:describes/ore:isDescribedBy relationship between the resource map and the aggregation
  5. Each DataONE object in the aggregation MUST be described with a dcterms:identifier field containing the DataONE identifier.
  6. When expressing an identifier in a URI, it must be URL-encoded. When expressing it in the dcterms:identifier field, it must not be. (Any XML encoding would need to be applied as well; in the example below, none is needed.)

Here's what a minimal resource map might contain if the package has one metadata object and one data object and follows these rules:

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix cito: <http://purl.org/spar/cito/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ore: <http://www.openarchives.org/ore/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix provone: <http://purl.dataone.org/provone/2015/01/15/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dataone: <https://cn.dataone.org/cn/v2/resolve/> .

<dataone:METADATA_ID>
    dcterms:identifier "METADATA_ID"^^xsd:string ;
    cito:documents <dataone:METADATA_ID>, <dataone:DATAOBJ_ID> ;
    cito:isDocumentedBy <dataone:METADATA_ID> ;
    ore:isAggregatedBy <dataone:RESOURCE_MAP_ID#aggregation> .

<dataone:RESOURCE_MAP_ID>
    dcterms:creator [
        a dcterms:Agent ;
        foaf:name "DataONE R Client"^^xsd:string
    ] ;
    dcterms:identifier "RESOURCE_MAP_ID"^^xsd:string ;
    dcterms:modified "2024-10-08T20:24:47Z"^^xsd:dateTime ;
    ore:describes <dataone:RESOURCE_MAP_ID#aggregation> ;
    a ore:ResourceMap .

<dataone:RESOURCE_MAP_ID#aggregation>
    dc:title "DataONE Aggregation" ;
    ore:aggregates <dataone:METADATA_ID>, <dataone:DATAOBJ_ID> ;
    a ore:Aggregation .

<dataone:DATAOBJ_ID>
    dcterms:identifier "DATAOBJ_ID"^^xsd:string ;
    cito:isDocumentedBy <dataone:METADATA_ID> ;
    ore:isAggregatedBy <dataone:RESOURCE_MAP_ID#aggregation> .
```

```mermaid
flowchart TD
    A(ore:ResourceMap RESOURCE_MAP_ID) -->|ore:describes| B(ore:Aggregation)
    B --> |ore:aggregates| C(METADATA_ID)
    B --> |ore:aggregates| D(DATAOBJ_ID)
    C --> |cito:documents| C
    C --> |cito:documents| D
```
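For contrast, here is a sketch (my own construction, with hypothetical IDs) of a map that parses fine as RDF but would fail the structural rules above: it declares an aggregation but no ore:ResourceMap, asserts no ore:describes triple, and its members carry no dcterms:identifier.

```turtle
@prefix ore:     <http://www.openarchives.org/ore/terms/> .
@prefix dataone: <https://cn.dataone.org/cn/v2/resolve/> .

# Well-formed RDF, so rule 1 passes, but:
#   - no resource typed as ore:ResourceMap (rule 3 fails)
#   - no ore:describes / ore:isDescribedBy triple (rule 4 fails)
#   - aggregated objects have no dcterms:identifier (rule 5 fails)
<dataone:BAD_MAP_ID#aggregation>
    a ore:Aggregation ;
    ore:aggregates <dataone:METADATA_ID>, <dataone:DATAOBJ_ID> .
```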

So, we'd need SHACL rules for those conditions listed above. Would that be sufficient? Also, how would we deal with RMs that are currently in the system but are not valid according to those rules?
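To make that concrete, here's a rough, untested sketch of what shapes for rules 3-5 might look like (the shape names and `ex:` namespace are hypothetical). One caveat: "the graph MUST contain an ore:ResourceMap at all" can't be expressed with class-targeted shapes alone, since they simply don't fire on a graph with no instances of the class; that presence check would need a SPARQL-based constraint or a check outside SHACL.

```turtle
@prefix sh:      <http://www.w3.org/ns/shacl#> .
@prefix ore:     <http://www.openarchives.org/ore/terms/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <https://example.org/shapes/> .  # hypothetical shapes namespace

# Rule 4: each ore:ResourceMap must ore:describe exactly one ore:Aggregation.
# (ore:isDescribedBy is the inverse and could be checked the same way.)
ex:ResourceMapShape
    a sh:NodeShape ;
    sh:targetClass ore:ResourceMap ;
    sh:property [
        sh:path ore:describes ;
        sh:class ore:Aggregation ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .

# Rule 3 (in part): an aggregation must aggregate at least one resource.
ex:AggregationShape
    a sh:NodeShape ;
    sh:targetClass ore:Aggregation ;
    sh:property [
        sh:path ore:aggregates ;
        sh:nodeKind sh:IRI ;
        sh:minCount 1 ;
    ] .

# Rule 5: every aggregated object must carry a dcterms:identifier.
# (Rule 2 could be layered on via an sh:pattern constraint requiring the
# DataONE resolve-service URI prefix.)
ex:AggregatedResourceShape
    a sh:NodeShape ;
    sh:targetObjectsOf ore:aggregates ;
    sh:property [
        sh:path dcterms:identifier ;
        sh:minCount 1 ;
    ] .
```

Running these through a SHACL engine against the minimal map above should produce no violations, while the truncated file from the original report would already fail at the parse step (rule 1), before shape validation even starts.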