DataONEorg / rdataone

R package for reading and writing data at DataONE data repositories
http://doi.org/10.5063/F1M61H5X

Verify package relationships before upload and warn or repair #288

Open gothub opened 2 years ago

gothub commented 2 years ago

A simple programming error when using rdataone can create a package that serializes as a valid RDF/XML resource map (resmap) but cannot be indexed by DataONE. rdataone should be able to catch the case shown below before uploading the package, and either print a warning or repair the package before upload.

The following program shows this case, with source comments indicating the erroneous line and its effect.

# Download an existing DataPackage and replace one of its data members
library(dataone)
library(datapack)
library(uuid)

d1c_test <- D1Client("STAGING", "urn:node:mnTestARCTIC")
packageID <- "resource_map_urn:uuid:825cee81-e676-4a58-9a32-054884376c0c"
dp <- getDataPackage(d1c_test, packageID, lazyLoad = TRUE, quiet = FALSE)
dataID <- selectMember(dp, name = "sysmeta@fileName", value = "OwlNightj.csv")
# The next line should be:
# dp <- replaceMember(dp, dataID, replacement=system.file("./extdata/pkg-example/binary.csv.zip", package="datapack"), formatId="application/zip")
# The following erroneous line discards the return value, so the datapackage 'dp' is not updated correctly; the package relationships become corrupted and the package is not indexable by DataONE
replaceMember(dp, dataID, replacement=system.file("./extdata/pkg-example/binary.csv.zip", package="datapack"), formatId="application/zip")
filePath <- file.path(sprintf("%s/%s.rdf", tempdir(), packageID))
status <- serializePackage(dp, filePath, id=packageID, resolveURI="https://cn-stage.test.dataone.org/cn/v2/resolve")
writeLines(readLines(filePath))

The resource map below shows that the pid urn:uuid:301e805a-66cf-41e1-99a4-2c459638802f lacks the DataONE 'resolve' URL prefix, as seen on lines 34 and 66:

  1 <?xml version="1.0" encoding="utf-8"?>
  2 <rdf:RDF xmlns:cito="http://purl.org/spar/cito/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:ore="http://www.openarchives.org/ore/terms/" xmlns:prov="http://www.w3.org/ns/prov#" xmlns:provone="http://purl.dataone.org/provone/2015/01/15/ontology#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  3   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A4abb0f9f-260c-4b43-a2a7-df7578703a82">
  4     <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">urn:uuid:4abb0f9f-260c-4b43-a2a7-df7578703a82</dcterms:identifier>
  5   </rdf:Description>
  6   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
  7     <rdf:type rdf:resource="http://www.openarchives.org/ore/terms/ResourceMap"/>
  8   </rdf:Description>
  9   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A14b1048d-ce39-4439-8f7c-0d05a7cc6cfc">
 10     <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">urn:uuid:14b1048d-ce39-4439-8f7c-0d05a7cc6cfc</dcterms:identifier>
 11   </rdf:Description>
 12   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A85eaf569-92c0-4b19-b13c-2b809378d92e">
 13     <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">urn:uuid:85eaf569-92c0-4b19-b13c-2b809378d92e</dcterms:identifier>
 14   </rdf:Description>
 15   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
 16     <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">urn:uuid:825cee81-e676-4a58-9a32-054884376c0c</dcterms:identifier>
 17   </rdf:Description>
 18   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
 19     <ore:describes rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c#aggregation"/>
 20   </rdf:Description>
 21   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A85eaf569-92c0-4b19-b13c-2b809378d92e">
 22     <ore:isAggregatedBy rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c#aggregation"/>
 23   </rdf:Description>
 24   <rdf:Description rdf:nodeID="_fc7aa8bb-e3a6-4906-9816-447ec5d95204">
 25     <foaf:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">DataONE R Client</foaf:name>
 26   </rdf:Description>
 27   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
 28     <cito:documents rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c"/>
 29   </rdf:Description>
 30   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
 31     <cito:documents rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A4abb0f9f-260c-4b43-a2a7-df7578703a82"/>
 32   </rdf:Description>
 33   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
 34     <cito:documents rdf:resource="urn:uuid:301e805a-66cf-41e1-99a4-2c459638802f"/>
 35   </rdf:Description>
 36   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
 37     <cito:documents rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A14b1048d-ce39-4439-8f7c-0d05a7cc6cfc"/>
 38   </rdf:Description>
 39   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
 40     <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">resource_map_urn:uuid:825cee81-e676-4a58-9a32-054884376c0c</dcterms:identifier>
 41   </rdf:Description>
 42   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c#aggregation">
 43     <ore:aggregates rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A85eaf569-92c0-4b19-b13c-2b809378d92e"/>
 44   </rdf:Description>
 45   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c#aggregation">
 46     <ore:aggregates rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c"/>
 47   </rdf:Description>
 48   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c#aggregation">
 49     <ore:aggregates rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A4abb0f9f-260c-4b43-a2a7-df7578703a82"/>
 50   </rdf:Description>
 51   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c#aggregation">
 52     <ore:aggregates rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A14b1048d-ce39-4439-8f7c-0d05a7cc6cfc"/>
 53   </rdf:Description>
 54   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
 55     <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2022-01-25T22:15:40Z</dcterms:modified>
 56   </rdf:Description>
 57   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
 58     <cito:isDocumentedBy rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c"/>
 59   </rdf:Description>
 60   <rdf:Description rdf:nodeID="_fc7aa8bb-e3a6-4906-9816-447ec5d95204">
 61     <rdf:type rdf:resource="http://purl.org/dc/terms/Agent"/>
 62   </rdf:Description>
 63   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A4abb0f9f-260c-4b43-a2a7-df7578703a82">
 64     <ore:isAggregatedBy rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c#aggregation"/>
 65   </rdf:Description>
 66   <rdf:Description rdf:about="urn:uuid:301e805a-66cf-41e1-99a4-2c459638802f">
 67     <cito:isDocumentedBy rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c"/>
 68   </rdf:Description>
 69   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c#aggregation">
 70     <dc:title>DataONE Aggregation</dc:title>
 71   </rdf:Description>
 72   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c#aggregation">
 73     <rdf:type rdf:resource="http://www.openarchives.org/ore/terms/Aggregation"/>
 74   </rdf:Description>
 75   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A14b1048d-ce39-4439-8f7c-0d05a7cc6cfc">
 76     <ore:isAggregatedBy rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c#aggregation"/>
 77   </rdf:Description>
 78   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
 79     <ore:isAggregatedBy rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c#aggregation"/>
 80   </rdf:Description>
 81   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c">
 82     <dcterms:creator rdf:nodeID="_fc7aa8bb-e3a6-4906-9816-447ec5d95204"/>
 83   </rdf:Description>
 84   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A4abb0f9f-260c-4b43-a2a7-df7578703a82">
 85     <cito:isDocumentedBy rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c"/>
 86   </rdf:Description>
 87   <rdf:Description rdf:about="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A14b1048d-ce39-4439-8f7c-0d05a7cc6cfc">
 88     <cito:isDocumentedBy rdf:resource="https://cn-stage.test.dataone.org/cn/v2/resolve/urn%3Auuid%3A825cee81-e676-4a58-9a32-054884376c0c"/>
 89   </rdf:Description>
 90 </rdf:RDF>

This pid came from the original, downloaded package and should have been deleted from the datapackage 'dp' (and then replaced), along with all its relationships. When the package is serialized, all package members have their pids 'promoted' to include the DataONE resolve URL. Because this pid was no longer in the package member list but its relationships remained, the pid was not 'promoted'. This caused a problem for the D1 indexer, because the pid appears with an 'isDocumentedBy' relationship but is not resolvable.
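The 'promotion' step can be illustrated with a minimal sketch in base R. The helper `promote_pids` is hypothetical and is not the datapack implementation; it only assumes that promotion means prefixing a member's pid with the resolve URL and percent-encoding it, which matches the URLs seen in the resource map above.

```r
# Hypothetical helper for illustration only; not a datapack or dataone function.
# Pids that are package members get the resolve URL prefix; all other pids
# are passed through unchanged, which is how a stale pid stays unresolvable.
promote_pids <- function(pids, members, resolve_uri) {
  vapply(pids, function(pid) {
    if (pid %in% members) {
      paste0(resolve_uri, "/", URLencode(pid, reserved = TRUE))
    } else {
      pid
    }
  }, character(1), USE.NAMES = FALSE)
}

resolve_uri <- "https://cn-stage.test.dataone.org/cn/v2/resolve"
members <- c("urn:uuid:4abb0f9f-260c-4b43-a2a7-df7578703a82",
             "urn:uuid:14b1048d-ce39-4439-8f7c-0d05a7cc6cfc")
# The pid left behind in the relationships after the failed replacement
stale <- "urn:uuid:301e805a-66cf-41e1-99a4-2c459638802f"

promoted <- promote_pids(c(members[1], stale), members, resolve_uri)
print(promoted)
```

In this sketch the first pid comes back with the resolve prefix, while the stale pid is returned as a bare `urn:uuid:` value, mirroring lines 34 and 66 of the resource map.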

rdataone should be able to detect pids of this type, which have isDocumentedBy relationships but are no longer package members. A warning could then be printed, or the offending relationships removed.
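One possible shape for such a check is sketched below in base R. The helpers `find_orphaned_pids` and `remove_orphaned_relations` are hypothetical names, not existing rdataone or datapack functions. The sketch assumes the relationships are available as a data.frame with `subject`, `predicate` and `object` columns holding unpromoted pids and full predicate URIs, and that the member pids are available as a character vector.

```r
# Hypothetical helpers for illustration only; not part of rdataone or datapack.
ISDOCUMENTEDBY <- "http://purl.org/spar/cito/isDocumentedBy"

# Return pids that are the subject of an isDocumentedBy relationship
# but are not in the list of package members.
find_orphaned_pids <- function(relations, members, predicate = ISDOCUMENTEDBY) {
  documented <- relations$subject[relations$predicate == predicate]
  unique(documented[!documented %in% members])
}

# Drop every relationship that mentions an orphaned pid as subject or object.
remove_orphaned_relations <- function(relations, orphans) {
  keep <- !(relations$subject %in% orphans | relations$object %in% orphans)
  relations[keep, , drop = FALSE]
}

# Small example modeled on the resource map above (assumed data layout)
meta  <- "urn:uuid:825cee81-e676-4a58-9a32-054884376c0c"
data1 <- "urn:uuid:4abb0f9f-260c-4b43-a2a7-df7578703a82"
stale <- "urn:uuid:301e805a-66cf-41e1-99a4-2c459638802f"

relations <- data.frame(
  subject   = c(meta, data1, meta, stale),
  predicate = c("http://purl.org/spar/cito/documents", ISDOCUMENTEDBY,
                "http://purl.org/spar/cito/documents", ISDOCUMENTEDBY),
  object    = c(data1, meta, stale, meta),
  stringsAsFactors = FALSE
)
members <- c(meta, data1)

orphans <- find_orphaned_pids(relations, members)
if (length(orphans) > 0) {
  warning("Relationships reference pids that are not package members: ",
          paste(orphans, collapse = ", "))
}
repaired <- remove_orphaned_relations(relations, orphans)
```

With a real package, the inputs could presumably be taken from datapack's `getRelationships(dp)` and `getIdentifiers(dp)`, provided the returned columns and identifier forms match the assumptions above. Whether to only warn or to also repair before upload is a design choice left open here.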