NCEAS / arcticdatautils

Utility functions in R for processing data for the Arctic Data Center
https://nceas.github.io/arcticdatautils/
Apache License 2.0

generation of resource maps with invalid characters #51

Open jeanetteclark opened 6 years ago

jeanetteclark commented 6 years ago

@amoeba Edit: An MRE for this is:

library(arcticdatautils)
library(dataone)

mn <- MNode("https://dev.nceas.ucsb.edu/knb/d1/mn/v2")
pkg <- create_dummy_package(mn)
new_pkg <- publish_update(mn, pkg$metadata, pkg$resource_map, pkg$data)
cat(rawToChar(getObject(mn, new_pkg$resource_map)))

Noting the output:

<rdf:Description rdf:about="_:r1511139237r13053r1">
    <foaf:name rdf:resource="&quot;DataONE R Client&quot;^^&lt;http://www.w3.org/2001/XMLSchema#string"/>
</rdf:Description>

[original issue below]

library(dataone)
library(datapack)
library(arcticdatautils)
library(EML)

cn <- CNode('STAGING2')
mn <- getMNode(cn,"urn:node:mnTestKNB")

#write metadata and attach a data file in registry on dev.nceas.ucsb.edu
id <- 'knb.109096.1'

#read in registry EML
outpath <- '~/example.xml'
writeBin(getObject(mn, id), outpath)

#make edits and save EML
eml <- read_eml(outpath)
eml@dataset@abstract@para@.Data[[2]] <- new('para', .Data = 'and edited using R') #this does not actually change anything, annoyingly
write_eml(eml, outpath)

#get ids from initial submission
ids <- get_package(mn, id)
#update EML with new version using publish_update
id_new <- publish_update(mn, metadata_pid = ids$metadata, resource_map_pid = ids$resource_map, data_pids = ids$data, metadata_path = outpath)

These steps generate a resource map with this evidently problematic line:

  <rdf:Description rdf:about="https://cn-stage-2.test.dataone.org/cn/v1/resolve/knb.109095.1">
    <dcterms:identifier rdf:resource="&quot;knb.109095.1&quot;^^&lt;http://www.w3.org/2001/XMLSchema#string"/>
  </rdf:Description>

Link to package: https://dev.nceas.ucsb.edu/#view/urn:uuid:0c720fed-bbe9-4076-9e59-7636730b3d5a

Attached is a list of the 854 resource maps with this problem on the ADC. There are likely some on the KNB as well.

bad_rms_ADC.txt

amoeba commented 6 years ago

Awesome, thanks for the bug report. I'll take a look at this and see which piece of software this bug lives inside.

@jeanetteclark could you elaborate on this line of your code snippet?

eml@dataset@abstract@para@.Data[[2]] <- new('para', .Data = 'and edited using R') #this does not actually change anything, annoyingly

What does that mean?

amoeba commented 6 years ago

PS @jeanetteclark were you able to reproduce a resource map that had this bogus content?

<rdf:Description rdf:about="file:///tmp/RtmphWZjPl/_:r1510618411r30298r1">
    <foaf:name rdf:resource="file:///tmp/RtmphWZjPl/&quot;DataONE R Client&quot;^^&lt;http://www.w3.org/2001/XMLSchema#string"/>
</rdf:Description>

Specifically, the file:///tmp part

amoeba commented 6 years ago

Looks like the &quot; part of this bug is related to the custom resource map parsing routine I had to put in for arcticdatautils to support PROV a while back. This package uses that routine to update an existing resource map to the next version of the package. To do that, all the triples are loaded from the RDF/XML into a data.frame, and some simple logic updates only the triples relating to Data Packaging (basically: documented/isDocumentedBy & aggregates/aggregatedBy and some more) while leaving the rest in (e.g. PROV). The routine I'm using spits out these rows:

> statements
                 subject                                       predicate                                                       object
4  _:r1511136883r13440r1                  http://xmlns.com/foaf/0.1/name "DataONE R Client"^^<http://www.w3.org/2001/XMLSchema#string
22 _:r1511136883r13440r1 http://www.w3.org/1999/02/22-rdf-syntax-ns#type                               http://purl.org/dc/terms/Agent

If you take a look at the Object column, you'll see the text:

"DataONE R Client"^^<http://www.w3.org/2001/XMLSchema#string

which looks to be the cause of the &quot;. @gothub, you recently implemented some routine(s) similar to this in datapack. I haven't looked at them yet, but perhaps yours work better and arcticdatautils should switch to using them?
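
To make the distinction concrete, here is a minimal base R sketch of telling a typed literal apart from a URI or blank node in the Object column. parse_object() is a hypothetical helper written for illustration; it is not part of arcticdatautils or datapack, and it deliberately tolerates the missing closing ">" seen in the rows above.

```r
# Sketch only: parse_object() is a hypothetical helper, not a library function.
parse_object <- function(x) {
  # A typed literal looks like "value"^^<datatype>. The closing ">" is optional
  # here because the statements in this issue have it cut off.
  pattern <- '^"(.*)"\\^\\^<([^>]*)>?$'
  if (grepl(pattern, x)) {
    list(type = "literal",
         value = sub(pattern, "\\1", x),
         datatype = sub(pattern, "\\2", x))
  } else {
    # Anything else is treated as a URI or blank node.
    list(type = "resource", value = x, datatype = NA_character_)
  }
}

obj <- parse_object('"DataONE R Client"^^<http://www.w3.org/2001/XMLSchema#string')
obj$type      # "literal"
obj$value     # "DataONE R Client"
obj$datatype  # "http://www.w3.org/2001/XMLSchema#string"
```

A routine that made this check before serializing could write literals as element text with an rdf:datatype attribute instead of placing the whole string in rdf:resource.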

amoeba commented 6 years ago

Oh, and a PS: I didn't catch this bug during development/testing because I didn't try parsing registry-created resource maps and only tested on resource maps built in R. PPS: Added an MRE at the top which shows what's going on a little more clearly.

gothub commented 6 years ago

@amoeba the datapack resource map parsing routines parseRDF() and getTriples() currently handle PROV statements.

I've been thinking about a way to 'repair' the resource maps using datapack, but don't have all the details yet.

jeanetteclark commented 6 years ago

eml@dataset@abstract@para@.Data[[2]] <- new('para', .Data = 'and edited using R') #this does not actually change anything, annoyingly

I think this is a problem with my EML code. For some reason it does not add a new paragraph to the EML.

I haven't been able to reproduce anything with the file:// yet but I can try this morning...I have an idea about that one

amoeba commented 6 years ago

PPS: At least the &quot; regression was very likely introduced in https://github.com/NCEAS/arcticdatautils/tree/v0.5.4

amoeba commented 6 years ago

Oh, thanks @gothub. I had looked a few times before and not seen those methods. From their names, it sounds like they'll work nicely.

csjx commented 6 years ago

Here's the list of 1184 resource map identifiers and their upload dates. Each RDF document in this list has either file:// URIs or incorrect dcterms:identifier fields with &quot; entities in the statement.

pids-with-bad-ids-and-uris.txt

Note that this ticket is a duplicate of https://github.nceas.ucsb.edu/KNB/arctic-data/issues/247 which describes the same problem.

gothub commented 6 years ago

Does anyone have any clues where the 'file:///...' strings were introduced in this workflow? I haven't found it yet in the R client.

Also, I'm noticing that the dcterms:identifier triples have been converted to rdf:resource values, which is incorrect; they should be literals. Here is a sample of an incorrect one:

  <rdf:Description rdf:about="https://cn.dataone.org/cn/v1/resolve/arctic-data.10018.1">
    <dcterms:identifier rdf:resource="file:///tmp/RtmppO7bqc/&quot;arctic-data.10018.1&quot;^^&lt;http://www.w3.org/2001/XMLSchema#string"/>
  </rdf:Description>

Which should be

  <rdf:Description rdf:about="https://cn.dataone.org/cn/v1/resolve/arctic-data.10018.1">
    <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">arctic-data.10018.1</dcterms:identifier>
  </rdf:Description>
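
A minimal base R sketch for spotting both symptoms in the text of a resource map follows. find_bad_lines() is a hypothetical helper for illustration; it only scans lines of text and does not parse the RDF.

```r
# Sketch only: find_bad_lines() is a hypothetical helper, not a library function.
find_bad_lines <- function(rdf_lines) {
  # Symptom 1: a literal serialized into an rdf:resource attribute,
  # visible as an &quot; entity inside the attribute value.
  quoted <- grepl('rdf:resource="[^"]*&quot;', rdf_lines)
  # Symptom 2: a file:/// temp-directory prefix in rdf:about or rdf:resource.
  file_uri <- grepl('rdf:(about|resource)="file:///', rdf_lines)
  which(quoted | file_uri)
}

rdf_lines <- c(
  '<dcterms:identifier rdf:resource="file:///tmp/RtmppO7bqc/&quot;arctic-data.10018.1&quot;^^&lt;http://www.w3.org/2001/XMLSchema#string"/>',
  '<dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">arctic-data.10018.1</dcterms:identifier>'
)
find_bad_lines(rdf_lines)  # 1
```

For a real resource map, the lines could be obtained by splitting the output of rawToChar(getObject(mn, pid)) on newlines, as in the MRE at the top of this issue.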

csjx commented 6 years ago

@gothub My guess is that there is some processing code that is inadvertently passing the file object reference to arcticdatautils::update_resource_map() instead of the identifier of the object, and so R is trying to serialize a string from the object as best it can, and ends up spitting out the file:///... URI of the object. This similarly happens in Java when you try to call System.out.println(myObject) and you print the object's memory address rather than the object name. Of course I'm speculating, but that is the direction I would look here as a start.

amoeba commented 6 years ago

The relevant parts of arcticdatautils I'd blame for the bad behavior are these hacks of functions: https://github.com/NCEAS/arcticdatautils/blob/c0adccefb452c38320b00394d32a034537243fed/R/packaging.R#L1037 https://github.com/NCEAS/arcticdatautils/blob/c0adccefb452c38320b00394d32a034537243fed/R/packaging.R#L1105

parse_resource_map is particularly hackish

jagoldstein commented 6 years ago

https://arcticdata.io/catalog/#view/doi:10.18739/A2136S I was able to add prov to the package found here ^ even though the first version was submitted via the registry and I later updated it with arcticdatautils. Peter speculates that this worked because the RDF update was performed prior to June 2017.

jeanetteclark commented 6 years ago

I checked the old version of that RM and it did not have either the file:// or the &quot; strings.

csjx commented 6 years ago

I found 154 more in the KNB:

pids-with-bad-ids-and-uris-knb.txt

gothub commented 6 years ago

The R packages dataone and datapack are being updated to repair the problems that we have seen with resource maps.

The workflow in R would be:

d1c <- D1Client("PROD", "urn:node:ARCTIC")
pkg <- getDataPackage(d1c, id="resource_map_doi:10.18739/A2XT16", lazy=TRUE, limit="0MB", quiet=FALSE, repair=TRUE)

The package relationships would then be manually inspected to verify correctness.

pkg

Then the package is uploaded, with only the resource map being updated:

newId <- uploadDataPackage(d1c, pkg, quiet=FALSE)

Once this has been done for a representative sample of the affected resource maps, then the process can be automated to update the rest.
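
One way the automated pass could be structured is sketched below in base R. repair_all() and its repair_one argument are hypothetical names for illustration. repair_one would wrap the getDataPackage() and uploadDataPackage() calls shown above, assuming the repair argument ships as described, and would return the new resource map identifier as a string.

```r
# Sketch only: repair_all() and repair_one are hypothetical, not library functions.
repair_all <- function(pids, repair_one) {
  results <- lapply(pids, function(pid) {
    # Record failures instead of stopping, so one bad package
    # does not abort the whole batch.
    tryCatch(
      list(pid = pid, new_pid = repair_one(pid), error = NA_character_),
      error = function(e) {
        list(pid = pid, new_pid = NA_character_, error = conditionMessage(e))
      }
    )
  })
  # One row per input PID: the new identifier, or the error message.
  do.call(rbind, lapply(results, as.data.frame, stringsAsFactors = FALSE))
}

# Stand-in for the real repair step, used only to show the shape of the output.
fake_repair <- function(pid) {
  if (pid == "b") stop("boom")
  paste0(pid, "-v2")
}
out <- repair_all(c("a", "b"), fake_repair)
```

The PIDs could be read from one of the attached lists with readLines(), and the rows with a non-missing error column can be retried or inspected by hand.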

jagoldstein commented 6 years ago

@amoeba I am experimenting with updating EMLs with arcticdatautils both before and AFTER we have added prov. It seems the issue may only apply to RDFs that were updated via that library after June 2017.

A patch may be in order to prevent RDFs updated through arcticdatautils from inhibiting the addition of prov relationships. This may not be news to you nor very helpful info, but I am documenting my 2 cents here.

amoeba commented 6 years ago

Thanks @jagoldstein, that is news and is helpful. I'm not actively working on this but I'm watching this thread so the extra info is helpful.

dmullen17 commented 6 years ago

@gothub I found another way that the resource map error can crop up:
I uploaded a data package to the Arctic Data Center with arcticdatautils using publish_object and create_resource_map. The error is not present in this first resource map.
However, if I use publish_update to give the data package a DOI, the new resource map exhibits the error, even though this data package did not originate from the registry. The same error comes up if I publish an XML using a pre-generated DOI.