DataONEorg / rdataone

R package for reading and writing data at DataONE data repositories
http://doi.org/10.5063/F1M61H5X

EML schemaLocation format error upon using uploadDataPackage() #278

Closed earnaud closed 2 years ago

earnaud commented 3 years ago

Hello,

I ran into an error while using uploadDataPackage(): Error creating urn:uuid:6f251683-4642-4243-aee3-0cb0522dee15: Error inserting or updating document: urn:uuid:6f251683-4642-4243-aee3-0cb0522dee15 since <?xml version="1.0"?><error>SchemaLocation: schemaLocation value = 'https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd' must have even number of URI's.</error> Since the error seems to originate outside the scope of this package, it will be difficult for me to explore all of your code and fix the problem myself.

I tried to upload a metadata file (EML 2.2.0) and its two associated data files in a data package. I targeted the Arctic Data test repo (urn:node:mnTestARCTIC). The EML file was validated at the output of EML Assembly Line package.

I tried to set the format of the metadata part of the package to https://eml.ecoinformatics.org/eml-2.2.0, following the recommendation from Arctic Data support.

Find attached the files (.zip) and script (.txt for github support) used to reproduce the error.

dataone_issue_files.zip

script.txt

mbjones commented 3 years ago

Looking at your example document, it seems your xsi:schemaLocation attribute is missing a URI, as indicated in the error. The field xsi:schemaLocation is meant to take pairs of URIs, the first being the URI for the namespace, and the second being the URI for the location of the XSD file for that namespace. See the XML in a Nutshell explanation for details.

I think you can fix your document by changing the root element to:

<eml:eml
  xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2" 
  packageId="Test" 
  xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd" 
  system="uuid">
...
</eml:eml>

Note that I repeated the same URI twice in the pair, once as the namespace name, and once as the XSD location, separated by a space. Alternatively, you could omit xsi:schemaLocation entirely, as it is entirely optional and I think it is generally best practice for most processors to ignore it (which I have commented on elsewhere).
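
The pairing rule described above can be checked before uploading. The following is a minimal sketch in Python, not part of rdataone or of the repository's own validation; the function name parse_schema_location is hypothetical and introduced only for illustration. It assumes the URIs in the attribute value are separated by whitespace.

```python
from typing import List, Tuple


def parse_schema_location(value: str) -> List[Tuple[str, str]]:
    """Split an xsi:schemaLocation value into (namespace, location) pairs.

    Hypothetical helper for illustration; raises ValueError when the value
    does not contain an even number of URIs.
    """
    tokens = value.split()  # URIs are separated by whitespace
    if len(tokens) % 2 != 0:
        raise ValueError(
            f"schemaLocation must have an even number of URIs, got {len(tokens)}"
        )
    # Pair each namespace URI with the location hint that follows it.
    return list(zip(tokens[0::2], tokens[1::2]))


# The single-URI value quoted in the error message (rejected by the check).
bad = "https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd"

# The two-URI value from the suggested root element (parses to one pair).
good = (
    "https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd "
    "https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd"
)
```

Calling parse_schema_location(bad) raises a ValueError, while parse_schema_location(good) returns a list with a single pair. The check only counts URIs; it does not verify that the first URI of each pair is a real namespace.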

amoeba commented 3 years ago

Heya @earnaud, thanks for the report.

I wanted to follow up on this bit:

"The EML file was validated at the output of EML Assembly Line package."

I checked your script and I see a gsub call that seems like a likely suspect in producing the invalid EML you mention:

eml.format <- doc$schemaLocation |>
    gsub(pattern = "(eml-[0-9]+\\.[0-9]+\\.[0-9]+).+$", replacement = "\\1")

What is the bit attempting to do? If I run some test data through it, I see what looks like a cause for the issue. See here where I pass a reasonable and valid schemaLocation string into your gsub call and get invalid output:

> gsub("https://eml.ecoinformatics.org/eml-2.1.1 https://eml.ecoinformatics.org/eml-2.1.1", pattern = "(eml-[0-9]+\\.[0-9]+\\.[0-9]+).+$", replacement = "\\1")
[1] "https://eml.ecoinformatics.org/eml-2.1.1" # Should be a pair of strings
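
One way to avoid the truncation is to derive the format ID from the first URI only and leave the schemaLocation string itself untouched. This is a rough sketch in Python rather than R, assuming the goal is just to obtain an ID such as https://eml.ecoinformatics.org/eml-2.1.1; the names EML_ID and format_id_from_schema_location are hypothetical and not part of the script or of any package.

```python
import re

# Similar to the pattern in the script, but applied to a single URI and
# anchored, so that only a trailing path such as "/eml.xsd" is dropped.
EML_ID = re.compile(r"^(.*eml-[0-9]+\.[0-9]+\.[0-9]+)(?:/\S*)?$")


def format_id_from_schema_location(schema_location: str) -> str:
    """Return the EML format ID found in the first URI of schema_location.

    Hypothetical helper: the input string is never modified, so the pair of
    URIs stays intact in the document.
    """
    tokens = schema_location.split()
    if not tokens:
        raise ValueError("schemaLocation is empty")
    first_uri = tokens[0]  # first URI of the first pair
    match = EML_ID.match(first_uri)
    if match is None:
        raise ValueError(f"no EML version found in {first_uri!r}")
    return match.group(1)
```

The result is meant to be stored in a separate variable (eml.format in the script) instead of being written back over the document's schemaLocation.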

earnaud commented 3 years ago

Thanks again for taking the time on this one, @mbjones and @amoeba.

I did not know much about the xsi:schemaLocation attribute, and did not understand why it should hold pairs of URIs (this still confuses me, since both items in the pair are identical). But thanks for this insight, it helps greatly.

"What is the bit attempting to do? If I run some test data through it, I see what looks like a cause for the issue. See here where I pass a reasonable and valid schemaLocation string into your gsub call and get invalid output:"

I was trying to turn 'https://eml.ecoinformatics.org/eml-2.1.1/eml.xsd' into 'https://eml.ecoinformatics.org/eml-2.1.1' (which was recommended to me by the Arctic Data support team). However, I did not know that xsi:schemaLocation expects pairs of values. I will try to fix this.

amoeba commented 3 years ago

Gotcha.

The idea behind xsi:schemaLocation and it using pairs of strings is so you can specify the location (URL) of a schema separately from the identifier (URI) of that schema. Sometimes your schema is not located at its URI, you want to use a custom location somewhere else on the web, or you can even use a local copy. Ultimately, it's just a hint for XML processing tools and only some tools will even make use of it.

As an example of where the pair of values might deviate, we could specify a location on our own computer. If I have the EML 2.1.1 schema files at ~/eml/eml.xsd, I could use xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 ~/eml/eml.xsd".

Hope that helps. And do let us know how you get on.

mbjones commented 3 years ago

Slight syntactic update on Bryce's example, as the 2.1.1 namespace is not quite right -- it should be eml://ecoinformatics.org/eml-2.1.1 -- we switched to using https-based namespaces in EML 2.2.0. So the location hint would be xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 ~/eml/eml.xsd" for that version of EML. For version 2.2.0, it would be something like xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 ~/eml/xsd/eml.xsd".
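
To make the version difference concrete, here is a small Python sketch that builds the location hint from a namespace and an XSD location. The table EML_NAMESPACES and the function schema_location_hint are hypothetical names for illustration, and the table only covers the two versions mentioned in this thread.

```python
# Namespace URIs for the two EML versions discussed in this thread.
EML_NAMESPACES = {
    "2.1.1": "eml://ecoinformatics.org/eml-2.1.1",
    "2.2.0": "https://eml.ecoinformatics.org/eml-2.2.0",
}


def schema_location_hint(version: str, xsd_location: str) -> str:
    """Build an xsi:schemaLocation value: namespace, a space, then location.

    Hypothetical helper; raises KeyError for versions not in the table.
    """
    namespace = EML_NAMESPACES[version]
    return f"{namespace} {xsd_location}"


hint_211 = schema_location_hint("2.1.1", "~/eml/eml.xsd")
hint_220 = schema_location_hint("2.2.0", "~/eml/xsd/eml.xsd")
```

Both results contain exactly one namespace and one location, so they satisfy the even-number-of-URIs requirement from the original error.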

amoeba commented 3 years ago

Thanks @mbjones, I updated my example.

earnaud commented 3 years ago

I think this issue is fixed now, thanks to this update in EML Assembly Line.