Open antibozo opened 5 years ago
Hi there @antibozo, Thanks for the comments. I'll address them individually:
It appears that the EML specification defines a namespace for the root element of a document, but proceeds to use the empty namespace for every other element.
EML follows the Namespaces in XML recommendation, particularly section 6.2 regarding default namespaces. When you declare a namespace on an element, that namespace applies to all non-prefixed child elements. So, to be clear, we use the default namespace, not the "empty" namespace. XML parsers handle this quite well.
the namespace used for the root element uses an idiosyncratic scheme "eml:"
The namespace follows the spec in that it must be a URI. I'd agree that it is idiosyncratic (done 20 years ago when XML was fairly young). That said, it is basically just a URI string, and processors handle it just fine. I've seen the use of http: schemes, urn: schemes, and others. Section 1.1.2 of the URI RFC 3986 gives news:comp.infosystems.www.servers.unix and tel:+1-816-555-1212 as examples. I'll leave it to others to comment on whether or not we think this is "weird" enough to move to an https: scheme. I'd venture to say that most people over 20 look back on decisions they made 20 years prior with skepticism 😉.
the xsi:schemaLocation URI also uses this otherwise-undefined eml: scheme, making it unretrievable for any automated schema validation code
As I'm sure you are well aware, there are security implications of dereferencing XML Schema documents from the xsi:schemaLocation attribute, and there has been endless discussion of the practical threat of XML injection attacks. Authors of EML documents of course are free to use any xsi:schemaLocation values they feel are best for their needs, and so ultimately you can't rely on this field to actually resolve. In the DataONE network we see URIs in this field that are no longer maintained.
Suffice it to say that we (at NCEAS, I can't speak for other EML producers) have opted to use the xsi:schemaLocation as the intended "hint" to processors only. In our applications, we instead use locally cached and trusted copies of XML schemas in an XML Catalog, which is well supported by parsing libraries. Aside from the potential security issues, this allows us to use fixed copies of schemas, so that we can tell when an instance document claims to adhere to a particular namespaced schema but actually points to a modified copy of the schema in the xsi:schemaLocation field (as in cases where schema developers choose not to strongly version their published schemas).
As an aside, the value you suggest (xsi:schemaLocation="http://ecoinformatics.org/eml-2.1.1.xsd") would cause the processor to fail, since in XML-Schema-instance documents this must be a pair of strings separated by whitespace, indicating the namespace and the hint to its location, or multiples thereof.
Perhaps you were thinking of the use of schemaLocation in the context of an <include> element in XML Schema documents, which defines just the actual schema location, not a hint pair.
There's a good discussion distinguishing these two uses in the XML Schema Primer.
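The pair structure of xsi:schemaLocation described above can be illustrated with a small Python sketch. The function name parse_schema_location is hypothetical, and raising an error on an odd number of tokens is this sketch's own choice, not a claim about how any particular XML processor behaves:

```python
def parse_schema_location(value):
    """Split an xsi:schemaLocation value into (namespace, location hint) pairs."""
    tokens = value.split()  # the URIs are separated by arbitrary whitespace
    if len(tokens) % 2 != 0:
        # Assumption of this sketch: treat an unpaired URI as an error.
        raise ValueError("expected namespace/location pairs, got %d URI(s)" % len(tokens))
    return list(zip(tokens[0::2], tokens[1::2]))


# The value used in the EML documents discussed here forms one complete pair.
print(parse_schema_location("eml://ecoinformatics.org/eml-2.1.1 eml.xsd"))

# The single-URI value suggested in the issue has no namespace half.
try:
    parse_schema_location("http://ecoinformatics.org/eml-2.1.1.xsd")
except ValueError as error:
    print("rejected:", error)
```

A processor that only uses the attribute as a hint could instead ignore a malformed value; the strict behavior here is just to make the pairing rule visible.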
It appears that the EML specification defines a namespace for the root element of a document, but proceeds to use the empty namespace for every other element.
EML follows the Namespaces in XML recommendation, particularly section 6.2 regarding default namespaces. When you declare a namespace on an element, that namespace applies to all non-prefixed child elements. So, to be clear, we use the default namespace, not the "empty" namespace. XML parsers handle this quite well.
Actually, you do not use a default namespace. There is no default namespace. There is a namespace only on the root "eml" element.
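use XML::LibXML;
my $dom = XML::LibXML->load_xml(location => 'resource_map_doi_10_18739_A2222R550/science_metadata.xml'); # same file as in the xml_grep examples
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs('x', 'eml://ecoinformatics.org/eml-2.1.1');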
This code assigns the temporary prefix "x" to the namespace matching the EML URI. So now I can use that prefix to search for all elements in the EML namespace:
my @nodes = $xpc->findnodes('//x:eml');
This search works exactly as expected. I find one node, the root "eml" element.
@nodes = $xpc->findnodes('//x:dataset');
This search does not work as expected based on your description above. The expectation is that all of the nodes in the document have the default namespace, namely, "eml://ecoinformatics.org/eml-2.1.1", as above.
@nodes = $dom->findnodes('//dataset');
So now I am searching for the same nodes, but with no namespace. I get exactly one node, the dataset node.
print $nodes[0]->namespaceURI(), "\n";
And this results in an error, because, well, because the dataset node has no namespace.
This can also be shown on the Linux command line using xml_grep. A search for the root node finds it:
$ xml_grep //eml:eml resource_map_doi_10_18739_A2222R550/science_metadata.xml
<?xml version="1.0" ?>
<xml_grep version="0.7" date="Thu Mar 28 17:31:23 2019">
<file filename="resource_map_doi_10_18739_A2222R550/science_metadata.xml"><eml:eml packageId="doi:10.18739/A2222R550" system="https://arcticdata.io" xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
...
A search for the child dataset node does not:
$ xml_grep //eml:dataset resource_map_doi_10_18739_A2222R550/science_metadata.xml
$
Again, that's because the dataset node has no namespace. And searching for a non-namespaced dataset node finds it:
$ xml_grep //dataset resource_map_doi_10_18739_A2222R550/science_metadata.xml
<?xml version="1.0" ?>
<xml_grep version="0.7" date="Thu Mar 28 17:32:04 2019">
<file filename="/nodc/data/DataONE-ADC/ADC_packages_100/resource_map_doi_10_18739_A2222R550/science_metadata.xml">
<dataset>
<title>Atmospheric measurements via Multiple Axis Differential Optical Absorption Spectroscopy (MAXDOAS), Utqiagvik (Barrow), Alaska 2012-2018</title>
...
Aside from the xsi:schemaLocation error in antibozo's proffered declaration, it is correct. Here is another correct declaration that does not try to address his other observations:
<eml xmlns="eml://ecoinformatics.org/eml-2.1.1" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" packageId="doi:10.18739/A2222R550" system="https://arcticdata.io" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
When I change the declaration of the EML namespace thus, and remove the eml: prefix from the root node, then the namespace works as you think it should. That is, a search for '//x:eml' finds the root node, a search for '//x:dataset' finds the child "dataset" node of the root node, and a search for "//x:para" in my example document (the science metadata from doi:10.18739/A2222R550) finds 28 matches.
Another possible implementation is this:
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" packageId="doi:10.18739/A2222R550" system="https://arcticdata.io" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
<eml:dataset>
...
Where every node in the EML namespace has the prefix "eml". That is also a correct implementation.
In
<eml:eml
packageId="eml.1.1" system="knb"
xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
you aren't declaring a namespace for child elements; you are declaring only a namespace prefix, which you apply solely to the root element. The default namespace of child elements is unaffected by a declaration of this kind, and remains the empty namespace. To put child elements in the EML namespace in this syntax, you must add a namespace prefix to them just as you have done with the root element, e.g. eml:title.
To declare a default namespace, you must write something like:
<?xml version="1.0"?>
<eml
packageId="eml.1.1" system="knb"
xmlns="http://ecoinformatics.org/eml-2.1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://ecoinformatics.org/eml-2.1.1.xsd">
Note the xmlns=: this is what sets the default namespace. A namespace declared in this way applies to the element in which it is declared (unless that element specifies a namespace prefix) and all child elements that lack a namespace prefix.
To see this, take a sample document and pass it through something that prints out the namespace of each element.
Here is a version of the sample document found at the end of chapter 2 in your specification, with typos corrected ("" in lang attribute and bad closing tags on <para> and <keyword> elements):
<?xml version="1.0"?>
<eml:eml
packageId="eml.1.1" system="knb"
xml:lang="pt_BR"
xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
<dataset id="ds.1">
<!-- English title with Portuguese translation -->
<title xml:lang="en_US">
Sample Dataset Description
<value xml:lang="pt_BR">Exemplo Descrição Dataset</value>
</title>
...
<!-- Portuguese abstract with English translation -->
<abstract>
<para>
Neste exemplo, a tradução em Inglês é secundário
<value xml:lang="en_US">In this example, the English translation is secondary</value>
</para>
</abstract>
...
<!-- two keywords, each with an equivalent translation -->
<keywordSet>
<keyword keywordType="theme">
árvore
<value xml:lang="en_US">tree</value>
</keyword>
<keyword keywordType="theme">
água
<value xml:lang="en_US">water</value>
</keyword>
</keywordSet>
...
</dataset>
</eml:eml>
and here is a simple XSL that prints out the namespace URI and local name of each element in document order:
<?xml version='1.0' encoding='UTF-8'?>
<xsl:stylesheet
version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
>
<xsl:output method='text' media-type='text/plain' encoding='UTF-8' />
<xsl:template match='/'>
<xsl:apply-templates select='//*' />
</xsl:template>
<xsl:template match='*'>
<xsl:text>{</xsl:text>
<xsl:value-of select='namespace-uri()' />
<xsl:text>}</xsl:text>
<xsl:value-of select='local-name()' />
<xsl:text>&#10;</xsl:text> <!-- newline -->
</xsl:template>
</xsl:stylesheet>
Process the sample XML through this XSL and you will get:
{eml://ecoinformatics.org/eml-2.1.1}eml
{}dataset
{}title
{}value
{}abstract
{}para
{}value
{}keywordSet
{}keyword
{}value
{}keyword
{}value
in which you see that every element except the root eml element has an empty namespace.
If you correct the sample XML to define a default namespace, thus:
<?xml version="1.0"?>
<eml
packageId="eml.1.1" system="knb"
xml:lang="pt_BR"
xmlns="eml://ecoinformatics.org/eml-2.1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
<dataset id="ds.1">
<!-- English title with Portuguese translation -->
<title xml:lang="en_US">
Sample Dataset Description
<value xml:lang="pt_BR">Exemplo Descrição Dataset</value>
</title>
...
<!-- Portuguese abstract with English translation -->
<abstract>
<para>
Neste exemplo, a tradução em Inglês é secundário
<value xml:lang="en_US">In this example, the English translation is secondary</value>
</para>
</abstract>
...
<!-- two keywords, each with an equivalent translation -->
<keywordSet>
<keyword keywordType="theme">
árvore
<value xml:lang="en_US">tree</value>
</keyword>
<keyword keywordType="theme">
água
<value xml:lang="en_US">water</value>
</keyword>
</keywordSet>
...
</dataset>
</eml>
and process this through the same XSL, this yields:
{eml://ecoinformatics.org/eml-2.1.1}eml
{eml://ecoinformatics.org/eml-2.1.1}dataset
{eml://ecoinformatics.org/eml-2.1.1}title
{eml://ecoinformatics.org/eml-2.1.1}value
{eml://ecoinformatics.org/eml-2.1.1}abstract
{eml://ecoinformatics.org/eml-2.1.1}para
{eml://ecoinformatics.org/eml-2.1.1}value
{eml://ecoinformatics.org/eml-2.1.1}keywordSet
{eml://ecoinformatics.org/eml-2.1.1}keyword
{eml://ecoinformatics.org/eml-2.1.1}value
{eml://ecoinformatics.org/eml-2.1.1}keyword
{eml://ecoinformatics.org/eml-2.1.1}value
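The same comparison can be reproduced without an XSLT processor. The following is a minimal Python sketch using only the standard library's xml.etree.ElementTree; the two documents are shortened versions of the samples above, and the helper name qualified_names is hypothetical:

```python
import xml.etree.ElementTree as ET

# Prefix declared on the root only (the form used in the specification sample).
PREFIX_ONLY = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" packageId="eml.1.1" system="knb">
  <dataset id="ds.1"><title>Sample Dataset Description</title></dataset>
</eml:eml>"""

# Default namespace declared on the root (the corrected form).
DEFAULT_NS = """<eml xmlns="eml://ecoinformatics.org/eml-2.1.1" packageId="eml.1.1" system="knb">
  <dataset id="ds.1"><title>Sample Dataset Description</title></dataset>
</eml>"""


def qualified_names(xml_text):
    """Return '{namespace}localname' for every element, in document order."""
    names = []
    for element in ET.fromstring(xml_text).iter():
        tag = element.tag
        # ElementTree stores namespaced tags as '{uri}local' and bare tags as 'local'.
        names.append(tag if tag.startswith("{") else "{}" + tag)
    return names


for label, text in (("prefix only", PREFIX_ONLY), ("default namespace", DEFAULT_NS)):
    print(label)
    for name in qualified_names(text):
        print("  " + name)
```

With the prefix-only document, only the root carries the EML namespace URI; with the default-namespace document, every element does.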
Because of the way you have written this specification, along with EML documents we are actually observing, it is not possible when processing mixed documents to distinguish EML elements from elements in the empty namespace.
As for XML injection, you aren't going to protect people from XML injection by using weird schemes in your URIs. It is up to people to protect themselves with appropriate steps, such as disabling external entity resolution. And, after all, eml: might actually refer to an access protocol someday.
Regardless of the correct value for xsi:schemaLocation, the issue is the weird scheme you have included in the URI.
Yes, we have all done things 20 years ago that needed to be modified as things became clearer. But fixing mistakes is still a good thing. It is commonplace to update namespace URIs when new versions are created, so perhaps this is a good time to do so.
One other note: it is perfectly fine to use http, rather than https, in a namespace URI. It's an identifier, not necessarily a locator.
One other note about default namespaces:
Again, consider the sample document from chapter 2 of the specification:
<?xml version="1.0"?>
<eml:eml
packageId="eml.1.1" system="knb"
xml:lang="pt_BR"
xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
<dataset id="ds.1">
…
</dataset>
I wrote earlier that elements such as <dataset> are in the empty namespace in this example, but this is not strictly true: instead they are in the default namespace, which i am assuming to be empty. If the EML document is included in another document with its own default namespace, e.g.:
<?xml version="1.0"?>
<foo xmlns='http://foo.example.org'>
<eml:eml
packageId="eml.1.1" system="knb"
xml:lang="pt_BR"
xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
<dataset id="ds.1">
…
</dataset>
</eml:eml>
</foo>
then <dataset> will be found in the http://foo.example.org namespace, and processors that expect to find it in the empty namespace will fail.
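To make the scoping effect concrete, here is a small Python sketch using the standard library's xml.etree.ElementTree. The embedded document is a shortened version of the example above, and the variable names are illustrative only:

```python
import xml.etree.ElementTree as ET

# An EML record embedded in a wrapper that declares its own default namespace.
EMBEDDED = """<foo xmlns="http://foo.example.org">
  <eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" packageId="eml.1.1" system="knb">
    <dataset id="ds.1"/>
  </eml:eml>
</foo>"""

root = ET.fromstring(EMBEDDED)

# Tags in document order; the unprefixed <dataset> inherits the wrapper's default namespace.
tags = [element.tag for element in root.iter()]
print(tags)

# A processor that looks for <dataset> in the empty namespace finds nothing.
in_empty_namespace = root.find(".//dataset")
# Looking in the wrapper's namespace finds the element instead.
in_foo_namespace = root.find(".//{http://foo.example.org}dataset")
print(in_empty_namespace is None, in_foo_namespace is not None)
```

The unprefixed child ends up in whatever default namespace is in scope at the point of embedding, which is why the same EML fragment can resolve differently depending on its container.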
Hopefully it is altogether clear by now why this is a real flaw in the specification.
… and, following the concern about the weird scheme: again, it is possible eml: could become an access protocol or otherwise defined URI scheme someday as codified in an RFC having nothing to do with markup of ecological metadata. That RFC may constrain the syntax of any URI with an eml: scheme in a way that makes the URIs you have used in your specification non-compliant with an Internet standard, and at the very least will make these URIs meaningless or misleading.
Use http:. It is what one does.
Hi Jeff and John,
Ah, yes, I stand corrected on the default namespace issue. After reading your explanations, and looking at this more closely, I see that only the root element is namespaced in these instance documents, and that the children have no namespace. This perplexed me a bit since the Xerces-J parser we tend to use to validate these documents validates them fine, and in fact they are completely valid with regard to adhering to the EML schema. As you point out, the main issue is that we (Arctic Data Center) omitted the default namespace attribute on the <eml> element. While it doesn't matter for whole-document validation, it makes it harder to search for (as you pointed out), and to individually validate fragments of the document based on their intended schema sub-modules (for instance grabbing the /eml/dataset/coverage fragment and validating it against the eml-coverage.xsd schema).
So, my understanding is that this isn't really an EML schema issue, but it is rather an authoring issue on our part (and likely other EML document authors). When I add just the default namespace attribute of xmlns="eml://ecoinformatics.org/eml-2.1.1" to an instance document, and transform the document using Jeff's XSLT above using Xalan Java, I get:
{eml://ecoinformatics.org/eml-2.1.1}eml
{eml://ecoinformatics.org/eml-2.1.1}dataset
{eml://ecoinformatics.org/eml-2.1.1}title
{eml://ecoinformatics.org/eml-2.1.1}creator
{eml://ecoinformatics.org/eml-2.1.1}individualName
{eml://ecoinformatics.org/eml-2.1.1}givenName
{eml://ecoinformatics.org/eml-2.1.1}surName
{eml://ecoinformatics.org/eml-2.1.1}organizationName
{eml://ecoinformatics.org/eml-2.1.1}positionName
{eml://ecoinformatics.org/eml-2.1.1}address
{eml://ecoinformatics.org/eml-2.1.1}deliveryPoint
{eml://ecoinformatics.org/eml-2.1.1}city
...
It doesn't require any namespace change to the root <eml:eml> element, and validates as well. So, I'm wondering if your libxml2 library treats it any differently?
So, I will bring this up with the ADC group, and we'll discuss adding in the default namespace attribute in future documents. That said, for your processing purposes, I think you will need to inject the xmlns="eml://ecoinformatics.org/eml-2.1.1" attribute into your DOM prior to processing the document in your Perl code so that you can use find() as intended.
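As a different workaround from the attribute injection suggested above, a consumer can search by local name and accept an element in any namespace or none. This is only a sketch, shown in Python rather than Perl: it relies on the `{*}` wildcard that xml.etree.ElementTree supports from Python 3.8 onward, and the function name find_datasets and the two shortened documents are hypothetical:

```python
import xml.etree.ElementTree as ET

# Document as currently authored: only the root is in the EML namespace.
AS_AUTHORED = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
  <dataset id="ds.1"/>
</eml:eml>"""

# Document with the default namespace attribute added as well.
WITH_DEFAULT = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
                  xmlns="eml://ecoinformatics.org/eml-2.1.1">
  <dataset id="ds.1"/>
</eml:eml>"""


def find_datasets(xml_text):
    """Find <dataset> descendants whatever namespace (or none) they are in."""
    root = ET.fromstring(xml_text)
    # '{*}dataset' matches the local name 'dataset' in any namespace or no namespace.
    return root.findall(".//{*}dataset")


for text in (AS_AUTHORED, WITH_DEFAULT):
    print([element.tag for element in find_datasets(text)])
```

The trade-off is that this gives up the ability to tell an EML dataset element from a same-named element in some other vocabulary, which is the distinction namespaces exist to provide, so it suits transitional tooling more than a long-term fix.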
So, before we close this issue, I think the summary of the three items raised is:
http: as opposed to eml:. I'm fine with this in theory, but we need to hear from others in the community, particularly @mbjones, @vdave, @laurenwalker, @mpsaloha, @mobb, @amoeba, @cboettig, and many others that may have opinions and/or software that relies on the namespace scheme as it is currently. We would also need to discuss any backward-compatibility issues. Depending on the opinions of the community, we may create a new distinct ticket for this issue.
xsi:schemaLocation namespace/location pairs is a per-author decision, and can practically only be used as a hint to dereference a schema. I think this is more of a group-by-group decision, and so there's nothing to resolve here per se, although I certainly acknowledge the convenience of it, and the problems that it can raise as well.
Regardless of whether you make xsi:schemaLocation resolvable, i strongly advise you against using an undefined scheme that could someday become defined. If you don't want something that looks like a URL to resolve to an actual document, simply do not put an actual document at that URL. Or use a URI that is not a URL but conforms to an existing defined scheme such as urn:.
The lack of a default namespace means that all child elements in an EML document are in whatever namespace governs their context. The real problem here, however, is that the child elements are not in the originally intended namespace. Adding a default namespace is one way to change this, and using a namespace prefix is another; i'll discuss this further down.
The immediate problem is that any use case that uses namespaces to locate elements will have a problem with putting the child elements in the originally intended namespace. You have both a backward and forward compatibility issue. If someone has, for example, written an EML processor that searches for {}dataset elements because that's what has been coming over the transom until now, then that processor will stop finding the elements it was looking for because they are now {eml://ecoinformatics.org/eml-2.1.1}dataset elements. If, on the other hand, you leave things as they are, and someone is taking the declared intent in the specification that "The eml module is a wrapper container that allows the inclusion of any metadata content in a single EML document" at its word, and embedding an eml:eml inside, perhaps, another element that has a default namespace, say http://foo.example.com, perhaps inside another eml:eml element, then <dataset> inside that interior eml:eml element will be {http://foo.example.com}dataset, while <dataset> in the outer eml:eml element will be {}dataset.
The correct thing to do, i believe, is to explicitly say that all elements defined in the EML specification are in the EML namespace. The usual way to make this clear in a specification is to use a namespace prefix on every element, and not to rely on default namespaces, because default namespaces have scope. I.e., your sample record would look like:
<?xml version="1.0"?>
<eml:eml
packageId="eml.1.1" system="knb"
xml:lang="pt_BR"
xmlns:eml="http://ecoinformatics.org/eml-2.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://ecoinformatics.org/eml-2.2 eml.xsd">
<eml:dataset id="ds.1">
<eml:title xml:lang="en_US">
…
To be clear, this is going to break existing processors that were handling the EML records that actually exist by finding child elements in the empty namespace. You'll see in this example i have also increased the version number in the namespace URI to accommodate this.
Heh - disregard that last comment (I deleted it) - wrong ticket. 🙂
A couple of points of clarification, just so we are on the same page:
elementFormDefault=unqualified makes it so that child elements that are local to a complex type are in the empty namespace while still allowing validation to follow the content model. It essentially prevents authors from having to switch namespaces as they go up and down the EML tree. Due to these multiple namespaces, there is no single namespace that could correctly be the default with our current setup if we used a qualified default, and this is pretty awkward. I agree it would be good to change it.
I've spent some time looking at this proposal, and talking to a few folks about it. While our current use of elementFormDefault=unqualified is a perfectly legitimate use of XML Schema (and is the default), I think we agree that changing it to allow qualified child elements would be helpful. The challenge is in how to do this in as compatible a way as possible to avoid breaking existing tools that rely on the current approach in EML. In particular, most of our editing applications (MetacatUI, Morpho) and our display stylesheets in XSLT have hundreds of encoded XPaths that reference local child elements in the empty namespace that would need to be updated to be namespace aware should we make this change. Thus, my conclusion is that this change to qualified elements would be backwards incompatible, and so does not belong in the EML 2.2 release, which is meant to be fully compatible with previous releases.
That said, I think it's a good idea for a 3.0.0 release, and here is what I think we should change.
elementFormDefault to qualified
I think at that point, people could set a default namespace on their documents and it would be used for any elements not explicitly prefixed. The only place it would need to be changed would be in additionalMetadata when other namespaces are used such as STMML. That might cause some validation issues with existing documents that we would need to thoroughly test, but I think it would be workable.
One outstanding question for me is:
qualified? That has slightly different semantics and implications than elementFormDefault.
In any case, due to the compatibility issue, I will retarget this to milestone 3.0.0.
Comments appreciated.
It appears that the EML specification defines a namespace for the root element of a document, but proceeds to use the empty namespace for every other element. This is very weird, and makes the use of any namespace at all seemingly useless.
In addition, the namespace used for the root element uses an idiosyncratic scheme "eml:". This is also very weird.
Furthermore, the xsi:schemaLocation URI also uses this otherwise-undefined eml: scheme, making it unretrievable for any automated schema validation code.
Example from current spec:
Here the root element is {eml://ecoinformatics.org/eml-2.1.1}eml, but the first child is {}dataset.
The normal way to do this, which doesn't perplex everyone who tries to process an XML document written to spec, would be:
Here, all of the elements, from the root down, are in the {http://ecoinformatics.org/eml-2.1.1} namespace. This assures that elements intended to mean EML things can be distinguished from elements with the same local name that mean something else, e.g. "title".
Please rewrite the spec to use namespaces in a non-eyebrow-raising manner, or explain why you have written it as it is.