NCEAS / eml

Ecological Metadata Language (EML)
https://eml.ecoinformatics.org/
GNU General Public License v2.0

change EML namespace and elementFormDefault #334

Open antibozo opened 5 years ago

antibozo commented 5 years ago

It appears that the EML specification defines a namespace for the root element of a document, but proceeds to use the empty namespace for every other element. This is very weird, and makes the use of any namespace at all seemingly useless.

In addition, the namespace used for the root element uses an idiosyncratic scheme "eml:". This is also very weird.

Furthermore, the xsi:schemaLocation URI also uses this otherwise-undefined eml: scheme, making it unretrievable for any automated schema validation code.

Example from current spec:

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>

Here the root element is {eml://ecoinformatics.org/eml-2.1.1}eml, but the first child is {}dataset.

The normal way to do this, which doesn't perplex everyone who tries to process an XML document written to spec, would be:

<?xml version="1.0"?>
<eml
    packageId="eml.1.1" system="knb"
    xmlns="http://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://ecoinformatics.org/eml-2.1.1.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>

Here, all of the elements, from the root down, are in the {http://ecoinformatics.org/eml-2.1.1} namespace. This ensures that elements intended to mean EML things can be distinguished from elements with the same local name that mean something else, e.g. "title".

Please rewrite the spec to use namespaces in a non-eyebrow-raising manner, or explain why you have written it as it is.

csjx commented 5 years ago

Hi there @antibozo, Thanks for the comments. I'll address them individually:

It appears that the EML specification defines a namespace for the root element of a document, but proceeds to use the empty namespace for every other element.

EML follows the Namespaces in XML recommendation, particularly section 6.2 regarding default namespaces. When you declare a namespace on an element, that namespace applies to all non-prefixed child elements. So, to be clear, we use the default namespace, not the "empty" namespace. XML parsers handle this quite well.

the namespace used for the root element uses an idiosyncratic scheme "eml:"

The namespace follows the spec in that it must be a URI. I'd agree that it is idiosyncratic (done 20 years ago when XML was fairly young). That said, it is basically just a URI string, and processors handle it just fine. I've seen the use of http: schemes, urn: schemes, and others. In section 1.1.2 of the URI RFC 3986, it gives news:comp.infosystems.www.servers.unix and tel:+1-816-555-1212 as examples. I'll leave this to others to comment on whether or not we think this is "weird" enough to move to an https: scheme. I'd venture to say that most people over 20 look back on decisions they made 20 years prior with skepticism 😉.

the xsi:schemaLocation URI also uses this otherwise-undefined eml: scheme, making it unretrievable for any automated schema validation code

As I'm sure you are well aware, there are security implications of dereferencing XML Schema documents from the xsi:schemaLocation attribute, and there has been endless discussion of the practical threat of XML injection attacks. Authors of EML documents of course are free to use any xsi:schemaLocation values they feel are best for their needs, and so ultimately you can't rely on this field to actually resolve. In the DataONE network we see URIs in this field that are no longer maintained.

Suffice it to say that we (at NCEAS, I can't speak for other EML producers) have opted to use the xsi:schemaLocation as the intended "hint" to processors only. In our applications, we instead use locally cached and trusted copies of XML schemas in an XML Catalog which is well supported by parsing libraries. Aside from the potential security issues, this allows us to use fixed copies of schemas so that we know when an instance document claims to be adhering to a particular namespace'd schema, but actually points to a modified copy of the schema in the xsi:schemaLocation field (as in cases where schema developers choose to not strongly version their published schemas).
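The "trusted local copy" approach can be illustrated with a rough Python sketch. It is not a real XML Catalog implementation: the dictionary `LOCAL_SCHEMAS`, the path `/opt/schemas/eml-2.1.1/eml.xsd`, the host `untrusted.example.org`, and the function names are all hypothetical, chosen only to show the schema being selected by namespace while the xsi:schemaLocation hint is ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an XML Catalog:
# namespace URI -> path of a locally vetted schema copy.
LOCAL_SCHEMAS = {
    "eml://ecoinformatics.org/eml-2.1.1": "/opt/schemas/eml-2.1.1/eml.xsd",
}

def namespace_of(tag):
    """Extract the namespace URI from an ElementTree tag ('{uri}local')."""
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""

def trusted_schema_for(xml_text):
    """Pick the schema by root namespace; the xsi:schemaLocation hint is ignored."""
    root = ET.fromstring(xml_text)
    ns = namespace_of(root.tag)
    if ns not in LOCAL_SCHEMAS:
        raise LookupError(f"no trusted schema cached for namespace {ns!r}")
    return LOCAL_SCHEMAS[ns]

# The hint below points somewhere we do not want to fetch from.
doc = (
    '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 '
    'http://untrusted.example.org/eml.xsd"/>'
)
schema_path = trusted_schema_for(doc)
print(schema_path)
```

The lookup keys on the root element's namespace only, so a document that claims the EML namespace is always checked against the cached copy, whatever location it advertises.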

As an aside, the value you suggest (xsi:schemaLocation="http://ecoinformatics.org/eml-2.1.1.xsd") would cause the processor to fail, since in XML-Schema-instance documents this must be a pair of strings separated by whitespace, indicating the namespace and the hint to its location, or multiples thereof. Perhaps you were thinking of the use of schemaLocation in the context of an <include> element in XML Schema documents, which defines just the actual schema location, not a hint pair. There's a good discussion distinguishing these two uses in the XML Schema Primer.
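The pair structure described here can be checked mechanically. A minimal Python sketch, standard library only; the function name `parse_schema_location` is made up for illustration, and raising an error on an odd token count is this sketch's choice rather than the behavior of any particular validator:

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

def parse_schema_location(root):
    """Return [(namespace, location_hint), ...] from xsi:schemaLocation."""
    tokens = root.get(f"{{{XSI}}}schemaLocation", "").split()
    if len(tokens) % 2 != 0:
        raise ValueError("xsi:schemaLocation must hold namespace/location pairs")
    return list(zip(tokens[0::2], tokens[1::2]))

# Value from the current spec: one namespace, one hint.
good = ET.fromstring(
    '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" '
    f'xmlns:xsi="{XSI}" '
    'xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd"/>'
)
# Value from the proposed declaration: a single token, so no pair.
bad = ET.fromstring(
    '<eml xmlns="http://ecoinformatics.org/eml-2.1.1" '
    f'xmlns:xsi="{XSI}" '
    'xsi:schemaLocation="http://ecoinformatics.org/eml-2.1.1.xsd"/>'
)

pairs = parse_schema_location(good)
print(pairs)
```

Calling `parse_schema_location(bad)` raises `ValueError`, because the single-token value cannot be split into a namespace and a location hint.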

relphj commented 5 years ago

It appears that the EML specification defines a namespace for the root element of a document, but proceeds to use the empty namespace for every other element.

EML follows the Namespaces in XML recommendation, particularly section 6.2 regarding default namespaces. When you declare a namespace on an element, that namespace applies to all non-prefixed child elements. So, to be clear, we use the default namespace, not the "empty" namespace. XML parsers handle this quite well.

Actually, you do not use a default namespace. There is no default namespace. There is a namespace only on the root "<eml:eml>" element. Here is some code demonstrating that, using the LibXML2 library to parse the XML:

        my $xpc = XML::LibXML::XPathContext->new($dom);
        $xpc->registerNs('x', 'eml://ecoinformatics.org/eml-2.1.1');

This code assigns the temporary prefix "x" to the namespace matching the EML URI. So now I can use that prefix to search for all elements in the EML namespace:

        my @nodes = $xpc->findnodes('//x:eml');

This search works exactly as expected. I find one node, the root "<eml:eml>" node of the document.

        @nodes = $xpc->findnodes('//x:dataset');

This search finds nothing, which is not what your description above predicts. By that description, every node in the document would be in the default namespace, namely "eml://ecoinformatics.org/eml-2.1.1", as above.

        @nodes = $dom->findnodes('//dataset');

So now I am searching for the same nodes, but with no namespace. I get exactly one node, the <dataset> node child of the <eml:eml> node.

        print $nodes[0]->namespaceURI(), "\n";

And this results in an error because the <dataset> node, like every other node below the root, does not have a namespace.
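For readers without Perl at hand, the same three searches can be approximated in Python with the standard library. This is a sketch on a cut-down copy of the spec's sample document, not the Arctic Data Center record used above; `Element.iter()` is used instead of XPath so that the root element itself is included in the search.

```python
import xml.etree.ElementTree as ET

EML = "eml://ecoinformatics.org/eml-2.1.1"

# Cut-down version of the sample from the spec: prefix on the root only.
DOC = (
    '<eml:eml packageId="eml.1.1" system="knb" '
    'xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">'
    '<dataset id="ds.1"><title>Sample Dataset Description</title></dataset>'
    '</eml:eml>'
)
root = ET.fromstring(DOC)

# ElementTree spells a qualified name as '{namespace-uri}local-name'.
eml_in_ns = list(root.iter(f"{{{EML}}}eml"))          # like //x:eml
dataset_in_ns = list(root.iter(f"{{{EML}}}dataset"))  # like //x:dataset
dataset_no_ns = list(root.iter("dataset"))            # like //dataset

print(len(eml_in_ns), len(dataset_in_ns), len(dataset_no_ns))
```

This prints `1 0 1`: the root is found in the EML namespace, `dataset` is not, and `dataset` is found only when searched for with no namespace.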

This can also be shown on the Linux command line using xml_grep. A search for the root node finds it:

$ xml_grep //eml:eml resource_map_doi_10_18739_A2222R550/science_metadata.xml
<?xml version="1.0" ?>
<xml_grep version="0.7" date="Thu Mar 28 17:31:23 2019">
<file filename="resource_map_doi_10_18739_A2222R550/science_metadata.xml"><eml:eml packageId="doi:10.18739/A2222R550" system="https://arcticdata.io" xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
...

A search for the child dataset node does not:

$ xml_grep //eml:dataset resource_map_doi_10_18739_A2222R550/science_metadata.xml
$

Again, that's because the dataset node has no namespace. And searching for a non-namespaced dataset node finds it:

$ xml_grep //dataset resource_map_doi_10_18739_A2222R550/science_metadata.xml
<?xml version="1.0" ?>
<xml_grep version="0.7" date="Thu Mar 28 17:32:04 2019">
<file filename="/nodc/data/DataONE-ADC/ADC_packages_100/resource_map_doi_10_18739_A2222R550/science_metadata.xml">
  <dataset>
    <title>Atmospheric measurements via Multiple Axis Differential Optical Absorption Spectroscopy (MAXDOAS), Utqiagvik (Barrow), Alaska 2012-2018</title>
...

Aside from the xsi:schemaLocation error in antibozo's proffered declaration, it is correct. Here is another correct declaration that does not try to address his other observations:

<eml xmlns="eml://ecoinformatics.org/eml-2.1.1" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" packageId="doi:10.18739/A2222R550" system="https://arcticdata.io" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">

When I change the declaration of the EML namespace thus, and remove the eml: prefix from the root node, then the namespace works as you think it should. That is, a search for '//x:eml' finds the root node, a search for '//x:dataset' finds the child "dataset" node of the root node, and a search for "//x:para" in my example document (the science metadata from doi:10.18739/A2222R550) finds 28 matches.

Another possible implementation is this:

<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" packageId="doi:10.18739/A2222R550" system="https://arcticdata.io" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <eml:dataset>
   ...

Where every node in the EML namespace has the prefix "eml". That is also a correct implementation.

antibozo commented 5 years ago

In

<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">

you aren't declaring a namespace for child elements; you are declaring only a namespace prefix, which you apply solely to the root element. The default namespace of child elements is unaffected by a declaration of this kind, and remains the empty namespace. To put child elements in the EML namespace in this syntax, you must add a namespace prefix to them just as you have done with the root element, e.g. eml:title.

To declare a default namespace, you must write something like:

<?xml version="1.0"?>
<eml
    packageId="eml.1.1" system="knb"
    xmlns="http://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://ecoinformatics.org/eml-2.1.1.xsd">

Note the xmlns=: this is what sets the default namespace. A namespace declared in this way applies to the element in which it is declared (unless that element specifies a namespace prefix) and all child elements that lack a namespace prefix.

To see this, take a sample document and pass it through something that prints out the namespace of each element.

Here is a version of the sample document found at the end of chapter 2 in your specification, with typos corrected ("" in lang attribute and bad closing tags on <para> and <keyword> elements):

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb" 
    xml:lang="pt_BR"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">

  <dataset id="ds.1">

    <!-- English title with Portuguese translation -->    
    <title xml:lang="en_US">
        Sample Dataset Description
        <value xml:lang="pt_BR">Exemplo Descrição Dataset</value>
    </title>
    ...
    <!-- Portuguese abstract with English translation -->    
    <abstract>
        <para>
                Neste exemplo, a tradução em Inglês é secundário
                <value xml:lang="en_US">In this example, the English translation is secondary</value>
        </para>
    </abstract>
    ...
    <!-- two keywords, each with an equivalent translation -->    
    <keywordSet>
        <keyword keywordType="theme">
                árvore
                <value xml:lang="en_US">tree</value>
        </keyword>
        <keyword keywordType="theme">
                água
                <value xml:lang="en_US">water</value>
        </keyword>
    </keywordSet>
    ...
  </dataset>
</eml:eml>

and here is a simple XSL that prints out the namespace URI and local name of each element in document order:

<?xml version='1.0' encoding='UTF-8'?>
<xsl:stylesheet
  version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
  >

  <xsl:output method='text' media-type='text/plain' encoding='UTF-8' />

  <xsl:template match='/'>
    <xsl:apply-templates select='//*' />
  </xsl:template>

  <xsl:template match='*'>
    <xsl:text>{</xsl:text>
    <xsl:value-of select='namespace-uri()' />
    <xsl:text>}</xsl:text>
    <xsl:value-of select='local-name()' />
    <xsl:text>&#10;</xsl:text> <!-- newline -->
  </xsl:template>

</xsl:stylesheet>

Process the sample XML through this XSL and you will get:

{eml://ecoinformatics.org/eml-2.1.1}eml
{}dataset
{}title
{}value
{}abstract
{}para
{}value
{}keywordSet
{}keyword
{}value
{}keyword
{}value

in which you see that every element except the root eml element has an empty namespace.
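If no XSLT processor is available, roughly the same listing can be produced with Python's standard library. This is a sketch that mirrors the stylesheet's output format, run on a shortened copy of the sample; the helper name `clark_names` is made up for illustration.

```python
import xml.etree.ElementTree as ET

def clark_names(xml_text):
    """Return '{namespace}local' for every element, in document order."""
    names = []
    # The default parser drops comments, so every tag here is a string.
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag.startswith("{"):
            uri, local = elem.tag[1:].split("}", 1)
        else:
            uri, local = "", elem.tag  # element is in no namespace
        names.append("{%s}%s" % (uri, local))
    return names

# Shortened sample: namespace prefix declared and used on the root only.
PREFIX_ONLY = (
    '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">'
    '<dataset id="ds.1"><title>Sample Dataset Description</title></dataset>'
    '</eml:eml>'
)
names = clark_names(PREFIX_ONLY)
print("\n".join(names))
```

For this input the output is `{eml://ecoinformatics.org/eml-2.1.1}eml`, `{}dataset`, `{}title`, one per line, matching the pattern in the XSLT output above.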

If you correct the sample XML to define a default namespace, thus:

<?xml version="1.0"?>
<eml
    packageId="eml.1.1" system="knb" 
    xml:lang="pt_BR"
    xmlns="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">

  <dataset id="ds.1">

    <!-- English title with Portuguese translation -->    
    <title xml:lang="en_US">
        Sample Dataset Description
        <value xml:lang="pt_BR">Exemplo Descrição Dataset</value>
    </title>
    ...
    <!-- Portuguese abstract with English translation -->    
    <abstract>
        <para>
                Neste exemplo, a tradução em Inglês é secundário
                <value xml:lang="en_US">In this example, the English translation is secondary</value>
        </para>
    </abstract>
    ...
    <!-- two keywords, each with an equivalent translation -->    
    <keywordSet>
        <keyword keywordType="theme">
                árvore
                <value xml:lang="en_US">tree</value>
        </keyword>
        <keyword keywordType="theme">
                água
                <value xml:lang="en_US">water</value>
        </keyword>
    </keywordSet>
    ...
  </dataset>
</eml>

and process this through the same XSL, you get:

{eml://ecoinformatics.org/eml-2.1.1}eml
{eml://ecoinformatics.org/eml-2.1.1}dataset
{eml://ecoinformatics.org/eml-2.1.1}title
{eml://ecoinformatics.org/eml-2.1.1}value
{eml://ecoinformatics.org/eml-2.1.1}abstract
{eml://ecoinformatics.org/eml-2.1.1}para
{eml://ecoinformatics.org/eml-2.1.1}value
{eml://ecoinformatics.org/eml-2.1.1}keywordSet
{eml://ecoinformatics.org/eml-2.1.1}keyword
{eml://ecoinformatics.org/eml-2.1.1}value
{eml://ecoinformatics.org/eml-2.1.1}keyword
{eml://ecoinformatics.org/eml-2.1.1}value

Because of the way you have written this specification, and given the EML documents we are actually observing, it is not possible when processing mixed documents to distinguish EML elements from elements in the empty namespace.

As for XML injection, you aren't going to protect people from XML injection by using weird schemes in your URIs. It is up to people to protect themselves with appropriate steps, such as disabling external entity resolution. And, after all, eml: might actually refer to an access protocol someday.

Regardless of the correct value for xsi:schemaLocation, the issue is the weird scheme you have included in the URI.

Yes, we have all done things 20 years ago that needed to be modified as things became clearer. But fixing mistakes is still a good thing. It is commonplace to update namespace URIs when new versions are created, so perhaps this is a good time to do so.

One other note: it is perfectly fine to use http, rather than https, in a namespace URI. It's an identifier, not necessarily a locator.

antibozo commented 5 years ago

One other note about default namespaces:

Again, consider the sample document from chapter 2 of the specification:

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb" 
    xml:lang="pt_BR"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">

  <dataset id="ds.1">
  …
  </dataset>

I wrote earlier that elements such as <dataset> are in the empty namespace in this example, but this is not strictly true: instead they are in the default namespace, which i am assuming to be empty. If the EML document is included in another document with its own default namespace, e.g.:

<?xml version="1.0"?>
<foo xmlns='http://foo.example.org'>
  <eml:eml
    packageId="eml.1.1" system="knb" 
    xml:lang="pt_BR"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">

    <dataset id="ds.1">
    …
    </dataset>
  </eml:eml>
</foo>

then <dataset> will be found in the http://foo.example.org namespace, and processors that expect to find it in the empty namespace will fail.
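The scoping effect is easy to confirm. A small Python sketch, standard library only, using a shortened form of the embedded example; the namespace http://foo.example.org is the placeholder from the example above, not a real vocabulary.

```python
import xml.etree.ElementTree as ET

# EML root with a prefix, embedded under a parent that sets a default namespace.
EMBEDDED = (
    '<foo xmlns="http://foo.example.org">'
    '<eml:eml packageId="eml.1.1" system="knb" '
    'xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">'
    '<dataset id="ds.1"/>'
    '</eml:eml>'
    '</foo>'
)

root = ET.fromstring(EMBEDDED)
tags = [elem.tag for elem in root.iter()]
print(tags)
```

The unprefixed `dataset` comes out as `{http://foo.example.org}dataset`: it inherits the default namespace of the enclosing `foo` element, not the namespace of its `eml:eml` parent.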

Hopefully it is altogether clear by now why this is a real flaw in the specification.

antibozo commented 5 years ago

… and, following the concern about the weird scheme: again, it is possible eml: could become an access protocol or otherwise defined URI scheme someday as codified in an RFC having nothing to do with markup of ecological metadata. That RFC may constrain the syntax of any URI with an eml: scheme in a way that makes the URIs you have used in your specification non-compliant with an Internet standard, and at the very least will make these URIs meaningless or misleading.

Use http:. It is what one does.

csjx commented 5 years ago

Hi Jeff and John,

Ah, yes, I stand corrected on the default namespace issue. After reading your explanations, and looking at this more closely, I see that only the root element is namespaced in these instance documents, and that the children have no namespace. This perplexed me a bit since the Xerces-J parser we tend to use to validate these documents validates them fine, and in fact they are completely valid with regard to adhering to the EML schema.

As you point out, the main issue is that we (Arctic Data Center) omitted the default namespace attribute on the <eml> element. While it doesn't matter for whole-document validation, it makes the documents harder to search (as you pointed out), and harder to individually validate fragments of the document based on their intended schema sub-modules (for instance grabbing the /eml/dataset/coverage fragment and validating it against the eml-coverage.xsd schema).

So, my understanding is that this isn't really an EML schema issue, but it is rather an authoring issue on our part (and likely other EML document authors). When I add just the default namespace attribute of xmlns="eml://ecoinformatics.org/eml-2.1.1" to an instance document, and transform the document using Jeff's XSLT above using Xalan Java, I get:

{eml://ecoinformatics.org/eml-2.1.1}eml
{eml://ecoinformatics.org/eml-2.1.1}dataset
{eml://ecoinformatics.org/eml-2.1.1}title
{eml://ecoinformatics.org/eml-2.1.1}creator
{eml://ecoinformatics.org/eml-2.1.1}individualName
{eml://ecoinformatics.org/eml-2.1.1}givenName
{eml://ecoinformatics.org/eml-2.1.1}surName
{eml://ecoinformatics.org/eml-2.1.1}organizationName
{eml://ecoinformatics.org/eml-2.1.1}positionName
{eml://ecoinformatics.org/eml-2.1.1}address
{eml://ecoinformatics.org/eml-2.1.1}deliveryPoint
{eml://ecoinformatics.org/eml-2.1.1}city
...

It doesn't require any namespace change to the root <eml:eml> element, and validates as well. So, I'm wondering if your libxml2 library treats it any differently?

So, I will bring this up with the ADC group, and we'll discuss adding in the default namespace attribute in future documents. That said, for your processing purposes, I think you will need to inject the xmlns="eml://ecoinformatics.org/eml-2.1.1" attribute into your DOM prior to processing the document in your Perl code so that you can use find() as intended.
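A consumer-side workaround along these lines might look like the following Python sketch. Note that it is not literally "injecting an xmlns attribute": in Python's ElementTree the namespace is part of each element's tag, so the sketch rewrites the tags of un-namespaced elements after parsing. The function name `adopt_into_eml_namespace` is hypothetical.

```python
import xml.etree.ElementTree as ET

EML = "eml://ecoinformatics.org/eml-2.1.1"

def adopt_into_eml_namespace(root):
    """Move every un-namespaced element under `root` into the EML namespace (in place)."""
    for elem in root.iter():
        if not elem.tag.startswith("{"):
            elem.tag = f"{{{EML}}}{elem.tag}"
    return root

root = ET.fromstring(
    '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">'
    '<dataset id="ds.1"><title>Sample Dataset Description</title></dataset>'
    '</eml:eml>'
)
adopt_into_eml_namespace(root)

# A namespace-aware search now finds the child element.
found = root.findall(".//x:dataset", {"x": EML})
print(len(found))
```

Elements that already carry a namespace are left alone, but any genuinely un-namespaced foreign content would be swept into the EML namespace too, so this is only reasonable for documents known to contain EML throughout.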

So, before we close this issue, I think the summary of the three items raised is:

  1. There is a proposal here to change future EML releases to use a namespace URI scheme beginning with http: as opposed to eml:. I'm fine with this in theory, but we need to hear from others in the community, particularly @mbjones, @vdave, @laurenwalker, @mpsaloha, @mobb, @amoeba , @cboettig, and many others that may have both opinions and/or software that relies on the namespace scheme as it is currently. We would also need to discuss any backward-compatibility issues. Depending on the opinions of the community, we may create a new distinct ticket for this issue.
  2. The use of resolvable xsi:schemaLocation namespace/location pairs is a per-author decision, and can practically only be used as a hint to dereference a schema. I think this is more of a group-by-group decision, and so there's nothing to resolve here per se, although I certainly acknowledge the convenience of it, and the problems that it can raise as well.
  3. The lack of a default namespace attribute in instance documents you are processing is problematic for your use case, but there's nothing inherently wrong with the EML schema, nor the instance documents from a validation perspective. This, too, is a group-by-group authoring issue, and so can't be resolved here. That said, as you point out, the default namespace is intended to be the EML namespace, and setting it as such would likely help you and other consumers of EML documents. So, I will bring this up with groups I'm aware of that produce EML, and I invite others to comment on this as well to help come to a better community authoring convention.

Again, thanks for your insightful comments, persistence, and clear examples.

antibozo commented 5 years ago

  1. Regardless of whether you make xsi:schemaLocation resolvable, i strongly advise you against using an undefined scheme that could someday become defined. If you don't want something that looks like a URL to resolve to an actual document, simply do not put an actual document at that URL. Or use a URI that is not a URL but conforms to an existing defined scheme such as urn:.

  2. The lack of a default namespace means that all child elements in an EML document are in whatever namespace governs their context. The real problem here, however, is that the child elements are not in the originally intended namespace. Adding a default namespace is one way to change this, and using a namespace prefix is another; i'll discuss this further down.

The immediate problem is that any use case that uses namespaces to locate elements will have a problem with putting the child elements in the originally intended namespace. You have both a backward and forward compatibility issue. If someone has, for example, written an EML processor that searches for {}dataset elements because that's what has been coming over the transom until now, then that processor will stop finding the elements it was looking for because they are now {eml://ecoinformatics.org/eml-2.1.1}dataset elements. If, on the other hand, you leave things as they are, and someone is taking the declared intent in the specification that "The eml module is a wrapper container that allows the inclusion of any metadata content in a single EML document" at its word, and embedding an eml:eml inside, perhaps, another element that has a default namespace, say http://foo.example.com, perhaps inside another eml:eml element, then <dataset> inside that interior eml:eml element will be {http://foo.example.com}dataset, while <dataset> in the outer eml:eml element will be {}dataset.
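During any transition, a processor could be written to accept both forms. A minimal Python sketch, assuming the only two spellings of interest are the un-namespaced child and the EML-namespaced child; the function name `find_datasets` is made up for illustration.

```python
import xml.etree.ElementTree as ET

EML = "eml://ecoinformatics.org/eml-2.1.1"

def find_datasets(root):
    """Find <dataset> children of `root`, qualified with the EML namespace or not."""
    accepted = ("dataset", f"{{{EML}}}dataset")
    return [child for child in root if child.tag in accepted]

# Documents as they exist today: children in no namespace.
old_style = ET.fromstring(
    '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">'
    '<dataset id="ds.1"/></eml:eml>'
)
# Documents with a default namespace: children in the EML namespace.
new_style = ET.fromstring(
    '<eml xmlns="eml://ecoinformatics.org/eml-2.1.1">'
    '<dataset id="ds.1"/></eml>'
)

old_count = len(find_datasets(old_style))
new_count = len(find_datasets(new_style))
print(old_count, new_count)
```

Both counts are 1. The check deliberately does not accept `dataset` in an unrelated namespace, so the embedded case described above (a `dataset` that lands in http://foo.example.com) would still be missed, which is the underlying ambiguity this comment is pointing at.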

The correct thing to do, i believe, is to explicitly say that all elements defined in the EML specification are in the EML namespace. The usual way to make this clear in a specification is to use a namespace prefix on every element, and not to rely on default namespaces, because default namespaces have scope. I.e., your sample record would look like:

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb" 
    xml:lang="pt_BR"
    xmlns:eml="http://ecoinformatics.org/eml-2.2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://ecoinformatics.org/eml-2.2 eml.xsd">

  <eml:dataset id="ds.1">
    <eml:title xml:lang="en_US">
…

To be clear, this is going to break existing processors that were handling the EML records that actually exist by finding child elements in the empty namespace. You'll see in this example i have also increased the version number in the namespace URI to accommodate this.

csjx commented 5 years ago

Heh - disregard that last comment (I deleted it) - wrong ticket. 🙂

mbjones commented 5 years ago

Clarifications

A couple of points of clarification, just so we are on the same page:

I've spent some time looking at this proposal, and talking to a few folks about it. While our current use of elementFormDefault=unqualified is a perfectly legitimate use of XML Schema (and is the default), I think we agree that changing it to allow qualified child elements would be helpful. The challenge is in how to do this in as compatible a way as possible to avoid breaking existing tools that rely on the current approach in EML. In particular, most of our editing applications (MetacatUI, Morpho) and our display stylesheets in XSLT have hundreds of encoded XPaths that reference local children elements in the empty namespace that would need to be updated to be namespace aware should we make this change. Thus, my conclusion is that this change to qualified elements would be backwards incompatible, and so does not belong in the EML 2.2 release which is meant to be fully compatible with previous releases.

Proposal

That said, I think it's a good idea for a 3.0.0 release, and here is what I think we should change.

  1. Consolidate all modules to use a single http-based namespace (rather than separate namespaces for each module)
  2. Change elementFormDefault to qualified
  3. Add a default namespace at the root level of documents so all elements are in that namespace, or alternatively, qualify them using prefixes
    • This will mean that child elements can still be "bare" in that they don't have to be explicitly prefixed and would inherit the namespace from the default, so most text documents would not need to be updated beyond the root element namespace declarations

I think at that point, people could set a default namespace on their documents and it would be used for any elements not explicitly prefixed. The only place it would need to be changed would be in additionalMetadata when other namespaces are used such as STMML. That might cause some validation issues with existing documents that we would need to thoroughly test, but I think it would be workable.

One outstanding question for me is:

In any case, due to the compatibility issue, I will retarget this to milestone 3.0.0.

Comments appreciated.