Research options for XML Schema 1.1 parsing and validation

averbraeck commented 12 months ago

Currently, Eclipse does not seem to support XML Schema 1.1. We might need this to parse a string that can either be an expression, or a reference to an existing element or a constant. The xsd:keyref and xsd:key elements do not seem to be able to handle this diversity using XML Schema 1.0 -- but this is to be researched as well.

This leads to two solution paths:

Solve the non-matching keys within XML Schema 1.0.
Include an editor, validator, and parser for XML Schema 1.1.

averbraeck commented 12 months ago

In terms of parsing, Xerces-J 2.12.2 implements XML Schema 1.1 with XPath 2.0. This means that we can parse XML files against an XML Schema 1.1 definition. See https://xerces.apache.org/xerces2-j/releases.html. However, the following important ingredients are missing:

Support in Eclipse for validation in the editor. There is currently no free and stable solution for Eclipse, although it has been requested since 2009. Paid solutions like Oxygen exist, but they are not an option for our project.
Support in JAXB / XJC for generating the Java classes in accordance with XML Schema 1.1. See the open issue at https://github.com/eclipse-ee4j/jaxb-ri/issues/1176 (and many issues before this -- dating back to 2009). OpenJDK won't fix this either, see https://bugs.openjdk.org/browse/JDK-8197490.
OpenJDK has a built-in Xerces-J parser, but without Schema 1.1 support (see https://bugs.openjdk.org/browse/JDK-8242470 and https://bugs.openjdk.org/browse/JDK-8197490, which show that a 'stripped' version of Xerces-J is used within the OpenJDK). This can, of course, be solved by explicitly including the full Xerces-J library.
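
A minimal sketch of how one might check which schema languages the JAXP SchemaFactory on the classpath accepts. The class name SchemaLanguageProbe and the method supports are hypothetical. The XSD 1.1 language URI is the one the Xerces-J documentation uses for its XSD 1.1 build. On a plain OpenJDK without an external Xerces-J jar, the 1.1 lookup is expected to fail, in line with the bug reports above.

```java
import javax.xml.XMLConstants;
import javax.xml.validation.SchemaFactory;

/** Hypothetical helper: reports which schema languages JAXP can serve. */
class SchemaLanguageProbe {
    /** Schema language URI used by the Xerces-J XSD 1.1 build. */
    static final String XSD_11 = "http://www.w3.org/XML/XMLSchema/v1.1";

    /** Returns true if a SchemaFactory for the given language can be created. */
    static boolean supports(String schemaLanguage) {
        try {
            SchemaFactory.newInstance(schemaLanguage);
            return true;
        } catch (IllegalArgumentException e) {
            // JAXP throws this when no installed implementation supports the language.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("XSD 1.0: " + supports(XMLConstants.W3C_XML_SCHEMA_NS_URI));
        System.out.println("XSD 1.1: " + supports(XSD_11));
    }
}
```

The result for XSD 1.1 depends on whether the Xerces-J XSD 1.1 jars are on the classpath, so the sketch reports it rather than assuming it.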

Uptake in other projects and languages is also slow or non-existent. .NET does not support XSD 1.1, see https://stackoverflow.com/questions/61293382/xsd-1-1-validation-for-both-java-and-net-c. C++ has no (free) support, only commercial external libraries, see https://stackoverflow.com/questions/13057249/c-implementation-of-xml-schema-xsd-1-1. In Python it has become possible since 2022: https://stackoverflow.com/questions/19809141/is-it-possible-to-validate-an-xml-file-against-xsd-1-1-in-python. So, uptake of this standard, which has been in the making since 2009 and became a W3C Recommendation in 2012, is extremely slow, with a few recent updates, and most libraries are commercial.

This points in the direction of NOT using XML Schema 1.1 at the moment.

averbraeck commented 12 months ago

Can we accomplish what we want using XML Schema 1.0? To a certain extent this is possible, but it would restrict the user a little bit (though, IMHO, not in a bad way). Let's break down very precisely what we are trying to accomplish in our XML files, and where XML Schema 1.1 might have been a solution.

Expressions have been introduced for most value fields in our XML files. This means that an expression between curly braces can be used instead of a value. So, instead of 6.28 rad, we can now specify {2*PI() [rad]}. Using an xsd:union tag, this is easily solved within the XSD specification.
Variables have been introduced that can be set in a scenario, and used in the above expressions. So, a variable called {maxspeed} can be defined with a certain value, and {maxspeed} can subsequently be used as the value of a field, or within an expression such as {maxspeed + 10.0[km/h]}, or {1.1 * maxspeed}. (The djutils evaluator knows that if a name is followed by (, it is a function or constant; if not, it is a variable name.) The XSD does not define what an expression should look like; it only indicates that a value is either a string without curly braces, indicating a constant value rather than an expression, or a string starting and ending with curly braces and no curly braces inside, indicating an expression that needs to be evaluated.
Variables can be used in some places where an xsd:key is expected. This is the one causing the problem -- the key cannot be validated against either the defined keys or the expressions, since the expressions are not in the key-list, and can contain any information.

Let's zoom in on the last case.

Expressions for keys are strings, not values.
The expression evaluator in djutils can only evaluate numerical expressions, and is not a string handler.
This means that the only expressions that we can use for a key value are names of variables between curly braces, such as {startNodeId}.
This simplifies matters a LOT. All variables that exist are defined in the scenario.
So, if an XML tag expects a key value that is checked against definitions using a keyref, we need to explore the combination of two lists rather than one: the value of the field should either be in the defined keys (in the example, the list of defined nodes), or in the list of defined variables.
Since an xsd:key can be defined from multiple XPath strings that are combined into one list, we can combine the names of the ids that can be chosen and the defined variables.
In order to make this work, variables in the scenario should be defined INCLUDING the opening and closing curly braces, otherwise they cannot be matched against what is present in the keyref field.

The above leads to two restrictions: (1) no expressions for keyrefs other than a variable name between curly braces, which sounds totally reasonable, and (2) variables are defined in the XML including their curly braces, which is, in my opinion, not an issue. With these, the validation of key/keyref combinations with expressions can be addressed using XML Schema 1.0. Even better: the variable references are really checked, and a validation error is given when a referenced variable between curly braces does not exist. As an extra benefit, pick lists will show both the defined keys (e.g., nodes) and the variables defined in the scenario.
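
The combined-list idea can be sketched in plain Java, independent of any XML tooling. This is only an illustration of the lookup rule that the schema is meant to enforce; the class KeyrefLookup and the method resolves are hypothetical names, not part of the project.

```java
import java.util.Set;

/** Hypothetical illustration of the "two lists" rule for a keyref value. */
class KeyrefLookup {
    /**
     * A reference resolves if it equals a defined node id, or equals a defined
     * variable name. Variable names are stored INCLUDING their curly braces.
     */
    static boolean resolves(String ref, Set<String> nodeIds, Set<String> variableIds) {
        return nodeIds.contains(ref) || variableIds.contains(ref);
    }

    public static void main(String[] args) {
        Set<String> nodes = Set.of("TREC");
        Set<String> variables = Set.of("{startNodeId}");
        System.out.println(resolves("TREC", nodes, variables));
        System.out.println(resolves("{startNodeId}", nodes, variables));
        System.out.println(resolves("startNodeId", nodes, variables));
    }
}
```

Storing the braces as part of the variable name also keeps the two lists disjoint, as long as node ids themselves never contain curly braces.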

Below, the corresponding XSD definitions will be shown.

averbraeck commented 12 months ago

Suppose we have a simple XSD file as an example that defines two generic simple types:

a VariableType as a string that starts and ends with a curly brace; inside the curly braces it starts with a letter and has no further braces, curly braces, spaces, or other difficult characters inside;
an IdType that starts with a letter and has no further braces, curly braces, spaces, or other difficult characters inside.

  <xsd:simpleType name="VariableType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\{[A-Za-z][A-Za-z0-9_\-\.%!@#\^]*\}"></xsd:pattern>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:simpleType name="IdType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[A-Za-z][A-Za-z0-9_\-\.%!@#\^]+"></xsd:pattern>
    </xsd:restriction>
  </xsd:simpleType>

Suppose we define a NodeType as either an id or a variable:

  <xsd:simpleType name="NodeType">
    <xsd:union memberTypes="test:IdType test:VariableType" />
  </xsd:simpleType>
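
To get a feel for what the two patterns accept, here is a small sketch that mirrors them with java.util.regex. It assumes that the patterns behave the same in Java's regex dialect as in XSD's, which holds for the constructs used here; XSD patterns always match the whole value, hence matches() rather than find(). The class NodeRefClassifier and the enum Kind are hypothetical names.

```java
import java.util.regex.Pattern;

/** Hypothetical helper mirroring the IdType / VariableType patterns. */
class NodeRefClassifier {
    enum Kind { ID, VARIABLE, INVALID }

    // Characters allowed after the leading letter, copied from the XSD patterns.
    private static final String NAME_CHARS = "[A-Za-z0-9_\\-\\.%!@#\\^]";
    private static final Pattern ID = Pattern.compile("[A-Za-z]" + NAME_CHARS + "+");
    private static final Pattern VARIABLE =
        Pattern.compile("\\{[A-Za-z]" + NAME_CHARS + "*\\}");

    /** Classifies a value the way the NodeType union would: id first, then variable. */
    static Kind classify(String value) {
        if (ID.matcher(value).matches()) {
            return Kind.ID;
        }
        if (VARIABLE.matcher(value).matches()) {
            return Kind.VARIABLE;
        }
        return Kind.INVALID;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"TREC", "{special-node}", "{a b}", "A"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Note that, as written, the IdType pattern ends in + and therefore requires at least two characters, so a single-letter id such as A is rejected, whereas a single-letter variable such as {x} is accepted.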

Now, we define a Variable, Node, and Link with references to a start node and end node:

  <xsd:element name="Variable">
    <xsd:complexType>
      <xsd:attribute name="Id" type="test:VariableType" use="required" />
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="Node">
    <xsd:complexType>
      <xsd:attribute name="Id" type="test:IdType" use="required" />
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="Link">
    <xsd:complexType>
      <xsd:attribute name="Id" type="test:IdType" use="required" />
      <xsd:attribute name="NodeStart" type="test:NodeType" use="required" />
      <xsd:attribute name="NodeEnd" type="test:NodeType" use="required" />
    </xsd:complexType>
  </xsd:element>

and finally a Network consisting of variables, nodes and links:

  <xsd:element name="Network">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
          <xsd:element ref="test:Node" minOccurs="0" maxOccurs="1" />
          <xsd:element ref="test:Link" minOccurs="0" maxOccurs="1" />
          <xsd:element ref="test:Variable" minOccurs="0" maxOccurs="1" />
        </xsd:choice>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

The next post will show how to define the keys and keyrefs.

averbraeck commented 12 months ago

The xsd:key and xsd:keyref entries to enforce that the NodeStart and NodeEnd attributes of the Link either contain a node id (no curly braces), or a variable name between curly braces, can be defined as follows:

    <xsd:key name="nodeKey">
      <xsd:selector xpath=".//test:Network/test:Node|.//test:Network/test:Variable" />
      <xsd:field xpath="@Id" />
    </xsd:key>

    <xsd:keyref name="linkNodeStartNodeIdRef" refer="test:nodeKey">
      <xsd:selector xpath=".//test:Network/test:Link" />
      <xsd:field xpath="@NodeStart" />
    </xsd:keyref>

    <xsd:keyref name="linkNodeEndNodeIdRef" refer="test:nodeKey">
      <xsd:selector xpath=".//test:Network/test:Link" />
      <xsd:field xpath="@NodeEnd" />
    </xsd:keyref>

The vertical bar (the XPath union operator) combines both selections into one key list, so a value referring to nodeKey can be either a node id or a variable id.

This works nicely. When the variable special-node is defined:

  <test:Variable Id="{special-node}" />

and regular nodes are defined, e.g.,

  <test:Node Id="TREC" />

then we can validly define a link as follows:

  <test:Link Id="ECSC" NodeStart="TREC" NodeEnd="{special-node}" />

This validates correctly. Changing the node name or the variable name in only one place (the definition or the reference) will render the XML invalid.
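
The whole example can be tried outside Eclipse with the XSD 1.0 validator that ships with the JDK. The sketch below compacts the schema from this thread into a string and makes a few assumptions that the thread does not state: the namespace http://example.org/test for the test prefix, and a hypothetical root element Model that contains the Network and carries the key and keyrefs (the .//test:Network/... selectors imply that they sit on an ancestor of Network). The class KeyrefDemo and its methods are hypothetical as well.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Hypothetical demo: checks Link node references with the JDK's XSD 1.0 validator. */
class KeyrefDemo {
    static final String NS = "http://example.org/test"; // assumed namespace
    static final String NAME = "[A-Za-z0-9_\\-\\.%!@#\\^]";

    private static String attr(String name, String type) {
        return "<xsd:attribute name='" + name + "' type='test:" + type + "' use='required'/>";
    }

    private static String keyref(String name, String field) {
        return "<xsd:keyref name='" + name + "' refer='test:nodeKey'>"
            + "<xsd:selector xpath='.//test:Network/test:Link'/>"
            + "<xsd:field xpath='@" + field + "'/></xsd:keyref>";
    }

    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema' xmlns:test='" + NS + "'"
        + " targetNamespace='" + NS + "' elementFormDefault='qualified'>"
        + "<xsd:simpleType name='VariableType'><xsd:restriction base='xsd:string'>"
        + "<xsd:pattern value='\\{[A-Za-z]" + NAME + "*\\}'/></xsd:restriction></xsd:simpleType>"
        + "<xsd:simpleType name='IdType'><xsd:restriction base='xsd:string'>"
        + "<xsd:pattern value='[A-Za-z]" + NAME + "+'/></xsd:restriction></xsd:simpleType>"
        + "<xsd:simpleType name='NodeType'>"
        + "<xsd:union memberTypes='test:IdType test:VariableType'/></xsd:simpleType>"
        + "<xsd:element name='Variable'><xsd:complexType>" + attr("Id", "VariableType")
        + "</xsd:complexType></xsd:element>"
        + "<xsd:element name='Node'><xsd:complexType>" + attr("Id", "IdType")
        + "</xsd:complexType></xsd:element>"
        + "<xsd:element name='Link'><xsd:complexType>" + attr("Id", "IdType")
        + attr("NodeStart", "NodeType") + attr("NodeEnd", "NodeType")
        + "</xsd:complexType></xsd:element>"
        + "<xsd:element name='Network'><xsd:complexType>"
        + "<xsd:choice minOccurs='0' maxOccurs='unbounded'><xsd:element ref='test:Node'/>"
        + "<xsd:element ref='test:Link'/><xsd:element ref='test:Variable'/>"
        + "</xsd:choice></xsd:complexType></xsd:element>"
        // Hypothetical root element that holds the identity constraints.
        + "<xsd:element name='Model'><xsd:complexType><xsd:sequence>"
        + "<xsd:element ref='test:Network'/></xsd:sequence></xsd:complexType>"
        + "<xsd:key name='nodeKey'><xsd:selector"
        + " xpath='.//test:Network/test:Node|.//test:Network/test:Variable'/>"
        + "<xsd:field xpath='@Id'/></xsd:key>"
        + keyref("linkNodeStartNodeIdRef", "NodeStart")
        + keyref("linkNodeEndNodeIdRef", "NodeEnd")
        + "</xsd:element></xsd:schema>";

    /** Compiles the schema; a failure here means the sketch itself is broken. */
    static Schema schema() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema does not compile", e);
        }
    }

    /** Validates a network with one variable, one node, and one Link with the given attributes. */
    static boolean isValid(String linkAttributes) {
        String xml = "<test:Model xmlns:test='" + NS + "'><test:Network>"
            + "<test:Variable Id='{special-node}'/><test:Node Id='TREC'/>"
            + "<test:Link Id='ECSC' " + linkAttributes + "/>"
            + "</test:Network></test:Model>";
        try {
            schema().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // pattern violation or dangling keyref
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("NodeStart='TREC' NodeEnd='{special-node}'"));
        System.out.println(isValid("NodeStart='TREC' NodeEnd='{other-node}'"));
    }
}
```

The first call mirrors the Link shown above; the second refers to a variable that is not defined, which the keyref is expected to reject.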

averbraeck commented 12 months ago

This has been tested with the Eclipse editors, and it works flawlessly. Parsing with Xerces or JAXB/XJC will work fine as well.

WJSchakel commented 12 months ago

Great, this seems like a solid solution. An open question is how we define the input parameter types. If "NAME" is some valid variable name, the input parameters that will be used as ID replacements will themselves have an ID of "{NAME}".

Will all String input parameters have such IDs?
Will all other input parameters have such IDs?

In both of these cases, the braces are not strictly required. If either is answered with yes, then "NAME" needs to be a valid variable name in the expression editor. For consistency and clarity I'd say 'yes' to both questions.

WJSchakel commented 12 months ago

The above is the topic of issue #83.

averbraeck / opentrafficsim

Research options for XML Schema 1.1 parsing and validation #81