averbraeck / opentrafficsim

Open Source Multi-Level Traffic Simulator
BSD 3-Clause "New" or "Revised" License

Research options for XML Schema 1.1 parsing and validation #81

Closed averbraeck closed 11 months ago

averbraeck commented 12 months ago

Currently, Eclipse does not seem to support XML Schema 1.1. We might need this to parse a string that can either be an expression, or a reference to an existing element or a constant. The xsd:keyref and xsd:key elements do not seem to be able to handle this diversity using XML Schema 1.0 -- but this is to be researched as well.

This leads to two solution paths:

  1. solve non-matching keys using XML Schema 1.0
  2. include editor, validator and parser for XML Schema 1.1

averbraeck commented 12 months ago

In terms of parsing, Xerces-J 2.12.2 implements XML Schema 1.1 with XPath 2.0. This means that we can parse and validate XML files against an XML Schema 1.1 schema. See https://xerces.apache.org/xerces2-j/releases.html. However, the following important ingredients are missing:

Uptake in other projects and languages is also slow or non-existent. .NET does not support XSD 1.1, see https://stackoverflow.com/questions/61293382/xsd-1-1-validation-for-both-java-and-net-c. C++ has no (free) support, only commercial external libraries, see https://stackoverflow.com/questions/13057249/c-implementation-of-xml-schema-xsd-1-1. In Python it has become possible since 2022: https://stackoverflow.com/questions/19809141/is-it-possible-to-validate-an-xml-file-against-xsd-1-1-in-python. So, uptake of this standard (a W3C Recommendation since 2012, after drafts dating back to 2009) is extremely slow, with only a few recent additions, and most libraries are commercial.

This points in the direction of NOT using XML Schema 1.1 at the moment.

averbraeck commented 12 months ago

Can we accomplish what we want using XML Schema 1.0? To a certain extent this is possible, but it would restrict the user a little (IMHO, not in a bad way). Let's break down precisely what we are trying to accomplish in our XML files, for which XML Schema 1.1 might be a solution.

Let's zoom in on the last case.

Based on the above restrictions, namely (1) no expressions for keyrefs, which sounds totally reasonable, and (2) variables written in the XML between curly braces, which is, in my opinion, not an issue, the validation of key/keyref combinations with expressions can be addressed using XML Schema 1.0. Even better: the expressions are really checked, and a validation error is given when the variable between curly braces does not exist. As an extra benefit, pick lists will show both the defined keys (e.g., nodes) and the variables defined in the scenario.

Below, the corresponding XSD definitions will be shown.

averbraeck commented 12 months ago

Suppose we have a simple xsd-file as an example that defines two generic simple types:

  <xsd:simpleType name="VariableType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\{[A-Za-z][A-Za-z0-9_\-\.%!@#\^]*\}"></xsd:pattern>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:simpleType name="IdType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[A-Za-z][A-Za-z0-9_\-\.%!@#\^]+"></xsd:pattern>
    </xsd:restriction>
  </xsd:simpleType>

Suppose we define a NodeType as either an id or a variable:

  <xsd:simpleType name="NodeType">
    <xsd:union memberTypes="test:IdType test:VariableType" />
  </xsd:simpleType>
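
To make the effect of the union concrete, here is a minimal Java sketch that mimics the three simple types with `java.util.regex`. The two patterns are copied from the XSD above. The class name `NodeTypeSketch`, the `Kind` enum and the `classify` method are hypothetical names introduced only for illustration. XSD regular expressions are not identical to Java's in general, but as far as I can tell these two patterns behave the same in both. XSD patterns always match the whole value, so the sketch uses `matches()`.

```java
import java.util.regex.Pattern;

class NodeTypeSketch {
    // Patterns copied from the VariableType and IdType definitions above.
    static final Pattern VARIABLE_PATTERN =
            Pattern.compile("\\{[A-Za-z][A-Za-z0-9_\\-\\.%!@#\\^]*\\}");
    static final Pattern ID_PATTERN =
            Pattern.compile("[A-Za-z][A-Za-z0-9_\\-\\.%!@#\\^]+");

    /** Hypothetical result type: which member of the union accepted the value. */
    enum Kind { ID, VARIABLE, INVALID }

    /** Tries the union members in the order listed in memberTypes: IdType, then VariableType. */
    static Kind classify(String value) {
        if (ID_PATTERN.matcher(value).matches()) {
            return Kind.ID;
        }
        if (VARIABLE_PATTERN.matcher(value).matches()) {
            return Kind.VARIABLE;
        }
        return Kind.INVALID;
    }

    public static void main(String[] args) {
        String[] samples = {"TREC", "{special-node}", "special-node}", "A", "{A}", "{}"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```

Because an id must start with a letter and a variable must start with a curly brace, a value can never match both member types. One detail worth noting: IdType ends in `+`, so a single-letter id such as `A` is rejected, while VariableType ends in `*`, so `{A}` is accepted.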

Now, we define a Variable, Node, and Link with references to a start node and end node:

  <xsd:element name="Variable">
    <xsd:complexType>
      <xsd:attribute name="Id" type="test:VariableType" use="required" />
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="Node">
    <xsd:complexType>
      <xsd:attribute name="Id" type="test:IdType" use="required" />
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="Link">
    <xsd:complexType>
      <xsd:attribute name="Id" type="test:IdType" use="required" />
      <xsd:attribute name="NodeStart" type="test:NodeType" use="required" />
      <xsd:attribute name="NodeEnd" type="test:NodeType" use="required" />
    </xsd:complexType>
  </xsd:element>

and finally a Network consisting of variables, nodes and links:

  <xsd:element name="Network">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
          <xsd:element ref="test:Node" minOccurs="0" maxOccurs="1" />
          <xsd:element ref="test:Link" minOccurs="0" maxOccurs="1" />
          <xsd:element ref="test:Variable" minOccurs="0" maxOccurs="1" />
        </xsd:choice>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

The next post will show how to define the keys and keyrefs.

averbraeck commented 12 months ago

The xsd:key and xsd:keyref entries that enforce that the NodeStart and NodeEnd attributes of the Link contain either a node id (no curly braces) or a variable name between curly braces can be defined as follows:

    <xsd:key name="nodeKey">
      <xsd:selector xpath=".//test:Network/test:Node|.//test:Network/test:Variable" />
      <xsd:field xpath="@Id" />
    </xsd:key>

    <xsd:keyref name="linkNodeStartNodeIdRef" refer="test:nodeKey">
      <xsd:selector xpath=".//test:Network/test:Link" />
      <xsd:field xpath="@NodeStart" />
    </xsd:keyref>

    <xsd:keyref name="linkNodeEndNodeIdRef" refer="test:nodeKey">
      <xsd:selector xpath=".//test:Network/test:Link" />
      <xsd:field xpath="@NodeEnd" />
    </xsd:keyref>

The vertical bar (the XPath union operator) makes the selector match both the Node and the Variable elements, so the key for a node is either a node id or a variable id.
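
The behaviour of the selector can be tried outside a schema with the standard `javax.xml.xpath` API, since the selector is also a valid XPath 1.0 expression. This is only a sketch: the namespace URI `http://example.org/test`, the class name `SelectorUnionSketch` and the method `selectedIds` are assumptions made for illustration, and the expression is evaluated from the document node with Network as the root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SelectorUnionSketch {
    // Hypothetical namespace bound to the "test" prefix; the thread does not state the real one.
    static final String NS = "http://example.org/test";

    // The selector of nodeKey, copied from the xsd:key definition above.
    static final String SELECTOR = ".//test:Network/test:Node|.//test:Network/test:Variable";

    static final String SAMPLE =
            "<test:Network xmlns:test='" + NS + "'>"
            + "<test:Variable Id='{special-node}'/>"
            + "<test:Node Id='TREC'/>"
            + "<test:Link Id='ECSC' NodeStart='TREC' NodeEnd='{special-node}'/>"
            + "</test:Network>";

    /** Returns the Id attribute of every element the selector matches. */
    static List<String> selectedIds(String xml, String selector) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "test".equals(prefix) ? NS : "";
                }
                @Override public String getPrefix(String uri) {
                    return NS.equals(uri) ? "test" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return Collections.singletonList("test").iterator();
                }
            });

            NodeList hits = (NodeList) xpath.evaluate(selector, doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                ids.add(((Element) hits.item(i)).getAttribute("Id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate selector", e);
        }
    }

    public static void main(String[] args) {
        // The Link element is not selected, so its Id is not part of the key.
        System.out.println(selectedIds(SAMPLE, SELECTOR));
    }
}
```

The result contains the ids of the Variable and the Node, but not the id of the Link. That is the set of values a keyref referring to nodeKey is allowed to point at.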

This works nicely. When the variable special-node is defined:

  <test:Variable Id="{special-node}" />

and regular nodes are defined, e.g.,

  <test:Node Id="TREC" />

then we can validly define a link as follows:

  <test:Link Id="ECSC" NodeStart="TREC" NodeEnd="{special-node}" />

This validates correctly. Changing either the node id or the variable name without updating the reference in the link renders the XML invalid.
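
For anyone who wants to reproduce this outside Eclipse, the following is a minimal sketch using the XML Schema 1.0 validator that ships with the JDK (`javax.xml.validation`). Several things in it are assumptions and not part of the actual schema: the namespace URI `http://example.org/test`, the wrapper element `Scenario` on which the key and keyrefs are declared (the selectors start with `.//test:Network`, so they need to sit on an element above Network), and the class and method names. The XSD is a condensed copy of the fragments in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class KeyrefValidationSketch {
    static final String NS = "http://example.org/test"; // hypothetical namespace

    static final String XSD = String.join("\n",
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'",
        "    xmlns:test='" + NS + "' targetNamespace='" + NS + "'>",
        "  <xsd:simpleType name='VariableType'><xsd:restriction base='xsd:string'>",
        "    <xsd:pattern value='\\{[A-Za-z][A-Za-z0-9_\\-\\.%!@#\\^]*\\}'/>",
        "  </xsd:restriction></xsd:simpleType>",
        "  <xsd:simpleType name='IdType'><xsd:restriction base='xsd:string'>",
        "    <xsd:pattern value='[A-Za-z][A-Za-z0-9_\\-\\.%!@#\\^]+'/>",
        "  </xsd:restriction></xsd:simpleType>",
        "  <xsd:simpleType name='NodeType'>",
        "    <xsd:union memberTypes='test:IdType test:VariableType'/>",
        "  </xsd:simpleType>",
        "  <xsd:element name='Variable'><xsd:complexType>",
        "    <xsd:attribute name='Id' type='test:VariableType' use='required'/>",
        "  </xsd:complexType></xsd:element>",
        "  <xsd:element name='Node'><xsd:complexType>",
        "    <xsd:attribute name='Id' type='test:IdType' use='required'/>",
        "  </xsd:complexType></xsd:element>",
        "  <xsd:element name='Link'><xsd:complexType>",
        "    <xsd:attribute name='Id' type='test:IdType' use='required'/>",
        "    <xsd:attribute name='NodeStart' type='test:NodeType' use='required'/>",
        "    <xsd:attribute name='NodeEnd' type='test:NodeType' use='required'/>",
        "  </xsd:complexType></xsd:element>",
        "  <xsd:element name='Network'><xsd:complexType>",
        "    <xsd:choice minOccurs='0' maxOccurs='unbounded'>",
        "      <xsd:element ref='test:Node'/>",
        "      <xsd:element ref='test:Link'/>",
        "      <xsd:element ref='test:Variable'/>",
        "    </xsd:choice>",
        "  </xsd:complexType></xsd:element>",
        "  <!-- Scenario is a hypothetical wrapper that carries the identity constraints. -->",
        "  <xsd:element name='Scenario'>",
        "    <xsd:complexType><xsd:sequence>",
        "      <xsd:element ref='test:Network'/>",
        "    </xsd:sequence></xsd:complexType>",
        "    <xsd:key name='nodeKey'>",
        "      <xsd:selector xpath='.//test:Network/test:Node|.//test:Network/test:Variable'/>",
        "      <xsd:field xpath='@Id'/>",
        "    </xsd:key>",
        "    <xsd:keyref name='linkNodeStartNodeIdRef' refer='test:nodeKey'>",
        "      <xsd:selector xpath='.//test:Network/test:Link'/>",
        "      <xsd:field xpath='@NodeStart'/>",
        "    </xsd:keyref>",
        "    <xsd:keyref name='linkNodeEndNodeIdRef' refer='test:nodeKey'>",
        "      <xsd:selector xpath='.//test:Network/test:Link'/>",
        "      <xsd:field xpath='@NodeEnd'/>",
        "    </xsd:keyref>",
        "  </xsd:element>",
        "</xsd:schema>");

    /** Wraps one Link with the given NodeEnd into a small test document. */
    static String document(String nodeEnd) {
        return "<test:Scenario xmlns:test='" + NS + "'><test:Network>"
            + "<test:Variable Id='{special-node}'/>"
            + "<test:Node Id='TREC'/>"
            + "<test:Link Id='ECSC' NodeStart='TREC' NodeEnd='" + nodeEnd + "'/>"
            + "</test:Network></test:Scenario>";
    }

    /** Returns true when the document validates; a broken schema is reported as an exception. */
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself does not compile", e);
        }
        try {
            // Without an ErrorHandler, validation errors are thrown as SAXException.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            System.out.println("invalid: " + e.getMessage());
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(document("{special-node}"))); // reference to a defined variable
        System.out.println(isValid(document("{other-node}")));   // reference to an undefined variable
    }
}
```

The first document refers to the defined variable and should pass. The second uses a syntactically correct variable name that is not defined anywhere, so the pattern check succeeds but the keyref check should fail.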

averbraeck commented 12 months ago

This has been tested with the Eclipse editors, and it works flawlessly. Parsing with Xerces or JAXB/XJC will work fine as well.

WJSchakel commented 12 months ago

Great, this seems like a solid solution. An open question is how we define the input parameter types. If "NAME" is some valid variable name, the input parameters that will be used as ID replacements will themselves have an ID of "{NAME}".

  1. Will all String input parameters have such IDs?
  2. Will all other input parameters have such IDs?

In both of these cases, the braces are not strictly required. If either is answered with yes, then "NAME" needs to be a valid variable name in the expression editor. For consistency and clarity I'd say 'yes' to both questions.

WJSchakel commented 12 months ago

The above is the topic of issue #83.