Schematron / schematron-enhancement-proposals

This repository collects proposals to enhance Schematron beyond the ISO specification

Support invisible XML #48

Open rjelliffe opened 2 years ago

rjelliffe commented 2 years ago

Background

Invisible XML is a simple system for a deterministic context-free transducer (specified with a non-deterministic context-free attribute grammar) that is worth supporting.

IXML can be considered useful both in itself and as a good example of a class of processing.

Scenarios

Obviously a non-XML document converted into XML using an iXML grammar can be validated with Schematron. And a Schematron engine could have its own method to detect a non-XML document, run the conversion, and present the result to the Schematron validation.

However, there are three other scenarios.

  1. We want to be able to validate a non-XML document directly, and we want the grammar used for this to be part of the Schematron schema, either inline or by name.

  2. We want an sch:pattern/@document reference that retrieves a non-XML resource to convert that document to XML.

  3. We want to be able to take some node value (such as an attribute's value), convert it to XML, and have that XML available in a variable.

3a. We want to take that variable and validate patterns in it.

Also, SVRL needs to be adjusted to cope.

Proposal

SVRL

As an initial minimal approach that leaves as much flexibility for implementers as possible, I propose to augment SVRL with svrl:active-pattern/svrl:conversion-failure, a container element that can contain any message from the parser. (As with URL retrieval failures, we are rather at the mercy of the library and implementation for the quality and user-targeting of the error message.)
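
For illustration, the SVRL output for a pattern whose conversion failed might look something like the sketch below; the attribute shown and the wording of the parser message are purely illustrative, not part of the proposal.

   <svrl:active-pattern documents="http://eg.co/po1.txt">
      <svrl:conversion-failure>
         ixml: no complete parse; input rejected at line 3, column 7
      </svrl:conversion-failure>
   </svrl:active-pattern>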

See #47 for info.

Schematron

1) Main document from ixml

Schematron is augmented with a top-level element sch:schema/sch:conversion, which registers a converter name for a MIME type or extension. The converter can be given inline or by reference.

  <sch:conversion is="somename"  mime-type="text/*"  convert-as="ixml" >
      .. ixmlt gramar here
  </sch:conversion>

or

   <sch:conversion id="somename" mime-type="text/*" covert-as="ixml"  href="URL or file relative to schema" />

Instead of, or in addition to, @mime-type we allow @filename to match the filename by regex, e.g. .*\.ixml. Perhaps we can, for UNIXy reasons, allow @magic to look at the initial bytes of the file.
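
For instance, registrations keyed on the filename or on leading bytes might look like this (the names, patterns, and grammar files are illustrative only):

   <sch:conversion id="csv-by-name"   filename=".*\.csv"  convert-as="ixml"  href="notations/CSV.txt" />
   <sch:conversion id="json-by-magic" magic="{"           convert-as="ixml"  href="notations/JSON.txt" />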

The sch:schema element is augmented by an attribute @use-conversion which provides the conversion to use, e.g.

   <sch:schema ... use-conversion="USB-address" />
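
Putting the two pieces together, a minimal sketch of scenario 1 might be the following; the conversion name, the grammar file, and the element names of the converted document (order, line, total) are assumptions for illustration only.

   <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
               queryBinding="xslt2"
               use-conversion="purchase-order-text">
      <!-- The iXML grammar that turns the plain-text purchase order into XML -->
      <sch:conversion id="purchase-order-text" mime-type="text/*"
                      convert-as="ixml" href="notations/purchase-order.ixml" />
      <sch:pattern>
         <!-- The context assumes the grammar produces an order root with line and total children -->
         <sch:rule context="/order">
            <sch:assert test="number(total) ge sum(line/amount)"
             >The order total should cover the sum of the line amounts.</sch:assert>
         </sch:rule>
      </sch:pattern>
   </sch:schema>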

2) Pattern on external document from ixml

The sch:pattern element is augmented by an attribute @use-conversion which provides the conversion to use.

   <sch:pattern ...  document=" 'http://eg.co/po1.txt' "  use-conversion="PurchaseOrder" />

@use-conversion can only be used if @document is present. If the retrieved resource already has an XML MIME type, then no conversion is performed (and an implementation-determined warning is generated).

If there are two patterns with the same URL and conversion, the document should be re-used not re-retrieved.

3) Parse node value into variable

To read some text from a node into a variable and convert it, the sch:let element is augmented by an attribute @use-conversion which provides the conversion to use.

<sch:let name="xxxx" value=" @thing "  use-conversion="thing-parser"   />   

3a) Validate variables using patterns

However, there is no obvious mechanism to make patterns validate a variable's value. That is a more general facility that would be a separate proposal, probably only needed if this proposal is accepted.

Examples for 3) Parse node value into variable

There are numerous examples of complex data formats used in attributes and data content: URLs, and even CSV. There are many cases where it is not practical or desirable to represent the atomic components of some complex data using elements: because of verbosity, for example, or because there is an industry-standard idiom or notation that is what is being marked up.

Currently, Schematron fails in its core task of finding patterns in documents, whenever the document contains these complex field values.

ISO 8601

Our document is a large book catalog, where each book has a date using ISO 8601. This is not the subset used by XSD, but the full ISO 8601 date format. So we have an element like

<book 
    author="Erasmus" 
    author-life-span="%1466-%10/1536-07-23"  
    author-active-date="%1499-X/1536-07-?23"  
    creation-date="1523-X/?1524-X"  
    publication-date="2022-01-01"   > ...</book>

(For ISO 8601, the % means approximate, the ? means uncertain, the X is a wildcard, and the / is a date range; it allows omitting the day. Things like time zones are not shown.)

We want to validate that the author-active-date range fits in the author-life-span range, that the creation date fits in the author-active range, and that the publication-date is later than the creation-date. We have a converter for complete ISO8601 date to XML (whether this is iXML or some regex converter is not material) so we can have the complex expressions sitting as nice sets of XDM nodes.
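
For the assertions below to be readable, assume the converter turns @author-life-span="%1466-%10/1536-07-23" into something like the following. The element names are only a guess at what such a grammar might produce, chosen to match the paths used in the tests.

   <date>
      <range>
         <from approximate="yes"><year>1466</year><month>10</month></from>
         <to><year>1536</year><month>07</month><day>23</day></to>
      </range>
   </date>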

<sch:schema   ... >
      <sch:conversion id="ISO8601"  covert-as="ixml"  href="notations/ISO8601-date.txt"  />
   ...
  <sch:pattern>
      <sch:rule context="book">
         <sch:let name="author-life-span-as-XML"    select="@author-life-span" use-conversion="ISO8601" />
         <sch:let name="author-active-date-as-XML"  select="@author-active-date" use-conversion="ISO8601" />
         <sch:let name="creation-date-as-XML"       select="@creation-date"      use-conversion="ISO8601" />
         <sch:let name="publication-date-as-XML"  select="@publication-date" use-conversion="ISO8601" />

         <sch:assert test="(number($author-active-date-as-XML/date/range/from/year) 
                                    >=   number($author-life-span-as-XML/date/range/from/year))
                             and  (number($author-active-date-as-XML/date/range/to/year) 
                                     &lt;= number($author-life-span-as-XML/date/range/to/year)) "
          >The author-active-date range should fit in the author-life-span range</sch:assert>

         <sch:assert test="(number($creation-date-as-XML/date/range/from/year) 
                                    >=   number($author-active-date-as-XML/date/range/from/year))
                             and  (number($creation-date-as-XML/date/range/to/year) 
                                     &lt;= number($author-active-date-as-XML/date/range/to/year))"
          >The creation date should fit in the author-active range</sch:assert>

         <sch:assert test="number($publication-date-as-XML/date/range/from/year) 
                                   > number($creation-date-as-XML/date/range/from/year)"
          >The publication-date should be later than the creation-date.</sch:assert>

And we can go on making the tests better, without having to worry about how to parse the data.

Example: XPaths

For Schematron itself, we have many XPaths. Schematron validation has been held back because validators do not check the XPaths.

The Schematron schema for Schematron could invoke the converter for the XPaths and do various kinds of validation. For example, in this example we check that we are not using XSLT 3 XPath novelties when our query language binding advertises the schema as only requiring XSLT 1 or XSLT 2.

<sch:schema   ... >
      <sch:conversion id="XPath"  covert-as="ixml"  href="notations/Xpath3-1.txt"  />
      ...
       <sch:pattern ="XSLT1-exclusions">
           <sch:rule context="sch:rule[/sch:schema[@qlb='xslt1' or @qlb='xslt2'] ">
                   <sch:let name="test-as-XML"    select="@test" use-conversion="XPath" /> 
                   <sch:report test="$test-as-XML//token[@value='function']"
                    >XSLT1 and XSLT2 do not allow function definitions in XPaths</sch:report>
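
As an illustration of what the report is matching on, a @test such as "for-each($items, function($i) { $i * 2 })" might be converted into a token stream along these lines; the element and attribute names are assumptions about the XPath grammar's output, not a fixed format.

   <xpath>
      <token value="for-each"/>
      <token value="("/>
      <token value="$items"/>
      <token value=","/>
      <token value="function"/>
      ...
   </xpath>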

Example: Land Points

A mapping system specifies areas of land by surfaces bounded by some number of points, where the points have a northerly, easterly, and elevation value.

These are specified in a whitespace separated list: N0 E0 H0 N1 E1 H1 ... Nn En Hn

<LandXML...>
...
<Surfaces ...>
<Surface>
<Definition ...>
<Pnts>
<P id="XYZ">30 10 20 40 80 110  40 85 6 32 12 24</P>
<P .../>
</Pnts>
<Faces>
<F .../>
</Faces>
</Definition>
...
</Surface>
...
</Surfaces>
...
</LandXML>

We want to make sure that none of the points in the polygon overlap. We want to do this by exposing the data as tuples, rather than hiding it behind some complex function.

Method: again, we define an iXML grammar that converts the P element into a variable as

<points> 
 <point N="3" E="10" H="20" />
 <point N="40" E="80" H="110" />
 <point N="40" E="85" H="6" />
 <point N="32" E="12" H="24" />
</points>

which is very explicit for validation.
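
A sketch of the overlap check itself might then be the following; the conversion name is illustrative, and the test assumes the points/point structure shown above.

   <sch:rule context="P">
      <sch:let name="points" value="." use-conversion="LandXML-points" />
      <!-- Every (N, E, H) tuple should occur only once in the point list -->
      <sch:assert test="count($points//point)
                        = count(distinct-values(for $p in $points//point
                                                return string-join(($p/@N, $p/@E, $p/@H), ' ')))"
       >No two points in a surface definition should coincide.</sch:assert>
   </sch:rule>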

(I note that in fact using Schematron to validate geometry is a real application: the intersection of flight routes over Europe being the example I was informed of.)

Validate styles from CSS stylesheet

We are validating an XHTML document. It has a linked CSS stylesheet. We want to confirm that the CSS has selectors for all the style names (classes) used in the XHTML.

So we have a CSS parser in iXML (or whatever), and we read the stylesheet in as a string (if XPath does not support this, a standard function should be made, presumably).

<sch:schema   ... >
      <sch:conversion id="CSS"  covert-as="ixml"  href="notations/CSS.txt"  />
      <sch:let  id="Stylesheet-uri" 
                     value="/html/head/link[@type='text/css'][1]/@ref " />

      <sch:let  id="Stylesheet-uri" 
                     value="extension:download-as-text( $Stylesheet-uri)" use-conversion="CSS" " />

      <sch:pattern>
         <sch:rule context="*[@class]">
                   ... do the validation here

So we have our CSS file as a top-level variable, as XML. The Schematron rules then handle looking up in that data.
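
For example, the elided rule body might be along these lines, assuming the CSS grammar produces a selector element carrying a class attribute; all of these names are illustrative, not the output of any particular grammar.

   <sch:rule context="*[@class]">
      <sch:let name="classes" value="tokenize(normalize-space(@class), ' ')" />
      <sch:assert test="every $c in $classes
                        satisfies $Stylesheet-as-XML//selector[@class = $c]"
       >Every class used in the document should have a matching selector in the linked CSS.</sch:assert>
   </sch:rule>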

(Of course, wild CSS has other issues: included stylesheets and so on. Being able to parse a stylesheet means that such things can start to be addressed, rather than us being stymied at the start.)

Example 2) Pattern on external document from ixml

Most of the Schematron projects I have been involved in over the years have involved AB testing: either testing that the information that was in the input document is also in the transformed document mutatis mutandis, or that when a document is converted then round-tripped back, it has the equivalent information as far as can be.

Database migration validation

Recently, I had a variation on this AB testing. A large complex organization web-publishes large complex XML dumps of its databases, produced by a large complex pipeline. They had lost confidence with the passage of years and rust and moth, and decided that prudence dictated they make smaller chunks of data available using JSON and CSV (as well as XML).

However, for a particular reason, they did not have access to the code that produced the big XML. So they wanted to cross-check their new JSON/CSV API against the XML data dumps. For a particular reason, they were not interested in backward compatibility (for all the data in the XML, does it match the JSON/CSV API?) but in forward compatibility (for all the data in the new JSON/CSV API, does it match the XML?).

With the current proposal, this could be handled in Schematron like this:

<sch:schema ... >

   <!-- Specify the kind of conversion and the script -->
   <sch:conversion id="CSV" mime-type="text/*" covert-as="ixml"  href="notations/CSV-converter.txt" />

   <!-- Give the primary XML document a name, so it can be accessed in patterns over external documents -->
   <sch:let name="xmlDocument" value="/*" as="element()"/>

   <!-- This pattern reads in the external  CSV document, converting it to XML, then validates it -->
   <sch:pattern ...  
          document=" 'https://eg.farm.gov.xx/datamart/yokel-list?characteristic=slack-jawed' " 
          use-conversion="CSV" > 

           <sch:rule context="/CSV/row">

                   <sch:assert test=" $xmlDocument//yokel/@hog-count = cell[1]"
                    >The value of the first cell of each row should be the same as the
                    yokel's hog-count in the XML</sch:assert>

tgraham-antenna commented 2 years ago

Should -- and if so, how would -- iXML support be an optional feature of an implementation?

rjelliffe commented 2 years ago

Should -- and if so, how would -- iXML support be an optional feature of an implementation?

Definitely. An implementation does not need to support any particular converter at all. (Indeed, they could "implement" it by just failing with a schema error if the Schematron schema has a sch:conversion element.)

And, despite the heading, the markup suggested is for a general mechanism, not tied to iXML specifically: iXML is just the motivating example. The sch:conversion element etc. can be used for anything that takes a resource (i.e. a string) and converts it to XML. For example, the same mechanism could be used to read encrypted files into a variable (decrypted and parsed as XML), where the content of the sch:conversion is a public key or whatever.
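
For instance, a purely hypothetical registration of such a converter could be:

   <sch:conversion id="secure-payload" mime-type="application/octet-stream" convert-as="decrypt-then-parse">
       ... key material here ...
   </sch:conversion>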

So the markup does two things:

1) A human can know, by looking at the schema, that in order to run the validation, their Schematron system needs to support the kind of converter indicated in sch:conversion/@convert-as. (There are no hidden gotchas, such as can occur when people use java: foreign functions in an XPath in some out-of-the-way location in their Schematron schema.)

For example, say some HL7 schema uses information that is stored in JSON: our schema specifies that it requires ixml to read this document into a variable as XML. Any developer charged with developing the validation system looks at the schema, and sees that it needs an iXML converter. They select or develop their system accordingly.

2) The Schematron system can know whether to generate a schema error (if it does not have a converter registered for that sch:conversion/@convert-as) or otherwise where to locate the auxiliary information (such as a grammar) needed for the conversion.

So an implementation would register a converter for some common name: e.g. for "ixml". If the schema comes along using some different name (e.g. "InvisibleXML") then the implementation needs configuration for that.

(N.b. I think using URLs or MIME types for naming the converter would be over-engineering: the problem URLs solve is name clashes, not what a name or URL corresponds to.)

AndrewSales commented 2 years ago

I think doing this would be to extend Schematron beyond its core purpose, and there are other existing tools, such as XProc, which would be better placed to serve this need by orchestrating conversion to XML and supplying the result to Schematron. The scenarios described seem to me to be clear cases for pipelining, which other tools such as XProc are designed specifically to handle already.

If this enhancement were approved, I agree it should definitely be optional in implementations.

@rjelliffe , your three "other scenarios" (descriptions beginning "We want...") - can you elaborate on exactly why these are wanted, with concrete examples?

rjelliffe commented 2 years ago

@AndrewSales : I have updated the original post and added examples for 3) and 2) as requested.

AndrewSales commented 2 years ago

Many thanks, @rjelliffe .

rjelliffe commented 2 years ago

Just as a note of interest (only to me): the schema language I was working on in 1999 immediately before Schematron was called "XML Notation Schemas". https://schematron.com/document/3046.html?publicationid=

This current proposal for converters on non-XML resources to allow validation is, finally, a way to implement what XML Notation Schemas was proposing then!

Instead of vaguely talking of "BNF" we now have the more concrete iXML. I toyed with the idea in Schematron v1.1 of adding something like this, but de-prioritized it when no obvious method sprang to mind, and because XSLT 1 was not very capable.

The idea of this language was to support specification and validation of embedded notations (as distinct from external files with some different notation) and to link them to validators/generators. So you could specify the lexical model for your notation (e.g. in Regular expression or BNF) then the notation would be tokenized by it and these tokens could then be validated as if they were element names by e.g. a content model. The XML Notation Schema allowed these complex data to be named and validated in an extensible way.

I tried to get the XML Schema WG interested in the idea, but a corporate member there thought it was a competitor to types which were really good while notations were an SGML idea and therefore really bad. "Without types you can do nothing" he said: utter nonsense. So XSD supported regexes but not grammars, and it did not allow testing constraints within the regexes or between some "captured text" of the regex and the rest of the document: so even with its regexes XSD again managed to extract the least bang-per-buck.

rjelliffe commented 2 years ago

So how does Schematron talk to the XProc which invoked it?

For example, say I have an instance document to validate

<info>
    ...
    <one-of-multiple-arbitrary-nested-element data="some string I want to parse with ixml 123" />
    ...
</info>

In my schema proposal, I can read the attribute each time it is encountered into a variable, convert to XML, and access that.

<sch:schema ... >
   <sch:conversion id="OOMANE-data"  ... />
   ...
   <sch:rule context="one-of-multiple-arbitrary-nested-element">
      <sch:let name="data-as-xml" value="@data" use-conversion="OOMANE-data"/>

      <sch:assert test="$data-as-xml//thing = '123'">The data attribute
      should have a "123" thing</sch:assert>

I don't see how XProc fits in. Is it supposed to duplicate the phase/pattern/rule/variable logic to identify strings, then pass them in somehow to Schematron at invocation time? And the schema would have to be written knowing what names the XProc was being written for.

I can see that XProc could fit in for scenario 1 (the main document is non-XML) only. But for the other scenarios, I don't see how.

Cheers Rick


tgraham-antenna commented 2 years ago

Preprocessing and 'in'-processing (for want of a better term) would both have their place, but I'm inclined to agree with @AndrewSales that Schematron doesn't necessarily need to be extended to do preprocessing when there's other, well-known or even standardised ways to do preprocessing.

If I wanted to validate CSS using Schematron, and I wanted to validate the CSS as a whole, then I would probably preprocess the CSS into XML using something like iXML and validate the XML. If I wanted to validate the CSS that applies to particular elements, then I would probably preprocess the HTML+CSS using something like Transpect's CSS tools (https://github.com/transpect/css-tools) to annotate the HTML with attributes for individual properties and validate those.

For validating the syntax of individual attribute values, focheck already does the data-as-xml conversion, but using XSLT rather than something standardised (and opaque).

Compare:

<sch:rule context=" one-of-multiple-arbitrary-nested-element ">
   <sch:let name="data-as-xml" value=" @ data " using-conversion="
OOMANE-data "/>

     <sch:assert test=" $data-as-xml//thing='123' "> The data attribute
should have "123" thing</sch:assert>

and:

<!-- axf:background-content-width -->
<!-- auto | scale-to-fit | scale-down-to-fit | scale-up-to-fit | <length> | <percentage> | inherit -->
<!-- Inherited: no -->
<!-- Shorthand: no -->
<!-- https://www.antenna.co.jp/AHF/help/en/ahf-ext.html#axf.background-content -->
<rule context="fo:*/@axf:background-content-width">
  <let name="expression" value="ahf:parser-runner(.)"/>
  <assert test="local-name($expression) = ('EnumerationToken', 'Length', 'Percent', 'EMPTY', 'ERROR', 'Object')">content-width="<value-of select="."/>" should be EnumerationToken, Length, or Percent.  '<value-of select="."/>' is a <value-of select="local-name($expression)"/>.</assert>
  <report test="$expression instance of element(EnumerationToken) and not($expression/@token = ('auto', 'scale-to-fit', 'scale-down-to-fit', 'scale-up-to-fit', 'inherit'))">content-width="<value-of select="."/>" enumeration token is '<value-of select="$expression/@token"/>'.  Token should be 'auto', 'scale-to-fit', 'scale-down-to-fit', 'scale-up-to-fit', or 'inherit'.</report>
  <report test="local-name($expression) = 'EMPTY'" role="Warning">content-width="" should be EnumerationToken, Length, or Percent.</report>
  <report test="local-name($expression) = 'ERROR'">Syntax error: content-width="<value-of select="."/>"</report>
</rule>

P.S. It turns out that email replies on issues can't contain Markdown, so Markdown in them isn't recognised as Markdown, and can't be edited to be.

kosek commented 1 year ago

I agree that supporting something like iXML can be convenient, but I'm not sure this balances the complexity added to Schematron. Validation of non-XML inputs can already be done either by preprocessing or by calling functions that turn non-XML syntax into XML, e.g.: