djc / rnc2rng

RELAX NG Compact to regular syntax conversion library
MIT License

Parsing errors on schemas provided by the W3C #42

Open LeonardusSagittarius opened 1 year ago

LeonardusSagittarius commented 1 year ago

rnc2rng reports parse errors when converting RNC schemas published by the W3C (which means you should expect them to be syntactically correct).

Converting https://www.w3.org/TR/xmlsec-rngschema/xmldsig-core-schema.rnc gives me the error message:

parse error in xmldsig-core-schema.rnc [118:20]
        xsd:anyURI "http://www.w3.org/TR/2001/REC-xml-c14n-20010315"
                   ^

(The ^ is pointing at the first quotation mark)

The W3C provides an rng version of the schema (I cannot use it, as the rnc version is included by another schema): https://www.w3.org/TR/xmlsec-rngschema/xmldsig-core-schema.rng

Converting https://www.w3.org/TR/xmlsec-rngschema/xenc-allowAnyForeign.rnc gives the error message:

parse error in xenc-allowAnyForeign.rnc [23:16]
    xsd:anyURI - xenc_EncryptionAlgorithms },
               ^

(The ^ is pointing at the dash after "xsd:anyURI")

The W3C provides an rng version of the schema (Again, I cannot use it, as the rnc version is included by another schema): https://www.w3.org/TR/xmlsec-rngschema/xenc-allowAnyForeign.rng

I did some research. It seems that specifying a value together with a value type this way is actually allowed, but it is never really mentioned in the official specification.

Sorry for my bad English.

djc commented 1 year ago

As mentioned in #41 I really don't have much interest in maintaining this project anymore. Would you be interested in submitting a PR for this? I could probably review it. Otherwise, I would be open to having work on this sponsored.

jombr commented 2 months ago

Problem 1: xmldsig-core-schema.rnc

I'm able to reproduce this using tag 2.6.6

(.venv) python -m rnc2rng /home/deck/git/notes/w3c-schema/xmldsig-core-schema.rnc
parse error in /home/deck/git/notes/w3c-schema/xmldsig-core-schema.rnc [118:20]
        xsd:anyURI "http://www.w3.org/TR/2001/REC-xml-c14n-20010315"
                   ^

With tag 2.7.0 this looks to have been fixed by #44 (commit a61678dca210dff80de573a1e06b36c8cf297424). We now get the following translation "near" 118:20:

  <define name="ds_CanonicalizationMethodType">
    <choice>
      <attribute>
        <name ns="">Algorithm</name>
        <choice>
          <value type="anyURI">http://www.w3.org/TR/2001/REC-xml-c14n-20010315</value>
          <value type="anyURI">http://www.w3.org/TR/2001/REC-xml-c14n-20010315#WithComments</value>
        </choice>
      </attribute>
      <attribute>
        <name ns="">Algorithm</name>
        <choice>
          <value type="anyURI">http://www.w3.org/2006/12/xml-c14n11</value>
          <value type="anyURI">http://www.w3.org/2006/12/xml-c14n11#WithComments</value>
        </choice>
      </attribute>
    </choice>
  </define>

This is similar to the xmldsig-core-schema.rng linked above...

  <define name="ds_CanonicalizationMethodType">
    <choice>
      <attribute name="Algorithm">
        <choice>
          <value type="anyURI">http://www.w3.org/TR/2001/REC-xml-c14n-20010315</value>
          <value type="anyURI">http://www.w3.org/TR/2001/REC-xml-c14n-20010315#WithComments</value>
        </choice>
      </attribute>
      <attribute name="Algorithm">
        <choice>
          <value type="anyURI">http://www.w3.org/2006/12/xml-c14n11</value>
          <value type="anyURI">http://www.w3.org/2006/12/xml-c14n11#WithComments</value>
        </choice>
      </attribute>
    </choice>
  </define>
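
The two fragments differ only in how the attribute name is written: RELAX NG treats a `name` attribute on `<attribute>` as shorthand for a `<name>` child element, with `ns=""` when no `ns` is given. The following is a minimal standard-library sketch of that equivalence. The helpers `expand_name_shorthand` and `canonical` are hypothetical, and the sketch assumes unprefixed names and fragments without the RELAX NG namespace declaration.

```python
import xml.etree.ElementTree as ET


def expand_name_shorthand(element):
    """Rewrite <attribute name="x"> as <attribute><name ns="">x</name>...</attribute>."""
    # Materialise the list first so the tree is not modified while iterating.
    for attribute in list(element.iter("attribute")):
        short_name = attribute.attrib.pop("name", None)
        if short_name is not None:
            name = ET.Element("name", {"ns": ""})
            name.text = short_name
            attribute.insert(0, name)
    return element


def canonical(xml_text):
    """Parse, expand the shorthand, and drop whitespace-only text for comparison."""
    root = expand_name_shorthand(ET.fromstring(xml_text))
    for node in root.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None
    return ET.tostring(root, encoding="unicode")


# Shortened versions of the two fragments shown above.
converted = (
    '<attribute><name ns="">Algorithm</name>'
    '<value type="anyURI">http://www.w3.org/2006/12/xml-c14n11</value></attribute>'
)
published = (
    '<attribute name="Algorithm">'
    '<value type="anyURI">http://www.w3.org/2006/12/xml-c14n11</value></attribute>'
)
print(canonical(converted) == canonical(published))
```

Under those assumptions the comparison prints `True`, which is why the 2.7.0 output can be treated as equivalent to the published .rng for this definition.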

Details About the Fix

The syntax, parser impl, and formal syntax

The fix introduced `'primary : CNAME strlit'`, which works with one of the pattern choices in the grammar from https://relaxng.org/compact-20021121.html#syntax

```
pattern ::= ...
    | [datatypeName] datatypeValue

datatypeName ::= CName

datatypeValue ::= literal
```

The path `pattern` takes in the parser is...

> this isn't an entire trace

```
'primary : ATTRIBUTE name-class LBRACE pattern RBRACE'
... do the next step just for pattern
'pattern : particle-choice'
'particle-choice : particle PIPE particle'
... Do the next steps for each particle
'particle : annotated-primary'
'annotated-primary : annotations primary'
... do the next step just for primary ... and finally ...
'primary : CNAME strlit'
```

Things like `annotated-primary` and `particle-choice` follow the formal syntax https://relaxng.org/compact-20021121.html#formal-syntax

In the formal syntax we see...

```
primary returns Element ::=
   ... some choice options including...
   | datatypeName datatypeValue

datatypeName returns Attributes ::=
    CName
    | "string"
    | "token"

datatypeValue returns String ::=
    literal
```

datatypeName and datatypeValue don't have their own `@pg.production` and have been "flattened" into the primary productions.

Examples of `datatypeName` flattened into `primary`:

- https://github.com/djc/rnc2rng/blob/95086ea5bc9f528bea5c7aff1cab76a14891b821/rnc2rng/parser.py#L428
- https://github.com/djc/rnc2rng/blob/95086ea5bc9f528bea5c7aff1cab76a14891b821/rnc2rng/parser.py#L440
- https://github.com/djc/rnc2rng/blob/95086ea5bc9f528bea5c7aff1cab76a14891b821/rnc2rng/parser.py#L452

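
To make the `datatypeName datatypeValue` choice concrete, here is a minimal standard-library sketch. It is not rnc2rng's parser (which is built from `@pg.production` rules); the regex and the function name `value_pattern_to_rng` are assumptions for illustration, and the literal handling ignores escapes and `~` concatenation.

```python
import re
import xml.etree.ElementTree as ET

# Simplified CName (prefix:local) followed by one double-quoted literal.
# Assumption: ASCII-style names only, no escapes inside the literal.
VALUE_PATTERN = re.compile(
    r'\s*([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)\s+"([^"]*)"\s*\Z'
)


def value_pattern_to_rng(source):
    """Translate `prefix:type "literal"` into a RELAX NG <value> element."""
    match = VALUE_PATTERN.match(source)
    if match is None:
        raise ValueError("not a datatypeName datatypeValue pattern: %r" % source)
    _prefix, type_name, literal = match.groups()
    element = ET.Element("value", {"type": type_name})
    element.text = literal
    return ET.tostring(element, encoding="unicode")


print(value_pattern_to_rng('xsd:anyURI "http://www.w3.org/2006/12/xml-c14n11"'))
```

The printed element has the same shape as the `<value type="anyURI">` lines in the 2.7.0 output above. An input such as `xsd:anyURI - xenc_KeyAgreementAlgorithms` is rejected by this sketch because it is a different pattern choice.
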
jombr commented 2 months ago

Problem 2: xenc-allowAnyForeign.rnc

This is still an error in 2.7.0. The parser doesn't support the following grammar...

pattern ::= ... some options...
    | "attribute" nameClass "{" pattern "}"

pattern ::= ... some options...
   | datatypeName ["{" param* "}"] [exceptPattern]

datatypeName ::= CName

exceptPattern ::= "-" pattern

pattern ::= ... some options...
   | identifier

identifier ::= (NCName - keyword)
   | quotedIdentifier

In this case, the EBNF group `(NCName - keyword)` is what matches `xenc_KeyAgreementAlgorithms`: an NCName that is not one of the reserved keywords.
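
The `NCName - keyword` rule can be sketched as a small check. This is an illustration only: the NCName regex is a simplified ASCII approximation, `is_identifier` is a hypothetical helper, the keyword list is the compact syntax keyword set as I read it from the specification, and `quotedIdentifier` is not handled.

```python
import re

# Reserved keywords of the RELAX NG compact syntax (as read from the spec).
KEYWORDS = frozenset("""
    attribute default datatypes div element empty external grammar include
    inherit list mixed namespace notAllowed parent start string text token
""".split())

# Simplified, ASCII-only approximation of an XML NCName (no colon allowed).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*\Z")


def is_identifier(token):
    """True when `token` matches `NCName - keyword`."""
    return bool(NCNAME.match(token)) and token not in KEYWORDS


for token in ("xenc_KeyAgreementAlgorithms", "element", "xsd:anyURI"):
    print(token, is_identifier(token))
```

Only the first token qualifies: `element` is a keyword, and `xsd:anyURI` contains a colon, so it is a CName rather than an identifier.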

The formal syntax would be ...

primary returns Element  ::=
   ... some choice options including...
   |  "attribute"  nameClass "{"  pattern  "}"

pattern returns Elements  ::=
    innerPattern

innerPattern(Xml anno) returns Elements  ::=
   ... some choice options including...
    |  annotatedDataExcept

annotatedDataExcept returns Elements  ::=
    leadAnnotatedDataExcept followAnnotations

leadAnnotatedDataExcept returns Element  ::=
    annotations dataExcept

dataExcept returns Element  ::=
    datatypeName  [optParams]  "-"  leadAnnotatedPrimary

datatypeName returns Attributes  ::=
    CName
    |  "string"
    |  "token"

leadAnnotatedPrimary returns Elements  ::=
    annotations primary

primary returns Element  ::=
   ... some choice options including...
   |  ref

ref returns String  ::=
    identifier

identifier returns String  ::=
    NCName - keyword

Since datatypeName and datatypeValue are already flattened into primary, we could flatten dataExcept into primary as something like...

# datatypeExcept ref
@pg.production('primary : CNAME MINUS identifier')
def primary_type_datatypeName_except_ref(s, p):
    right = Node('REF', p[2].name)
    minus = Node('EXCEPT', None, [right])
    left = Node('DATATAG', p[0].value, [minus])
    return left

This should cover your case...

attribute Algorithm { xsd:anyURI - xenc_KeyAgreementAlgorithms }

but wouldn't cover the other possibilities for `exceptPattern ::= "-" pattern`, where pattern is something like another datatypeName (or any other pattern).

Example of a case that would be missed:

attribute Algorithm { xsd:anyURI - xsd:anyURI }
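
For reference, an except pattern translates to a `<data>` element wrapping an `<except>` child in the RELAX NG XML syntax. The sketch below is independent of rnc2rng's internals: `except_to_rng` and its regex are hypothetical, params and annotations are not handled, and the `datatypeLibrary` attribute is assumed to be declared on an ancestor element. It accepts a right-hand side that is either a reference or another datatype name, the two cases discussed above.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: ASCII-only names; params and annotations are not handled.
NAME = r"[A-Za-z_][A-Za-z0-9_.-]*"
EXCEPT_PATTERN = re.compile(
    r"\s*(%s):(%s)\s*-\s*(?:(%s):(%s)|(%s))\s*\Z" % ((NAME,) * 5)
)


def except_to_rng(source):
    """Translate `prefix:type - rhs` into <data type="..."><except>...</except></data>."""
    match = EXCEPT_PATTERN.match(source)
    if match is None:
        raise ValueError("unsupported except pattern: %r" % source)
    _prefix, type_name, _rhs_prefix, rhs_type, rhs_ref = match.groups()
    data = ET.Element("data", {"type": type_name})
    excluded = ET.SubElement(data, "except")
    if rhs_ref is not None:
        # Right-hand side is an identifier: a reference to a named pattern.
        ET.SubElement(excluded, "ref", {"name": rhs_ref})
    else:
        # Right-hand side is another datatype name.
        ET.SubElement(excluded, "data", {"type": rhs_type})
    return ET.tostring(data, encoding="unicode")


print(except_to_rng("xsd:anyURI - xenc_KeyAgreementAlgorithms"))
print(except_to_rng("xsd:anyURI - xsd:anyURI"))
```

The first call yields a `<ref>` inside `<except>`, which is the shape the proposed `CNAME MINUS identifier` production is aiming for; the second yields a nested `<data>`, the case that production would miss.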

I think I'll probably go with one small change to fix this case, since people aren't opening issues for the other cases. Then maybe later start pulling things like datatypeName, datatypeValue, and exceptPattern out of primary so each case doesn't have to be explicitly called out under primary (though they could).