RazrFalcon / xmlparser

A low-level, pull-based, zero-allocation XML 1.0 parser.
Apache License 2.0
130 stars 16 forks source link

Do not expand predefined references in Stream::consume_reference #18

Closed Jesse-Bakker closed 3 years ago

Jesse-Bakker commented 3 years ago

When processing entity declaration, only internal references should be expanded. Predefined references should be bypassed. Just parsing these as Entity instead of Char allows for this

RazrFalcon commented 3 years ago

Can you add a test case for this (I will fix ci soon)? And can you confirm that it doesn't brake roxmltree tests?

Jesse-Bakker commented 3 years ago

Yea sure

Jesse-Bakker commented 3 years ago

I added a test. There are test failures in roxmltree though, but that was to be expected. Predefined entities should be expanded when their containing entity is referenced in the xml document.

For context, the reason for this change is that I ran into some issues with expansion of DTD entity references where the entity contained an element with a name containing character references. To fix that properly in roxmltree, I need this change, as character references are expanded in entities, but predefined entity references are not. See https://www.w3.org/TR/xml/#sec-entexpand for examples.

A minimal example where this matters is:

<!DOCTYPE root
<!ENTITY ent "<&#xE14;>&lt;</&#xE14;>">
]>
<root>
&ent;
</root>

The entity should expand to:

<!ENTITY ent "<ด>&lt;</ด>">

Where &le; is not expanded. If &lt; were expanded to <, the entity definition would be invalid.

RazrFalcon commented 3 years ago

Hm... Yes, this part of XML is pretty nightmarish. I've spend a lot of time on it and Im still not sure how it suppose to work.

RazrFalcon commented 3 years ago

Also, could you add this test to tests/integration/doctype.rs

RazrFalcon commented 3 years ago

Oh, this is part of the steam API, not tokenizer... Can you put those tests into the doc comment, like this:

    /// Consumes an XML reference.
    ///
    /// Consumes according to: <https://www.w3.org/TR/xml/#NT-Reference>
    ///
    /// # Errors
    ///
    /// - `InvalidReference`
    ///
    /// # Examples
    ///
    /// Escaped character:
    ///
    /// ```
    /// use xmlparser::{Stream, Reference};
    ///
    /// let mut stream = Stream::from("&#xe13;");
    /// assert_eq!(stream.consume_reference().unwrap(), Reference::Char('\u{e13}'));
    /// ```
    ///
    /// Predefined references will not be expanded.
    /// [Details.]( https://www.w3.org/TR/xml/#sec-entexpand)
    ///
    /// ```
    /// use xmlparser::{Stream, Reference};
    ///
    /// let mut stream = Stream::from("&le;");
    /// assert_eq!(stream.consume_reference().unwrap(), Reference::Entity("le"));
    /// ```
    ///
    /// Named reference:
    ///
    /// ```
    /// use xmlparser::{Stream, Reference};
    ///
    /// let mut stream = Stream::from("&other;");
    /// assert_eq!(stream.consume_reference().unwrap(), Reference::Entity("other"));
    pub fn consume_reference(&mut self) -> Result<Reference<'a>> {
        self._consume_reference().map_err(|_| StreamError::InvalidReference)
    }

Sorry once again. I haven't looked into this project for a while.

RazrFalcon commented 3 years ago

Thanks again!

rcoh commented 3 years ago

This breaks roxmltree (or at least it causes a behavior regression). Now, when &lt; et al are passed to roxmltree, the parser will error out.

RazrFalcon commented 3 years ago

Wtf... I have yanked 0.13.4

Jesse-Bakker commented 3 years ago

This was expected to be a breaking change right?

roxmltree needs some changes to expand predefined entities itself in the right places (for example by having them in the entities field of the parser state by default), as right now, these references are to undefined entities.

RazrFalcon commented 3 years ago

I wasn't expecting it to break all entities. This was my mistake. I should have tested everything myself.

RazrFalcon commented 3 years ago

@Jesse-Bakker Here is how lxml parses your example (generated by lxml-ast.py):

Document:
  - Element:
      tag_name: root
      children:
        - Text: "\n"
        - Element:
            tag_name: ด
            children:
              - Text: "<"
        - Text: "\n"

Does this mean that lxml/libxml2 is also broken? I doubt that.

Jesse-Bakker commented 3 years ago

No, lxml is not broken. The entity should be parsed to <!ENTITY ent "<ด>&lt;</ด>">. When that entity is referenced from the body (<root> &ent; </root>, the entity body is parsed and the &lt; is expanded to <, making that the text of the root element.

If, instead of &lt; a character reference to < is used (&#x3c), the entity definition should fail to parse and that is why predefined references should be bypassed when parsing EntityValue, but not when parsing content. More info here: https://www.w3.org/TR/REC-xml/#entproc

RazrFalcon commented 3 years ago

Can you provide an example that should fail?

Jesse-Bakker commented 3 years ago

Should fail:

<!DOCTYPE root
<!ENTITY ent "<&#xE14;>&#x3c;</&#xE14;>">
]>
<root>
&ent;
</root>

, because &#x3c (<) will make the entity expand to <!ENTITY ent "<ด><</ด>"> which is invalid.

Should succeed:

<!DOCTYPE root
<!ENTITY ent "<&#xE14;>&lt;</&#xE14;>">
]>
<root>
&ent;
</root>

, because the entity will expand to <!ENTITY ent "<ด>&lt;</ด>">

RazrFalcon commented 3 years ago

Well, I have no idea how to fix it in roxmltree. I've created https://github.com/RazrFalcon/roxmltree/issues/50 for now. Maybe I will be able to wrap my head around it.