Closed Jesse-Bakker closed 3 years ago
Can you add a test case for this (I will fix ci soon)? And can you confirm that it doesn't brake roxmltree tests?
Yea sure
I added a test. There are test failures in roxmltree though, but that was to be expected. Predefined entities should be expanded when their containing entity is referenced in the xml document.
For context, the reason for this change is that I ran into some issues with expansion of DTD entity references where the entity contained an element with a name containing character references. To fix that properly in roxmltree, I need this change, as character references are expanded in entities, but predefined entity references are not. See https://www.w3.org/TR/xml/#sec-entexpand for examples.
A minimal example where this matters is:
<!DOCTYPE root
<!ENTITY ent "<ด><</ด>">
]>
<root>
&ent;
</root>
The entity should expand to:
<!ENTITY ent "<ด><</ด>">
Where ≤
is not expanded. If <
were expanded to <
, the entity definition would be invalid.
Hm... Yes, this part of XML is pretty nightmarish. I've spend a lot of time on it and Im still not sure how it suppose to work.
Also, could you add this test to tests/integration/doctype.rs
Oh, this is part of the steam API, not tokenizer... Can you put those tests into the doc comment, like this:
/// Consumes an XML reference.
///
/// Consumes according to: <https://www.w3.org/TR/xml/#NT-Reference>
///
/// # Errors
///
/// - `InvalidReference`
///
/// # Examples
///
/// Escaped character:
///
/// ```
/// use xmlparser::{Stream, Reference};
///
/// let mut stream = Stream::from("ณ");
/// assert_eq!(stream.consume_reference().unwrap(), Reference::Char('\u{e13}'));
/// ```
///
/// Predefined references will not be expanded.
/// [Details.]( https://www.w3.org/TR/xml/#sec-entexpand)
///
/// ```
/// use xmlparser::{Stream, Reference};
///
/// let mut stream = Stream::from("≤");
/// assert_eq!(stream.consume_reference().unwrap(), Reference::Entity("le"));
/// ```
///
/// Named reference:
///
/// ```
/// use xmlparser::{Stream, Reference};
///
/// let mut stream = Stream::from("&other;");
/// assert_eq!(stream.consume_reference().unwrap(), Reference::Entity("other"));
pub fn consume_reference(&mut self) -> Result<Reference<'a>> {
self._consume_reference().map_err(|_| StreamError::InvalidReference)
}
Sorry once again. I haven't looked into this project for a while.
Thanks again!
This breaks roxmltree (or at least it causes a behavior regression). Now, when <
et al are passed to roxmltree
, the parser will error out.
Wtf... I have yanked 0.13.4
This was expected to be a breaking change right?
roxmltree needs some changes to expand predefined entities itself in the right places (for example by having them in the entities
field of the parser state by default), as right now, these references are to undefined entities.
I wasn't expecting it to break all entities. This was my mistake. I should have tested everything myself.
@Jesse-Bakker Here is how lxml
parses your example (generated by lxml-ast.py):
Document:
- Element:
tag_name: root
children:
- Text: "\n"
- Element:
tag_name: ด
children:
- Text: "<"
- Text: "\n"
Does this mean that lxml/libxml2 is also broken? I doubt that.
No, lxml is not broken. The entity should be parsed to <!ENTITY ent "<ด><</ด>">
. When that entity is referenced from the body (<root> &ent; </root>
, the entity body is parsed and the <
is expanded to <
, making that the text of the root
element.
If, instead of <
a character reference to <
is used (<
), the entity definition should fail to parse and that is why predefined references should be bypassed when parsing EntityValue
, but not when parsing content. More info here: https://www.w3.org/TR/REC-xml/#entproc
Can you provide an example that should fail?
Should fail:
<!DOCTYPE root
<!ENTITY ent "<ด><</ด>">
]>
<root>
&ent;
</root>
, because <
(<
) will make the entity expand to <!ENTITY ent "<ด><</ด>">
which is invalid.
Should succeed:
<!DOCTYPE root
<!ENTITY ent "<ด><</ด>">
]>
<root>
&ent;
</root>
, because the entity will expand to <!ENTITY ent "<ด><</ด>">
Well, I have no idea how to fix it in roxmltree. I've created https://github.com/RazrFalcon/roxmltree/issues/50 for now. Maybe I will be able to wrap my head around it.
When processing entity declaration, only internal references should be expanded. Predefined references should be bypassed. Just parsing these as
Entity
instead ofChar
allows for this