RazrFalcon / xmlparser

A low-level, pull-based, zero-allocation XML 1.0 parser.
Apache License 2.0
130 stars 16 forks source link

Add function to unescape string #26

Closed dragazo closed 1 year ago

dragazo commented 1 year ago

For the Token::Attribute variant and similar, the StrSpan returned can get the original string value, but this library doesn't seem to have a function to unescape xml-encoded strings, so I've had to hand-roll my own version of this. It would be great if this crate had something like:

#[cfg(feature = "alloc")]
pub fn unescape(&str) -> Result<alloc::borrow::Cow<str>, UnescapeError>;
RazrFalcon commented 1 year ago

You cannot properly unescape XML strings without parsing DTD. This is what roxmltree does for you. Not to mention that text node strings, attribute strings and CDATA strings all have to be handled differently. XML is a mess. I don't see a way it can be implemented on the xmlparser side. Too much complexity and context required.

dragazo commented 1 year ago

Oof, I wasn't aware xml had a macro-like system like that... Still, I think this crate could at least provide an unescape for the builtin xml escape codes (not custom ones that would need context from the file). As long as it's documented, or named intentionally (e.g., unescape_builtins), I think it would be fine. As it stands, users have to hand-roll an unescape function even for trivial cases like foo&amp;bar.

If something like that would be desirable, I'd be willing to make a PR for it. I have a version of it already in my project, and I could just optimize it a bit.

RazrFalcon commented 1 year ago

I will think about it. As I mentioned above, it's not that simple, since text node text and attribute value text have to be escaped differently. So the API would be a bit convoluted.

dragazo commented 1 year ago

Short of reading a spec manual, I haven't been able to find anything saying the (builtin) escape sequences are different for text nodes and attribute values. I've seen cases where, e.g., > doesn't have to be escaped in attr text specifically (and similar), like v="a&gt;b" or v="a>b". But as far as I can tell they both get parsed as v="a>b", so the same unescape logic is used as text nodes. CDATA has different semantics, but that would just be the user's responsibility to know not to unescape the content.

RazrFalcon commented 1 year ago

The unescape function should unescape new lines as well. And they have to be handled differently.

If you want to escape just &gt; and friends - you can easily do it on your side.

dragazo commented 1 year ago

The unescape function should unescape new lines as well. And they have to be handled differently.

While text nodes can contain literal new line characters, as far as I can tell they still unambiguously unescape CR and LF (&#xD; and &#xA;) if present regardless, so it seems like the same unescape function as attr text would be fine, wouldn't it?

If you want to escape just &gt; and friends - you can easily do it on your side.

I mean it's also the &#[0-9]+; and &#x[0-9a-fA-F]+; numeric forms. That's kind of non-trivial and literally every user that needs unescaping would have to hand-roll their own version like I did. I mean it's not crazy hard, but very annoying. The only way I'd say that's "easy" for users is .replace(...).replace(...).replace(...)... for all the standard escape sequences, which is wildly inefficient and doesn't work for the numeric encoded escapes. And if there was a single impl in this crate, we could add more efficient options like zero-copy (when possible).

RazrFalcon commented 1 year ago

The current text unescape code in roxmltree is nearly 400 LOC. It's far from trivial and I'm not sure it would be easy to port to xmlparser. Maybe one day. For now - use your own implementation.

I mean it's also the &#[0-9]+; and &#x[0-9a-fA-F]+; numeric forms

This is already available via Stream::try_consume_reference. See example here.

dragazo commented 1 year ago

The current text unescape code in roxmltree is nearly 400 LOC. It's far from trivial and I'm not sure it would be easy to port to xmlparser. Maybe one day. For now - use your own implementation.

Well they also do context-based entity resolving, which I was suggesting we not handle - just basic escape codes. But even better:

This is already available via Stream::try_consume_reference. See example here.

If that exists, why not just provide a function like unescape_with from quick-xml:

pub fn unescape_with<'a, F: Fn(&str) -> &str>(content: &'a str, entity_resolver: F) -> Cow<'a, str>;

That could use the existing stream features this crate already has, consumes characters that are encoundered (including numeric escape sequences which are already supported), and any entity references are just handed off to the passed-in entity resolver.

Then it really would be easy for users to hand roll their own since we added the generic form for any resolver:

fn my_unescape(s: &str) -> Cow<str> {
    xmlparser::unescape_with(s, |x| match x {
        "name" => "marco polo",
        "age" => "185",
        "gt" => ">",
        ....
    })
}
RazrFalcon commented 1 year ago

If that exists, why not just provide a function like unescape_with from quick-xml:

Because it's not XML complaint! You want a function that is good enough for you. I want a function that is correct. Period.

There is no "just", "basic" or "simple" in XML. XML is hated for a reason.

As I said before, there are no plans to implement such functionality. You have to write your own.

dragazo commented 1 year ago

The point is it would be up to the user to have XML-compliant entity resolving if they want that, cause we'd only handle numeric escape codes, which should always be unescaped in every context. I fail to see how that's such an unacceptable compromise - there would be no non-compliant logic in this crate. I just want something that's most convenient for users. But since you're so adamantly against it, I'll just make my own util crate.

RazrFalcon commented 1 year ago

The point is it would be up to the user to have XML-compliant entity resolving if they want that

The point is that XML text unescaping doesn't produce a string. It produces tokens. Here is a valid XML:

<!DOCTYPE test [
    <!ENTITY p '<p/>'>
]>
<root>&p;</root>

How exactly do you plan to unescape it? It must produce a p element, no a <p/> string.

Read the spec before complaining.

I just want something that's most convenient for users.

It already exist and is called roxmltree.

dragazo commented 1 year ago

Ok, well that's a valid reason. God, XML is a mess...

I just can't use roxmltree cause it's not fully no-std compatible. alloc::sync::Arc is only available on some platforms, which excludes most embedded applications like mine.