drmohundro / SWXMLHash

Simple XML parsing in Swift
MIT License
1.41k stars 205 forks

XMLElement.innerXML returns invalid XML for an attribute with an embedded quote #269

Closed pryder-fleetaero closed 1 year ago

pryder-fleetaero commented 1 year ago

I'm currently working on a project which uses SWXMLHash to "shred" the returned XML response from a web service to get to the useful deeply embedded content, at which point it passes that xml string (i.e. the deeply nested xml element of the original response) to another XML parser (i.e. XMLMapper) to do actual mapping to various structs etc.

The issue we're encountering arises when the original has an attribute containing an embedded quote " in the correct XML-escaped form &quot;, i.e.

<root>
    <test badAttribute="a&quot;b"/>
</root>

Then SWXMLHash's XMLElement.innerXML and String(describing: <XMLElement>) both return XML where the embedded attribute quote isn't correctly escaped. I.e.

<root>
    <test badAttribute="a"b">
    </test>
</root>

To Reproduce

Steps to reproduce the behavior:

XMLHash.parse("<root><test badAttribute=\"a&quot;b\"/></root>").element!.innerXML

Or:

String(describing: XMLHash.parse("<root><test badAttribute=\"a&quot;b\"/></root>"))

Expected behavior

<root><test badAttribute="a&quot;b"></test></root>

drmohundro commented 1 year ago

So, this is actually expected/intended behavior. See both https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined_entities_in_XML and https://stackoverflow.com/questions/1328538/how-do-i-escape-ampersands-in-xml-so-they-are-rendered-as-entities-in-html.

When the &quot; characters are read by the XML parser, they're now treated as ". If there is XML content being passed in that has nested content, that nested content should either go in a CDATA section or it will have to be sanitized/escaped first.
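
To illustrate the point above, here is a minimal sketch that feeds the same input straight to Foundation's XMLParser, without SWXMLHash in between. The `AttributeCollector` class and `parsedAttributes(in:)` helper are hypothetical names made up for this example; only `XMLParser` and its delegate callback are real Foundation API. It is meant to show that the attribute value reaches the delegate already decoded, so anything that later re-serializes it has to escape it again.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

/// Hypothetical helper: records every attribute value XMLParser reports.
final class AttributeCollector: NSObject, XMLParserDelegate {
    var seen: [String: String] = [:]

    func parser(_ parser: XMLParser,
                didStartElement elementName: String,
                namespaceURI: String?,
                qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        // The values in attributeDict are already entity-decoded by the parser.
        for (key, value) in attributeDict {
            seen["\(elementName).\(key)"] = value
        }
    }
}

/// Hypothetical helper: parses `xml` and returns "element.attribute" -> value.
func parsedAttributes(in xml: String) -> [String: String] {
    let collector = AttributeCollector()
    let parser = XMLParser(data: Data(xml.utf8))
    parser.delegate = collector
    parser.parse()
    return collector.seen
}

let attrs = parsedAttributes(in: "<root><test badAttribute=\"a&quot;b\"/></root>")
// Per the behavior described above, this is the decoded value with a literal quote.
print(attrs["test.badAttribute"] ?? "nil")
```

If the decoded value is written back between double quotes unchanged, the result is the `badAttribute="a"b"` output from the original report.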

You can see that JavaScript XML parsing behaves the same way:

const xmlStr = '<root><test badAttribute=\"a&quot;b\"/></root>';
const parser = new DOMParser();
const doc = parser.parseFromString(xmlStr, "application/xml");

doc.querySelector('test');

pryder-fleetaero commented 1 year ago

Ah, interesting. I was under the incorrect assumption, then, that any property etc. that returned an "XML" type representation would still have its XML entities present, so that it would still be considered valid XML in a parsing context. The underlying problem is that we take the child element we home in on using SWXMLHash and then pass it to an XMLMapper struct, which uses NSXMLParser under the hood and throws an error on the <test badAttribute="a"b"></test> representation (it wants the embedded " escaped as &quot;). Essentially, we're using SWXMLHash kind of like the .selectSingleNode(xPath) in many DOM parsers, just to cut to the element of interest and then parse its XML standalone.

Happy that this behaviour is by design. In our case, we've worked around the issue with an XMLElement/XMLAttribute extension that produces a representation with the XML entities still included in the attribute values, which is then valid when passed to NSXMLParser (/XMLMapper).
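
The extension itself isn't shown in this thread, so the following is only a rough sketch of the idea, not the actual workaround. `escapeXMLAttribute(_:)` and `startTag(name:attributes:)` are hypothetical helpers written without any SWXMLHash dependency; in a real extension the name/value pairs would presumably come from the element's attributes (assumed here to be reachable through something like `allAttributes`, with each attribute's `name` and `text`).

```swift
/// Hypothetical helper: re-escapes the five predefined XML entities so the
/// value is safe inside a double-quoted attribute.
func escapeXMLAttribute(_ value: String) -> String {
    var out = ""
    out.reserveCapacity(value.count)
    for ch in value {
        switch ch {
        case "&": out += "&amp;"
        case "<": out += "&lt;"
        case ">": out += "&gt;"
        case "\"": out += "&quot;"
        case "'": out += "&apos;"
        default: out.append(ch)
        }
    }
    return out
}

/// Hypothetical helper: renders a start tag from already-decoded attribute
/// values, escaping each value on the way out.
func startTag(name: String, attributes: [(name: String, value: String)]) -> String {
    let rendered = attributes
        .map { " \($0.name)=\"\(escapeXMLAttribute($0.value))\"" }
        .joined()
    return "<\(name)\(rendered)>"
}

print(startTag(name: "test", attributes: [(name: "badAttribute", value: "a\"b")]))
// prints: <test badAttribute="a&quot;b">
```

Escaping character by character means `&` is never double-escaped, which can happen with chained string replacements applied in the wrong order.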

pryder-fleetaero commented 1 year ago

One thing I did note with your example, though, is that if you query the .innerHTML property (which I assume is analogous to the library's .innerXML), it does include the XML entity for the escaping.

drmohundro commented 1 year ago

Interesting, thanks for sharing! I wonder if that is an HTML vs XML difference. To be honest, I wasn't aware of this prior to your submission... I would have guessed text within quotes wouldn't be parsed either, but it does seem consistent. That was being handled by the underlying NSXMLParser itself, though.