Blacksmoke16 / oq

A performant and portable jq wrapper that facilitates the consumption and output of formats other than JSON, using jq filters to transform the data.
https://blacksmoke16.github.io/oq/
MIT License

Ignore character data in mixed content XML #10

Open nichtich opened 5 years ago

nichtich commented 5 years ago

As discussed in #7, document-oriented XML requires a different JSON serialization anyway, and supporting mixed content XML in the current JSON form is error-prone. Even this simple case seems to be handled incorrectly (note the additional whitespace after x):

echo '{ "y": { "#text": "x", "z": "1" } }' | oq -o xml .
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <y>x    <z>1</z>
  </y>
</root>

The reason is that whitespace handling in mixed content elements requires a more sophisticated algorithm (see this explanation).
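To make the whitespace problem concrete, here is a minimal sketch using Python's standard `xml.dom.minidom`. It is only an analogy: it does not use libxml2 or oq, and its exact output differs from the output shown above. It shows that once a pretty-printer indents the children of a mixed content element, the indentation becomes part of the character data.

```python
from xml.dom import minidom

# Mixed content: <y> holds both character data ("x") and a child element.
compact = "<y>x<z>1</z></y>"

doc = minidom.parseString(compact)
original_text = doc.documentElement.firstChild.data

# Pretty-printing puts every child node, including the text node, on its own
# indented line, so the added whitespace ends up inside the text node.
pretty = doc.toprettyxml(indent="  ")
reparsed = minidom.parseString(pretty)
pretty_text = reparsed.documentElement.firstChild.data

print(repr(original_text))
print(repr(pretty_text))
```

The two printed values differ only in surrounding whitespace, which is exactly the part an indenting serializer cannot add safely in mixed content.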

It would be better to ignore character data (#text) when there are child elements:

echo '{ "y": { "#text": "x", "z": "1" } }' | oq -o xml .
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <y>
    <z>1</z>
  </y>
</root>

Blacksmoke16 commented 5 years ago

@nichtich This seems related to a bug in libxml2: https://bugzilla.gnome.org/show_bug.cgi?id=760160

Blacksmoke16 commented 5 years ago

I checked with libxml2 devs and they said that this is the expected behavior as they do not alter whitespace.

Given that I can't think of a way to replicate

<p>  The <emph> cat </emph> ate  the <foreign>grande croissant</foreign>. I didn't </p>

since #text would only allow defining the first "The", I think this is worth talking about.

@nichtich Do you think it's worth explicitly removing it, versus just leaving that up to the end user? The code to implement this change feels a bit hacky to me. I'll see if I can figure out a better solution.

nichtich commented 5 years ago

> I checked with libxml2 devs and they said that this is the expected behavior as they do not alter whitespace.

It's a bug if you use libxml2 with an indent value other than 0. This works:

echo '{ "y": { "#text": "x ", "z": "1" } }' | oq -o xml --indent 0 .
<?xml version="1.0" encoding="UTF-8"?>
<root><y>x <z>1</z></y></root>

You could ignore character data in mixed content XML unless indent is 0.
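The rule suggested here can be sketched as a pre-processing step on the parsed JSON. This is a hypothetical Python illustration, not oq's actual code: the function names are made up, and it assumes that `#text` holds character data, that keys starting with `@` are attributes, and that every other key becomes a child element.

```python
import json


def is_child_element(key):
    # Assumption: "#text" is character data and "@name" keys are attributes;
    # any other key is serialized as a child element.
    return key != "#text" and not key.startswith("@")


def drop_mixed_text(value, indent):
    """Return a copy of value without "#text" in mixed content objects.

    Nothing is removed when indent is 0, because no whitespace is added then.
    """
    if isinstance(value, list):
        return [drop_mixed_text(item, indent) for item in value]
    if not isinstance(value, dict):
        return value
    mixed = (
        indent != 0
        and "#text" in value
        and any(is_child_element(key) for key in value)
    )
    return {
        key: drop_mixed_text(item, indent)
        for key, item in value.items()
        if not (mixed and key == "#text")
    }


data = json.loads('{ "y": { "#text": "x", "z": "1" } }')
print(json.dumps(drop_mixed_text(data, indent=2)))  # {"y": {"z": "1"}}
print(json.dumps(drop_mixed_text(data, indent=0)))  # {"y": {"#text": "x", "z": "1"}}
```

An element that has only `#text` and attributes keeps its text, since it is not mixed content.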

Blacksmoke16 commented 5 years ago

This would require some logic that checks whether there are other keys that aren't #text or @*, and if so, doesn't emit the node for the #text value.

But because of #18, I no longer load anything into memory and only read one token at a time, which makes this much trickier.
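One possible way around the single-token constraint, sketched in Python purely as an illustration: hold back only the `#text` scalar of each open object until either a child element key shows up (drop the text) or the object ends (emit the text). The token format and the `tokens` helper are assumptions made for this example and do not reflect oq's parser. A side effect is that a surviving `#text` is emitted after the attributes of its object.

```python
def tokens(value):
    """Stand-in for a pull parser: yield flat tokens for a parsed JSON value."""
    if isinstance(value, dict):
        yield ("start_object", None)
        for key, item in value.items():
            yield ("key", key)
            yield from tokens(item)
        yield ("end_object", None)
    elif isinstance(value, list):
        yield ("start_array", None)
        for item in value:
            yield from tokens(item)
        yield ("end_array", None)
    else:
        yield ("scalar", value)


def skip_mixed_text(stream):
    """Filter a token stream, dropping "#text" from objects with child elements."""
    stack = []             # one small state dict per open object
    awaiting_text = False  # True right after a "#text" key was read
    for token in stream:
        kind, payload = token
        if awaiting_text:
            awaiting_text = False
            if kind == "scalar":
                state = stack[-1]
                if not state["has_child"]:
                    state["held"] = token  # hold back until the object ends
                continue
            yield ("key", "#text")  # non-scalar "#text": pass through unchanged
        if kind == "start_object":
            stack.append({"held": None, "has_child": False})
        elif kind == "end_object":
            state = stack.pop()
            if state["held"] is not None:
                # No child element was seen, so the text is safe to emit.
                yield ("key", "#text")
                yield state["held"]
        elif kind == "key":
            if payload == "#text":
                awaiting_text = True
                continue
            if not payload.startswith("@"):
                # A child element: discard any held text for this object.
                stack[-1]["has_child"] = True
                stack[-1]["held"] = None
        yield token


example = {"y": {"#text": "x", "z": "1"}}
for token in skip_mixed_text(tokens(example)):
    print(token)
```

The extra memory is one held scalar per open object, so the approach stays close to streaming. The indent 0 exception is left out to keep the sketch short.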