Open nichtich opened 5 years ago
@nichtich This seems related to a bug in libxml2: https://bugzilla.gnome.org/show_bug.cgi?id=760160.
I checked with libxml2 devs and they said that this is the expected behavior as they do not alter whitespace.
Given that I can't think of a way to replicate
<p> The <emph> cat </emph> ate the <foreign>grande croissant</foreign>. I didn't </p>
since #text would only allow defining the first "The", I think this is worth talking about.
@nichtich Do you think it's worth explicitly removing it, versus just leaving that up to the end user? The code to implement this change feels a bit hacky to me. I'll see if I can figure out a better solution.
> I checked with libxml2 devs and they said that this is the expected behavior as they do not alter whitespace.
It's a bug if you use libxml2 with an indent value other than 0. This works:
echo '{ "y": { "#text": "x ", "z": "1" } }' | oq -o xml --indent 0 .
<?xml version="1.0" encoding="UTF-8"?>
<root><y>x <z>1</z></y></root>
You could ignore character data in mixed content XML unless indent is 0.
This would require some logic that checks whether there are other keys that aren't #text or @*, and if so, doesn't emit the #text value's node.
But because of #18, I'm no longer loading anything into memory and only read one token at a time, which makes this much trickier.
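For illustration, the key check described above could look roughly like this on a fully loaded object. This is only a Python sketch under the assumption that the whole document is in memory (which is exactly what the streaming reader from #18 no longer does); the function names and the `indent` parameter are hypothetical and not part of oq.

```python
import json


def has_child_elements(obj: dict) -> bool:
    """Return True if any key is neither "#text" nor an attribute ("@...")."""
    return any(key != "#text" and not key.startswith("@") for key in obj)


def drop_mixed_text(value, indent: int = 2):
    """Recursively drop "#text" from objects that also contain child elements.

    Hypothetical helper: with indent == 0 nothing is dropped, since that
    case already serializes as expected.
    """
    if isinstance(value, list):
        return [drop_mixed_text(item, indent) for item in value]
    if not isinstance(value, dict):
        return value
    skip_text = indent != 0 and has_child_elements(value)
    return {
        key: drop_mixed_text(child, indent)
        for key, child in value.items()
        if not (skip_text and key == "#text")
    }


doc = json.loads('{ "y": { "#text": "x ", "z": "1" } }')
print(json.dumps(drop_mixed_text(doc)))            # {"y": {"z": "1"}}
print(json.dumps(drop_mixed_text(doc, indent=0)))  # {"y": {"#text": "x ", "z": "1"}}
```

A streaming version would have to buffer the #text value until the end of the object is reached, because a child key may still follow it; that is the part that makes it tricky.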
As discussed at #7, document-oriented XML requires another JSON serialization anyway, and supporting mixed content XML in the current JSON form is error-prone. Even this simple case seems to be handled wrong (note the additional whitespace after x). The reason is that whitespace handling in mixed content elements requires a more sophisticated algorithm (see this explanation).
Better ignore character data (#text) when there are child elements: