Closed pzel closed 9 years ago
Parsing works fine:
(exml@x4)8> exml:to_binary(element(2, {ok, _} = exml:parse(M2Txt))).
(<0.56.0>) call exml:parse(<<"<frob>&</frob>">>)
(<0.56.0>) returned from exml:parse/1 -> {ok,
{xmlel,<<"frob">>,[],
[{xmlcdata,<<"&">>}]}}
(<0.56.0>) call exml:to_binary({xmlel,<<"frob">>,[],[{xmlcdata,<<"&">>}]})
(<0.56.0>) call exml:to_iolist({xmlel,<<"frob">>,[],[{xmlcdata,<<"&">>}]})
(<0.56.0>) call exml:attrs_to_iolist([],[])
(<0.56.0>) returned from exml:attrs_to_iolist/2 -> []
(<0.56.0>) call exml:to_iolist([{xmlcdata,<<"&">>}])
(<0.56.0>) call exml:'-to_iolist/1-fun-0-'({xmlcdata,<<"&">>})
(<0.56.0>) call exml:to_iolist({xmlcdata,<<"&">>})
(<0.56.0>) returned from exml:to_iolist/1 -> [<<"&">>]
(<0.56.0>) returned from exml:'-to_iolist/1-fun-0-'/1 -> [<<"&">>]
(<0.56.0>) returned from exml:to_iolist/1 -> [[<<"&">>]]
(<0.56.0>) returned from exml:to_iolist/1 -> ["<",<<"frob">>,[],">",
[[<<"&">>]],
"</",<<"frob">>,">"]
(<0.56.0>) returned from exml:to_binary/1 -> <<"<frob>&</frob>">>
<<"<frob>&</frob>">>
It's printing that causes the problem:
(exml@x4)9> exml:to_iolist([{xmlcdata,<<"&">>}]).
(<0.56.0>) call exml:to_iolist([{xmlcdata,<<"&">>}])
(<0.56.0>) call exml:'-to_iolist/1-fun-0-'({xmlcdata,<<"&">>})
(<0.56.0>) call exml:to_iolist({xmlcdata,<<"&">>})
(<0.56.0>) returned from exml:to_iolist/1 -> [<<"&">>]
(<0.56.0>) returned from exml:'-to_iolist/1-fun-0-'/1 -> [<<"&">>]
(<0.56.0>) returned from exml:to_iolist/1 -> [[<<"&">>]]
[[<<"&">>]]
BUT (exml.erl:56
):
to_iolist(#xmlcdata{content = Content}) ->
%% it's caller's responsibility to make sure that
%% #xmlcdata's content is escaped properly!
[Content]. %% ensure we return io*list*
To cut long story short, I agree the sound plan is to make these operations reverses of each other, but in the transitional period it might break all call sites where the caller actually makes sure the CDATAs are escaped.
I understand. This just shows that the function names are wrong. A function named exml:show/1
or exml:print/1
can be reasonably expected to produce a legible string/binary, but not necessarily one fit for computer consumption.
The name exml:to_binary/1
suggests to the user that this function can be depended upon to just serialize the data to a binary format. (like the term_to_binary / binary_to_term
dyad).
Expecting the user to search an entire xml tree, find #xmlcdata{}
s, and make sure they are serializable breaks the rule of least surprise.
If exml
has read some data and deemed it well-formed, it should be able to produce that well-formed data again. Channeling Jef Raskin: The system should treat all user input as sacred.
Let's consider making a new, major version that escapes CDATA automatically. What are your thoughts?
In other news: PropEr tests show that many valid utf8 sequences cannot be binarized and unbinarized safely.
I've started fixing this, in order to release a non-backwards compatible fix. As usual, the rabbit hole is deeper than I could've imagined (i.e. tests in test/eqc
don't pass even without any changes).