esl / exml

XML parsing library in Erlang
Apache License 2.0
30 stars 23 forks source link

Exml to binary produces malformed XML #11

Closed pzel closed 9 years ago

pzel commented 9 years ago
1> M1Txt = <<"<frob>hello</frob>">>.
<<"<frob>hello</frob>">>

2> exml:to_binary(element(2, {ok, _} = exml:parse(M1Txt))).
<<"<frob>hello</frob>">>

3> M2Txt = <<"<frob>&amp;</frob>">>.
<<"<frob>&amp;</frob>">>

4> exml:to_binary(element(2, {ok, _} = exml:parse(M2Txt))).
<<"<frob>&</frob>">>

5> exml:parse(exml:to_binary(element(2, {ok, _} = exml:parse(M2Txt)))).
{error,{"not well-formed (invalid token)",
        <<"<stream><frob>&</frob></stream>">>}}
erszcz commented 9 years ago

Parsing works fine:

(exml@x4)8> exml:to_binary(element(2, {ok, _} = exml:parse(M2Txt))).
(<0.56.0>) call exml:parse(<<"<frob>&amp;</frob>">>)
(<0.56.0>) returned from exml:parse/1 -> {ok,
                                          {xmlel,<<"frob">>,[],
                                           [{xmlcdata,<<"&">>}]}}
(<0.56.0>) call exml:to_binary({xmlel,<<"frob">>,[],[{xmlcdata,<<"&">>}]})
(<0.56.0>) call exml:to_iolist({xmlel,<<"frob">>,[],[{xmlcdata,<<"&">>}]})
(<0.56.0>) call exml:attrs_to_iolist([],[])
(<0.56.0>) returned from exml:attrs_to_iolist/2 -> []
(<0.56.0>) call exml:to_iolist([{xmlcdata,<<"&">>}])
(<0.56.0>) call exml:'-to_iolist/1-fun-0-'({xmlcdata,<<"&">>})
(<0.56.0>) call exml:to_iolist({xmlcdata,<<"&">>})
(<0.56.0>) returned from exml:to_iolist/1 -> [<<"&">>]
(<0.56.0>) returned from exml:'-to_iolist/1-fun-0-'/1 -> [<<"&">>]
(<0.56.0>) returned from exml:to_iolist/1 -> [[<<"&">>]]
(<0.56.0>) returned from exml:to_iolist/1 -> ["<",<<"frob">>,[],">",
                                              [[<<"&">>]],
                                              "</",<<"frob">>,">"]
(<0.56.0>) returned from exml:to_binary/1 -> <<"<frob>&</frob>">>
<<"<frob>&</frob>">>

It's printing that causes the problem:

(exml@x4)9> exml:to_iolist([{xmlcdata,<<"&">>}]).
(<0.56.0>) call exml:to_iolist([{xmlcdata,<<"&">>}])
(<0.56.0>) call exml:'-to_iolist/1-fun-0-'({xmlcdata,<<"&">>})
(<0.56.0>) call exml:to_iolist({xmlcdata,<<"&">>})
(<0.56.0>) returned from exml:to_iolist/1 -> [<<"&">>]
(<0.56.0>) returned from exml:'-to_iolist/1-fun-0-'/1 -> [<<"&">>]
(<0.56.0>) returned from exml:to_iolist/1 -> [[<<"&">>]]
[[<<"&">>]]

BUT (exml.erl:56):

to_iolist(#xmlcdata{content = Content}) ->
    %% it's caller's responsibility to make sure that
    %% #xmlcdata's content is escaped properly!
    [Content]. %% ensure we return io*list*

To cut long story short, I agree the sound plan is to make these operations reverses of each other, but in the transitional period it might break all call sites where the caller actually makes sure the CDATAs are escaped.

pzel commented 9 years ago

I understand. This just shows that the function names are wrong. A function named exml:show/1 or exml:print/1 can be reasonably expected to produce a legible string/binary, but not necessarily one fit for computer consumption.

The name exml:to_binary/1 suggests to the user that this function can be depended upon to just serialize the data to a binary format. (like the term_to_binary / binary_to_term dyad).

Expecting the user to search an entire xml tree, find #xmlcdata{}s, and make sure they are serializable breaks the rule of least surprise.

If exml has read some data and deemed it well-formed, it should be able to produce that well-formed data again. Channeling Jef Raskin: The system should treat all user input as sacred.

Let's consider making a new, major version that escapes CDATA automatically. What are your thoughts?

In other news: PropEr tests show that many valid utf8 sequences cannot be binarized and unbinarized safely.

erszcz commented 9 years ago

I've started fixing this, in order to release a non-backwards compatible fix. As usual, the rabbit hole is deeper than I could've imagined (i.e. tests in test/eqc don't pass even without any changes).