aantron / markup.ml

Error-recovering streaming HTML5 and XML parsers
http://aantron.github.io/markup.ml
MIT License
146 stars 16 forks source link

pretty-printing xml can change its meaning #78

Closed v-gb closed 5 months ago

v-gb commented 8 months ago

Markup.pretty_print knows which whitespace is significant in html. However for xml, it treats all whitespace as insignificant, which means that the pretty printed xml can have a different meaning from the initial xml. For instance, xml in .docx files store bits of the document's texts as elements like <w:t xml:space="preserve">a </w:t>, so any change of whitespace there changes the document.

It'd be nice for the library to behave more nicely. My use case is pretty printing xml in tests so it's easily diffed.

I wrote a couple of change to trim and pretty_print:

But I don't know much about xml, so maybe it's best to discuss first. The first one seems correct, but the second allows more control (in my case, leave all <w:t> elements alone, not just the ones that have xml:space="preserve" on them for instance, which would be easier to read).

Any thoughts about what's best?

aantron commented 8 months ago

As far as I know, the significance of whitespace is decided by each XML format. As a result, Markup can't reliably make decisions about whether whitespace is significant or not, and Markup's pretty_print for XML is meant essentially only for debugging purposes and does not aim to preserve the semantics of documents.

I thought the best way would be for users to define their own pretty-printers if they would like to pretty-print their formats in a way that preserves the meaning of whitespace. There may be other reasons for doing this too, as pretty_print is very basic (due, again, to not knowing anything about the meaning of what it is pretty-printing).

v-gb commented 8 months ago

Even if you're debugging (I was testing, but that's not fundamentally different), "write your own pretty printer" is a non-trivial hurdle, especially if you're not very familiar with the library, and you might run into bugs of your own making.

The xml spec says, talking about the attribute xml:space:

the value "preserve" indicates the intent that applications preserve all the white space.

Now whitespace might be significant regardless even if it's not marked as such, so it may be better to ask the user to pass in a ~preserve_space_below:(name -> attrs -> bool) function to Markup.pretty_print, but that's much easier for the user than writing a custom pretty printer.

aantron commented 8 months ago

It would probably be useful to add a ?preserve_whitespace argument as you suggest, and expose default implementations for HTML and XML, with the XML one respecting xml:space="preserve". The default value of ?preserve_whitespace would be the HTML implementation for backwards compatibility and the most common usage, but it could also be improved slightly by examining the doctype if it is present in the signal stream.