dzeban / cpython

The Python programming language
https://www.python.org/
Other
0 stars 0 forks source link

xml.etree pretty printing #2

Open dzeban opened 7 years ago

dzeban commented 7 years ago

There is a builtin module xml.etree that is very nice, but doesn't have pretty printing. So people use either minidom or even lxml.

It would be nice to have pretty printing in xml.etree.ElementTree.tostring method. Add pretty_print=True|False as an argument.

dzeban commented 6 years ago

tostring calls ElementTree.write method for string serialization. ElementTree.write is a wrapper of _serialize_xml/_serialize_html/_serialize_text.

Serialize methods iterates over element tree and invokes write callback that is passed from caller and which is NOT ElementTree.write.

write callback is obtained from _get_writer and usually it's a file.write or io.BufferedWriter for non-unicode encodings.

In case of tostring it will write to the stream that is io.StringIO of io.BytesIO.

dzeban commented 6 years ago

Options:

  1. In tostring function make indent_stream - wrapper over StringIO/BytesIO - that will handle indentation.
  2. Add indentation to the serializer methods (_serialize_xml, _serialize_html, _serialize_text).

Second is option is more error-prone and less sane because serializers called from write method and the write method is a etree API and can be invoked in many unrelated to the pretty printing cases.

dzeban commented 6 years ago

Writing a StringIO wrapper that does indentation is ugly - it has no information about XML nodes, so to pretty print it has to parse the string again. Indentation wrapper write method accepts strings that may contain a couple of nested elements, it may contain an incomplete element to track the indentation level it has to parse the incoming string again. That sounds more like a standalone pretty printer that can work like a line filter, e.g. jq.

So it's more feasible to add indentation to the serialize methods.

dzeban commented 6 years ago

So, the latest c12474b commit has the working indentation version.

What's left:

dzeban commented 6 years ago

Implemented conditional newlines that saves compatibility with old formatting. This means that old code using ElementTree will work as before. Yay!

dzeban commented 6 years ago

There is an open issue at bpo - https://bugs.python.org/issue14465

dzeban commented 6 years ago

Here is the PR: https://github.com/python/cpython/pull/4016

dzeban commented 6 years ago

Need to add NEWS - https://devguide.python.org/committing/#what-s-new-and-news-entries

dzeban commented 6 years ago

Windows CI failed on test for html, because reference file test.xml.pretty is stored with LF endings, while CPython converts '\n' symbols in pretty printing code to the '\r\n' according to the universal newline.

Rewrite test_pretty_print_html as test_pretty_print_xml but with HTML specifics. Drop test.xml.pretty file.

dzeban commented 6 years ago

Reply on Serhiy thoughts from bpo.

There are 2 points he made there. First, is handling of significant whitespaces, second is performance.

Significant whitespaces

XML

In XML whitespaces are not significant unless xml:space attribute is set to preserve. The thing is that when it's set, it applies to all the children elements unless it's overridden.

So XML serializer should respect this attribute and track it when traversing the XML tree.

HTML

In HTML there are tags that must preserve whitespaces, e.g. pre, script, style. The rules here might be complex, so need to check what minidom, html tidy, lxml and BeautifulSoup tools do. Hope it'll be as simple as making a list of tags where we'll print the content of the element as is - this is already done in ElementTree for script and style tags.

Performance

Serhiy proposed a patch to speed up the serialization in bpo-25881. Need to review it and make further speedups.

What is Modules/_elementtree.c?