Open dzeban opened 7 years ago
`tostring` calls the `ElementTree.write` method for string serialization. `ElementTree.write` is a wrapper around `_serialize_xml`/`_serialize_html`/`_serialize_text`.
The serialize methods iterate over the element tree and invoke a `write` callback that is passed in by the caller and is NOT `ElementTree.write`.
The `write` callback is obtained from `_get_writer`; usually it is a `file.write`, or an `io.BufferedWriter` for non-unicode encodings.
In the case of `tostring`, it writes to a stream that is an `io.StringIO` or `io.BytesIO`.
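A small check of the flow described above, using only the public API. The sample element is made up for illustration. With `encoding="unicode"` the result is a `str`, with the default encoding it is `bytes`, and driving `ElementTree.write` by hand into an `io.StringIO` gives the same text.

```python
import io
from xml.etree import ElementTree as ET

# Sample tree, for illustration only.
root = ET.fromstring("<a><b>x</b></a>")

# tostring() returns text for encoding="unicode" and bytes by default.
as_text = ET.tostring(root, encoding="unicode")
as_bytes = ET.tostring(root)

# The same serialization, driven by hand through ElementTree.write().
stream = io.StringIO()
ET.ElementTree(root).write(stream, encoding="unicode")

print(type(as_text).__name__, type(as_bytes).__name__)
print(stream.getvalue() == as_text)
```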
Options:
1. Have the `tostring` function create an `indent_stream` - a wrapper over `StringIO`/`BytesIO` - that handles indentation.
2. Handle indentation in the serialize methods (`_serialize_xml`, `_serialize_html`, `_serialize_text`).
The second option is more error-prone and less sane, because the serializers are called from the `write` method, and `write` is part of the etree API and can be invoked in many cases unrelated to pretty printing.
Writing a `StringIO` wrapper that does indentation is ugly - it has no information about XML nodes, so to pretty print it has to parse the string again. The wrapper's `write` method receives strings that may contain several nested elements or an incomplete element, so to track the indentation level it has to parse the incoming string again. That sounds more like a standalone pretty printer that works as a line filter, e.g. `jq`.
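To see what a stream wrapper would actually receive, here is a small experiment. `RecordingStream` is a hypothetical helper, not part of ElementTree; it only records each string passed to `write`. The exact chunk boundaries are an implementation detail and may differ between Python versions.

```python
from xml.etree import ElementTree as ET


class RecordingStream:
    """Hypothetical file-like object that records every write() call."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(data)
        return len(data)


root = ET.fromstring("<a><b>x</b></a>")
stream = RecordingStream()
# encoding="unicode" makes ElementTree pass text (str) to stream.write.
ET.ElementTree(root).write(stream, encoding="unicode")

# Each chunk is a bare string fragment with no element or depth information.
for chunk in stream.chunks:
    print(repr(chunk))
```

The chunks join back into the serialized document, but none of them says which element it belongs to or how deep that element is.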
So it's more feasible to add indentation to the serialize methods.
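A minimal sketch of why the serializer is the natural place for this: while recursing over the tree it already knows the current depth. `serialize_pretty` is a hypothetical standalone function, not the code from the commit; it assumes plain elements only (no comments, processing instructions or namespaces) and treats surrounding whitespace as insignificant.

```python
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def serialize_pretty(elem, write, level=0, indent="  "):
    """Hypothetical sketch: write elem with one element per line."""
    pad = indent * level
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.items())
    text = (elem.text or "").strip()
    children = list(elem)
    if not children and not text:
        write(f"{pad}<{elem.tag}{attrs} />\n")
    elif not children:
        write(f"{pad}<{elem.tag}{attrs}>{escape(text)}</{elem.tag}>\n")
    else:
        write(f"{pad}<{elem.tag}{attrs}>\n")
        if text:
            write(f"{pad}{indent}{escape(text)}\n")
        for child in children:
            # The recursion carries the depth, so no re-parsing is needed.
            serialize_pretty(child, write, level + 1, indent)
            tail = (child.tail or "").strip()
            if tail:
                write(f"{pad}{indent}{escape(tail)}\n")
        write(f"{pad}</{elem.tag}>\n")


root = ET.fromstring("<a><b>x</b><c/></a>")
chunks = []
serialize_pretty(root, chunks.append)
print("".join(chunks), end="")
```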
The latest commit, c12474b, has a working version of the indentation.
What's left:
- `_serialize_xml`
- `_serialize_html` and `_serialize_text`
Implemented conditional newlines that preserve compatibility with the old formatting. This means that old code using ElementTree will work as before. Yay!
There is an open issue at bpo - https://bugs.python.org/issue14465
Here is the PR: https://github.com/python/cpython/pull/4016
Need to add NEWS - https://devguide.python.org/committing/#what-s-new-and-news-entries
The Windows CI failed on the HTML test because the reference file test.xml.pretty is stored with LF line endings, while CPython converts the '\n' characters written by the pretty printing code to '\r\n' according to universal newlines handling.
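The newline translation can be reproduced on any platform with `io.TextIOWrapper`. This is only an illustration of the mechanism, not the failing test itself; `written_bytes` is a hypothetical helper, and `newline="\r\n"` stands in for what a text-mode file does by default on Windows.

```python
import io


def written_bytes(text, newline):
    """Hypothetical helper: bytes produced by a text stream for `text`."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    wrapper.write(text)
    wrapper.flush()
    data = raw.getvalue()
    wrapper.detach()  # leave the underlying buffer open
    return data


doc = "<a>\n  <b>x</b>\n</a>\n"
# newline="\n": '\n' is written unchanged (LF reference file).
print(written_bytes(doc, "\n"))
# newline="\r\n": every '\n' written becomes '\r\n'.
print(written_bytes(doc, "\r\n"))
```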
Rewrite `test_pretty_print_html` the same way as `test_pretty_print_xml`, but with HTML specifics. Drop the test.xml.pretty file.
Reply to Serhiy's thoughts from bpo.
He made two points there: the first is the handling of significant whitespace, the second is performance.
In XML, whitespace is not significant unless the `xml:space` attribute is set to `preserve`. The thing is that when it's set, it applies to all child elements unless it's overridden.
So the XML serializer should respect this attribute and track it while traversing the XML tree.
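A sketch of how that tracking could look during traversal. `iter_with_space` is a hypothetical helper, not part of ElementTree or of the patch. It relies on ElementTree exposing the `xml:space` attribute under the XML namespace URI.

```python
from xml.etree import ElementTree as ET

# ElementTree stores xml:space under the predefined XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def iter_with_space(elem, preserve=False):
    """Hypothetical helper: yield (element, preserve) in document order."""
    value = elem.get(XML_SPACE)
    if value == "preserve":
        preserve = True
    elif value == "default":
        preserve = False  # an explicit override on a descendant
    yield elem, preserve
    for child in elem:
        # Children inherit the setting unless they override it.
        yield from iter_with_space(child, preserve)


root = ET.fromstring(
    '<a><b xml:space="preserve"><c/>'
    '<d xml:space="default"><e/></d></b><f/></a>'
)
for elem, preserve in iter_with_space(root):
    print(elem.tag, preserve)
```

A pretty printer would add newlines and indentation only for elements where `preserve` is false.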
In HTML there are tags that must preserve whitespace, e.g. `pre`, `script`, `style`. The rules here might be complex, so I need to check what minidom, html tidy, lxml and BeautifulSoup do. Hopefully it will be as simple as keeping a list of tags whose content is printed as is - this is already done in ElementTree for the `script` and `style` tags.
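The existing special-casing can be observed through the public API: with `method="html"`, the text of `script` is written as is, while the text of other elements is escaped. `VERBATIM_TAGS` and `keeps_content_verbatim` are hypothetical names sketching the "list of tags" idea; they are not ElementTree API.

```python
from xml.etree import ElementTree as ET

# Hypothetical list of tags whose content a pretty printer would not touch.
VERBATIM_TAGS = {"pre", "script", "style"}


def keeps_content_verbatim(elem):
    """Hypothetical check a pretty printer could run per element."""
    return isinstance(elem.tag, str) and elem.tag.lower() in VERBATIM_TAGS


script = ET.Element("script")
script.text = "if (a < b) { run(); }"
para = ET.Element("p")
para.text = "a < b"

# Existing behaviour: script text is not escaped, paragraph text is.
script_html = ET.tostring(script, encoding="unicode", method="html")
para_html = ET.tostring(para, encoding="unicode", method="html")
print(script_html)
print(para_html)
print(keeps_content_verbatim(script), keeps_content_verbatim(para))
```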
Serhiy proposed a patch to speed up the serialization in bpo-25881. Need to review it and make further speedups.
What is `Modules/_elementtree.c`?
There is a builtin module xml.etree that is very nice, but doesn't have pretty printing. So people use either minidom or even lxml.
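For reference, the usual minidom workaround looks roughly like this: the tree is serialized with ElementTree, parsed again by minidom and only then pretty printed. The sample document is made up for illustration.

```python
from xml.dom import minidom
from xml.etree import ElementTree as ET

root = ET.fromstring("<a><b>x</b><c/></a>")

# ElementTree alone produces a single line.
compact = ET.tostring(root, encoding="unicode")

# Workaround: re-parse the serialized string with minidom, then indent.
pretty = minidom.parseString(compact).toprettyxml(indent="  ")

print(compact)
print(pretty)
```

The extra parse is the cost of the workaround.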
It would be nice to have pretty printing in the `xml.etree.ElementTree.tostring` function. Add `pretty_print=True|False` as an argument.