XLSForm / pyxform

A Python package to create XForms for ODK Collect.
BSD 2-Clause "Simplified" License
75 stars 133 forks source link

150: don't use regex to clean up XML following pretty-printing #681

Closed lindsay-stevens closed 4 months ago

lindsay-stevens commented 4 months ago

Closes #150

Default xml.minidom pretty-printer treats each node in a mixed-content element as a separate node. As a result it adds newlines and indents for each text node and output element, which is ugly and wrong. Then survey.py cleans that up with some regex.

This change copies and customises the default writexml implementation from minidom, and skips newline/indent for text nodes. It also includes conditional whitespace to match the previous processing, in case Collect or Enketo expected it to be exactly that way.

Added a test to specifically enumerate various output/text mixture permutations that appear in a variety of existing tests.

Why is this the best possible solution? Were any other approaches considered?

I think not outputting the undesirable whitespace in the first place is better than cleaning it up afterwards. Copying writexml adds a bit of complexity to the code base but I think the trade off vs the regex is worth it. Being able to do this without a new dependency is preferable. It may also allow more flexibility around users inputting whitespace in their question labels, without pyxform mangling it.

What are the regression risks?

While the added test should provide coverage, it's possible that some combination of text and references doesn't produce exactly the same white-spaced output as before, and assuming Collect or Enketo are sensitive to that it might change the output. But that risk seems pretty small. I'm not sure that the whitespace is actually significant in the first place.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No

Before submitting this PR, please make sure you have: