elifesciences / elife-tools

Python library for parsing eLife article XML data.
MIT License

Comment tag fix #305

Closed gnott closed 5 years ago

gnott commented 5 years ago

While testing the conversion of a new XML format to JSON output for https://github.com/elifesciences/elife-crossref-feed/issues/123, I compared the input and output closely and noticed that the output contained some text that was originally inside XML comment tags.

The convention in this project has been that, when converting XML to HTML, comment tags and the content inside them are stripped out.

That seemed to work in the test case examples for table-wrap. The content that was not getting stripped out was in a <p> tag.

Following the parser calls, I tracked down what I think is the cause. When a BeautifulSoup tag is converted to a string (which preserves inline tagging, and is then further processed), a Comment tag yields its text but loses the surrounding comment start and end tags.
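The behaviour described above can be reproduced in a few lines. This is a minimal sketch, assuming the `html.parser` backend and a made-up `<p>` fragment (neither is taken from the project's own fixtures):

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical input: a paragraph containing an XML comment.
soup = BeautifulSoup("<p>Visible <!-- hidden note --> text</p>", "html.parser")
p_tag = soup.p

# Converting each child to a string individually: the Comment child
# yields only its text, without the <!-- and --> delimiters.
naive = "".join(str(child) for child in p_tag.children)

# Converting the parent tag as a whole keeps the delimiters.
whole = str(p_tag)
```

Here `naive` contains the words "hidden note" as ordinary text, while `whole` still wraps them in comment delimiters, which is the mismatch that lets comment text leak into the output.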

The intention of the node_contents_str() function in this project is to return the full string of a tag and its children, but it was not including the start and end tags of comments.

I expanded the simple one-line string join in node_contents_str() to check whether the tag being joined is a bs4.Comment object. If so, it adds back the comment tags; otherwise it concatenates the contents as it did before.
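As a rough illustration of that change, the expanded join might look something like the following. The function name `node_contents_str_sketch` is hypothetical and the body is an assumption about the approach, not the code in this pull request:

```python
from bs4 import BeautifulSoup, Comment


def node_contents_str_sketch(tag):
    """Return the inner content of a tag as a string, keeping comment delimiters.

    Hypothetical sketch; the real node_contents_str() may differ.
    """
    if tag is None:
        return None
    parts = []
    for child in tag.children:
        if isinstance(child, Comment):
            # str(child) drops the delimiters, so add them back explicitly.
            parts.append("<!--" + str(child) + "-->")
        else:
            # Text nodes and nested tags stringify with their markup intact.
            parts.append(str(child))
    return "".join(parts)


# Usage with a made-up paragraph containing inline tagging and a comment.
example_soup = BeautifulSoup(
    "<p>A <italic>b</italic><!-- c --> d</p>", "html.parser"
)
example_result = node_contents_str_sketch(example_soup.p)
```

With the delimiters restored, downstream code can tell comment content apart from paragraph text.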

Once paragraph rendering receives the full, correctly tagged content, the function that strips out comment tags works correctly, so the comments are not included in the HTML output.

I added a simple test case to cover the logic for paragraph rendering, which I think also applies to other content blocks that are converted in this way.
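To see why the stripping step depends on the delimiters being present, consider a regex-based remover. This is only a sketch: the helper name `strip_comments_sketch` and the regex approach are assumptions, since the project's actual stripping function is not shown here:

```python
import re


def strip_comments_sketch(markup):
    """Remove <!-- ... --> spans, including their content (hypothetical helper)."""
    return re.sub(r"<!--.*?-->", "", markup, flags=re.DOTALL)


# With delimiters present, the comment and its content are removed.
with_delimiters = strip_comments_sketch("Visible <!-- hidden note --> text")

# Without delimiters there is nothing to match, so the text leaks through.
without_delimiters = strip_comments_sketch("Visible  hidden note  text")
```

Any stripper that keys on the comment delimiters has the same limitation, so the fix belongs in the step that builds the string rather than in the step that strips it.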

gnott commented 5 years ago

Coverage decreased (-8.0e-05%) - oh no!! :)

coveralls commented 5 years ago


Coverage increased (+0.002%) to 99.559% when pulling 010cc7b0b5fd44c381d2bec78e7bdbfe4d6b2442 on comment-tag-fix into fb26f4156cfca73e448bcd431d2cbb396f2034a3 on develop.