elifesciences / decision-letter-parser

Parse docx file containing decision letter and author response content and produce output in other formats
MIT License

Bug parsing sample file Chi 44816.docx, extra bold tags #49

Closed gnott closed 4 years ago

gnott commented 4 years ago

As a test for video content parsing, I tried generating XML output from the sample file Chi 44816.docx. build.py raises an error caused by the XML tagging:

xml.etree.ElementTree.ParseError: mismatched tag: line 1, column 1249

It looks like pandoc JATS output for the author response is adding extra <bold> tags around paragraphs. I cannot find a quick way to fix it, even after editing the .docx file content to see why it is producing the odd output.
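For anyone trying to reproduce this class of error outside of build.py, here is a minimal sketch. The XML fragments are invented for illustration and are not the actual pandoc output from Chi 44816.docx; the helper name `parse_error` is also hypothetical. It only shows that a `<bold>` tag opened in one paragraph and closed in another makes `xml.etree.ElementTree` raise a mismatched tag `ParseError`.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <bold> opens in the first <p> and closes in the second,
# so the first </p> does not match the innermost open tag.
BAD_XML = "<body><p><bold>First paragraph</p><p>Second</bold></p></body>"

# The same content with <bold> properly nested inside a single paragraph.
GOOD_XML = "<body><p><bold>First paragraph</bold></p><p>Second</p></body>"


def parse_error(xml_string):
    """Return the ParseError message, or None if the string is well-formed."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(parse_error(BAD_XML))   # a "mismatched tag: ..." message
    print(parse_error(GOOD_XML))  # None
```

Running a check like this on the author response fragment alone may help narrow down which paragraph carries the stray tag.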

This may need to be checked again later. As far as I can tell, it is an issue with pandoc itself and the JATS output it produces.

gnott commented 4 years ago

Tested with pandoc version 2.7.3 in the Docker image instead of 2.7, and this bug does not appear in 2.7.3.

There is, however, a newline character issue in the tests when using more recent versions of pandoc. I think it would also be better to set --wrap=none on the docker call when parsing the .docx file, and then adjust the test fixtures accordingly to get the cleanest output.
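A rough sketch of what the pandoc arguments could look like with wrapping disabled. The function name `pandoc_jats_command` is hypothetical and this is not the library's actual code; how the arguments get passed to the Docker container is left out, since that depends on the image. Only the `--from`, `--to` and `--wrap` options are standard pandoc options.

```python
def pandoc_jats_command(docx_path, wrap="none"):
    """Build a pandoc argument list converting a .docx file to JATS XML.

    wrap="none" asks pandoc not to insert line breaks into the output,
    which avoids newline differences between pandoc versions.
    """
    return [
        "pandoc",
        "--from=docx",
        "--to=jats",
        f"--wrap={wrap}",
        docx_path,
    ]


if __name__ == "__main__":
    # The list can be handed to subprocess.run() or appended to a docker call.
    print(" ".join(pandoc_jats_command("sample.docx")))
```

With wrapping disabled, each paragraph should come out on a single line, so the test fixtures would need to be regenerated once to match.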

gnott commented 4 years ago

After changes were applied to support newer pandoc versions, the bug has re-appeared. I think it has something to do with the XML cleaning procedures in this library, and a particular test case can be extracted from this example .docx file.

gnott commented 4 years ago

Fixed by merging PR https://github.com/elifesciences/decision-letter-parser/pull/70.