Capitains / HookTest

Testing script for Hook
Mozilla Public License 2.0

Exception on invalid xml. #145

Open rillian opened 5 years ago

rillian commented 5 years ago

Some logging output got into my TEI files, and HookTest raises an unhandled exception rather than reporting the error:

  File "${HOME}/HookTest/HookTest/capitains_units/cts.py", line 434, in auto_rng
    xml = parse(self.path)
  File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1840, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1770, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "tests/repo1/data/hafez/divan/hafez.divan.perseus-eng1.xml", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

To reproduce, prepend the string 'Garbage text\n' to a file such as tests/repo1/data/hafez/divan/hafez.divan.perseus-eng1.xml.

The XMLSyntaxError is hidden by the imap_unordered call through the threadpool and presents instead as a MaybeEncodingError because lxml.etree can't pickle its _ListErrorLog. Flattening the parallel iterator to a serial one reveals the underlying issue.
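
One possible way around the pickling problem, sketched under assumptions rather than taken from the HookTest code: catch the parse error inside the worker and send back only plain strings, which always pickle. The names `safe_parse` and `check_all` are hypothetical, and the fallback to the standard-library parser is there only so the sketch runs without lxml installed.

```python
import multiprocessing

try:
    from lxml import etree
    XMLParseError = etree.XMLSyntaxError
except ImportError:
    # Fallback for this sketch only: the stdlib parser raises ParseError.
    import xml.etree.ElementTree as etree
    XMLParseError = etree.ParseError


def safe_parse(path):
    """Parse one file and return a picklable (path, error) tuple.

    error is None on success, or the parser's message as a plain string.
    """
    try:
        etree.parse(path)
    except XMLParseError as exc:
        # Convert to str inside the worker: the exception object itself
        # may carry state that cannot be pickled back to the parent.
        return path, str(exc)
    return path, None


def check_all(paths, processes=2):
    """Run safe_parse over paths in a process pool and collect failures."""
    with multiprocessing.Pool(processes) as pool:
        results = pool.imap_unordered(safe_parse, paths)
        # Consume the iterator before the pool is torn down.
        return {path: error for path, error in results if error is not None}
```

Calling `check_all(list_of_xml_paths)` would then return a dict mapping each unparsable file to its error message, so the caller can report the failures instead of crashing on the first one.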

rillian commented 5 years ago

The problem occurs with XML parsing failures in general, e.g. the unrecognized &sect; entity on line 776 of tlg0004.tlg001.perseus-eng1.xml from canonical-greekLit.

PonteIneptique commented 5 years ago

Yes, this seems like something that would need work. The split between XML parsing and Capitains parsing has been in the codebase for a long time. Feel free to propose a fix, including by creating a new exception :)
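
As a rough illustration of what such a new exception might look like (the class name `InvalidXMLError` and its fields are hypothetical, not part of HookTest): it keeps only strings, so it survives the pickle round trip that a process pool performs. A `ValueError` stands in for the parser error here to keep the sketch dependency-free.

```python
import pickle


class InvalidXMLError(Exception):
    """Hypothetical exception for files that fail XML parsing."""

    def __init__(self, path, message):
        # Passing both values to Exception.__init__ stores them in self.args,
        # which pickle uses to rebuild the exception in another process.
        super().__init__(path, message)
        self.path = path
        self.message = message

    def __str__(self):
        return f"{self.path}: {self.message}"

    @classmethod
    def from_exception(cls, path, exc):
        """Keep only the text of the parser error, not its attached objects."""
        return cls(path, str(exc))


# ValueError is a stand-in for the parser's own exception type.
err = InvalidXMLError.from_exception("bad.xml", ValueError("Start tag expected"))
restored = pickle.loads(pickle.dumps(err))
print(restored)
```

The design choice is to copy the message out as text at the point of failure, on the assumption that the original exception's attached error log is what breaks pickling.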