A few files from our test set cause fatal aborts (SIGABRT), which can't be caught in Python. Details follow.
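Since the abort can't be handled in-process, one possible workaround is to run each parse in a child interpreter so that only the child dies. The following is a minimal sketch, not something the library provides: `run_isolated` is a hypothetical helper, `os.abort()` stands in for a parse call that aborts, and the negative return code convention assumes a POSIX system.

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 60.0) -> tuple[bool, str]:
    """Run a Python snippet in a child interpreter (hypothetical helper).

    Returns (ok, detail). A fatal signal such as SIGABRT kills only the
    child; the parent process keeps running and can report the failure.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return True, proc.stdout
    if proc.returncode < 0:
        # On POSIX, a return code of -N means the child was killed by signal N.
        return False, f"killed by {signal.Signals(-proc.returncode).name}"
    return False, f"exit code {proc.returncode}"


# Stand-in for a parse call that aborts: os.abort() raises SIGABRT in the child.
ok, detail = run_isolated("import os; os.abort()")
print(ok, detail)
```

This costs one interpreter start-up per call, so it is probably only worth it for files that are already known to be risky.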
Parsing PDFs page by page is very fast at first, but each additional page in a document takes successively longer, and the output for page N also contains the content of pages 0 to N-1. Memory usage grows with every page as well.
The parser_v2 should accept the loglevel directly at construction time (in __init__), so it doesn't emit logs before user code can call parser.set_loglevel to silence them.
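To illustrate the shape of the request, here is a rough sketch in pure Python. `ParserSketch` and its `loglevel` keyword are hypothetical and only mimic the idea; the real parser_v2 constructor and its accepted level names may differ.

```python
import logging


class ParserSketch:
    """Hypothetical stand-in for parser_v2; not the real class."""

    def __init__(self, loglevel: str = "error") -> None:
        self._log = logging.getLogger("parser_sketch")
        # Apply the level first, before any other setup work can log.
        self.set_loglevel(loglevel)
        self._log.info("parser initialised")  # suppressed at "error"

    def set_loglevel(self, loglevel: str) -> None:
        # Map a name such as "error" to the matching logging constant.
        self._log.setLevel(getattr(logging, loglevel.upper()))


# The level is in effect from the very first line the parser could log.
parser = ParserSketch(loglevel="error")
```

With this shape, set_loglevel stays available for later changes, but nothing is logged at an unwanted level during construction.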
Some observations from testing: