bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

Bad redirection of kenlm stderr #50

Closed: Proyag closed this issue 4 years ago

Proyag commented 4 years ago

https://github.com/bitextor/pdf-extract/blob/b24fc2df0e7b3e11a1d08cecd7560fe6b83ef2f8/src/pdfextract/SentenceJoin.java#L92 redirects sentence-join subprocess stderr to stdout.

The sentence-join stderr receives messages such as

This binary file contains trie with quantization and array-compressed pointers.

from kenlm.
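The merge described above can be reproduced in isolation. The following is a minimal sketch, assuming the redirection is done with `ProcessBuilder.redirectErrorStream(true)` (not confirmed here; the linked line is the authority). The class `MergedStderrSketch`, the method `runMerged`, and the `sh` command standing in for sentence-join are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of pdf-extract.
class MergedStderrSketch {

    // Starts a command with stderr merged into stdout and returns every line read.
    static List<String> runMerged(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // stderr now shares the stdout pipe
        Process p = pb.start();

        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        p.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for sentence-join: an informational banner on stderr,
        // followed by a real result on stdout.
        List<String> lines = runMerged(List.of("sh", "-c",
                "echo 'This binary file contains trie with quantization "
                + "and array-compressed pointers.' 1>&2; echo 'joined sentence'"));
        for (String line : lines) {
            System.out.println("read: " + line);
        }
    }
}
```

With the streams merged, the reader cannot tell the informational banner apart from genuine output, so any code that treats unexpected lines as a startup failure will trip on it.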

https://github.com/bitextor/pdf-extract/blob/b24fc2df0e7b3e11a1d08cecd7560fe6b83ef2f8/src/pdfextract/SentenceJoin.java#L103-L107 interprets these outputs as errors starting sentence-join and throws an exception that looks like:

java.lang.Exception: This binary file contains This binary file contains trie with quantization and array-compressed pointerstrie with quantization and array-compressed pointers..

    at pdfextract.SentenceJoin.start(SentenceJoin.java:106)
    at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1579)
    at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1083)
    at pdfextract.PDFExtract.Extract(PDFExtract.java:384)

ramoelee commented 4 years ago

Hi @Proyag, could you please provide the PDF file that triggered the above error?

Thank you.

Proyag commented 4 years ago

I can't reproduce this deterministically, so my diagnosis was probably over-simplified, but this pops up when processing large numbers of files, for example while running many instances of bitextor-warc2htmlwarc.py in parallel.

For now I can work around this issue by redirecting kenlm error output to /dev/null here. I'll produce a reproducible example if I run into it again.
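For reference, one way such a workaround could be done on the Java side is to keep stderr separate and discard it, so only stdout reaches the reader. This is a sketch under assumptions, not the change made in SentenceJoin.java: the class `StderrDiscardSketch`, the method `runDiscardingStderr`, and the `sh` command are hypothetical, and `ProcessBuilder.Redirect.DISCARD` requires Java 9 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of pdf-extract.
class StderrDiscardSketch {

    // Runs a command, throws away its stderr, and returns its stdout lines.
    static List<String> runDiscardingStderr(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(false);                     // keep the streams separate
        pb.redirectError(ProcessBuilder.Redirect.DISCARD); // Java 9+: drop stderr
        Process p = pb.start();

        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        p.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for sentence-join: the banner goes to stderr and is dropped,
        // only the stdout line is returned.
        List<String> lines = runDiscardingStderr(List.of("sh", "-c",
                "echo 'kenlm banner' 1>&2; echo 'joined sentence'"));
        System.out.println(lines);
    }
}
```

Discarding stderr also hides genuine startup errors, so a variant that sends stderr to a log file with `ProcessBuilder.Redirect.appendTo(File)` may be preferable where diagnostics matter.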