Closed by Proyag 4 years ago
Hi @Proyag, could you please provide the PDF file that triggered the above error?
Thank you.
I can't reproduce this deterministically, so my diagnosis was probably over-simplified, but this pops up when processing large numbers of files, for example while running many instances of bitextor-warc2htmlwarc.py in parallel.
I can work around this issue for now by redirecting kenlm error output to /dev/null here - I'll produce a reproducible example if I run into it again.
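For anyone hitting the same thing, here is a minimal sketch of what discarding the subprocess's stderr could look like on the Java side. It is not the actual pdf-extract code: the class name `StderrRedirectSketch`, the helper `run`, and the `sh -c` command with its placeholder messages are all made up for illustration, and it assumes Java 9+ (for `ProcessBuilder.Redirect.DISCARD`) and a POSIX `sh` on the path. It only contrasts merging stderr into stdout with dropping stderr entirely.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the real SentenceJoin.java.
public class StderrRedirectSketch {

    /** Runs a command and returns every line read from its stdout pipe. */
    static List<String> run(boolean mergeStderr, String... command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        if (mergeStderr) {
            // stderr is folded into stdout, so diagnostics reach the reader below
            pb.redirectErrorStream(true);
        } else {
            // stderr is thrown away (equivalent to 2>/dev/null); requires Java 9+
            pb.redirectError(ProcessBuilder.Redirect.DISCARD);
        }
        Process p = pb.start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }
        p.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder command: one diagnostic line on stderr, one result line on stdout.
        String[] cmd = {"sh", "-c", "echo 'diagnostic message' 1>&2; echo 'joined sentence'"};
        System.out.println("merged:    " + run(true, cmd));
        System.out.println("discarded: " + run(false, cmd));
    }
}
```

With stderr merged, any diagnostic the subprocess prints ends up in the same stream the caller parses, which is how a harmless message can be mistaken for a startup failure. Discarding stderr avoids that, at the cost of hiding genuine errors, so it is only a stopgap.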
https://github.com/bitextor/pdf-extract/blob/b24fc2df0e7b3e11a1d08cecd7560fe6b83ef2f8/src/pdfextract/SentenceJoin.java#L92 redirects sentence-join subprocess stderr to stdout.
sentence-join stderr gets stuff like
from kenlm.
https://github.com/bitextor/pdf-extract/blob/b24fc2df0e7b3e11a1d08cecd7560fe6b83ef2f8/src/pdfextract/SentenceJoin.java#L103-L107 interprets these outputs as errors starting sentence-join and throws an exception that looks like: