Status: closed by zuny26 3 years ago
Hi @zuny26, please update the source code and reinstall PDFExtract.jar to resolve the issue.
Thanks
Hi @ramoelee, thank you, the issue is now solved for the batch file use case. However, the issue still persists when this function is used: https://github.com/bitextor/pdf-extract/blob/425afb6fe2d42fa908240ae8784c631aac555e0e/src/pdfextract/PDFExtract.java#L356 It would be nice to fix this function as well, because it is what we use for our Python wrapper and our C++ wrapper (currently in development).
Hi @zuny26, please update the source code and reinstall PDFExtract.jar to resolve the issue.
Thanks
Yes, it seems to be working now, thank you! The only remaining problem I see is that when verbose mode is activated, PDFExtract prints a lot of lines that just say "null".
Hi @zuny26, please update the source code and reinstall PDFExtract.jar to resolve the issue.
Thanks
When using PDFExtract-2.0.jar with the `-B` option to process a list of files, the sentence join model is only applied to the first file. After that, PDFExtract writes an error message to stdout for each line that is passed to sentence join (in the message, `...` stands for the content of the line).

Processing the same files separately works fine, so it looks like the sentence join process is closed after finishing with the first file.
My config file is: pdfextract.json.txt
Sentence join models were downloaded from http://data.statmt.org/paracrawl/sentence-join/
The PDFs that I tested with are: one and two
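Since processing the files one at a time works, a possible interim workaround is to start a separate process for each file instead of using batch mode. The following is only a minimal sketch of that idea: `run_separately` and `placeholder_command` are hypothetical names, and `placeholder_command` does not call PDFExtract at all. It launches a trivial Python check so the sketch runs anywhere, and would need to be replaced with the actual PDFExtract command line for a single file.

```python
import subprocess
import sys
from typing import Callable, Iterable, List


def run_separately(
    files: Iterable[str],
    build_command: Callable[[str], List[str]],
) -> List[str]:
    """Run one process per file and return the files whose process failed."""
    failed = []
    for path in files:
        # A fresh process per file, so no helper process is shared between files.
        result = subprocess.run(build_command(path))
        if result.returncode != 0:
            failed.append(path)
    return failed


def placeholder_command(path: str) -> List[str]:
    """Hypothetical stand-in: replace with the real single-file PDFExtract call.

    It only checks the file extension so that the sketch is runnable as-is.
    """
    check = "import sys; sys.exit(0 if sys.argv[1].endswith('.pdf') else 1)"
    return [sys.executable, "-c", check, path]
```

Calling `run_separately(paths, placeholder_command)` returns the list of inputs whose process exited with a non-zero status. This trades startup cost (the JVM and the sentence join model are loaded once per file) for isolation between files.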