bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

Sentence join fails when using a batch file #57

Closed: zuny26 closed this issue 3 years ago

zuny26 commented 3 years ago

When PDFExtract-2.0.jar is run with the -B option to process a list of files, the sentence join model is only applied to the first file. After that, PDFExtract writes the following error message to stdout for each line that is passed to sentence join:

execute sentence join [es] failed. ... ,Stream closed 

(where ... is the content of the line)

Processing the same files separately works fine, so it looks like the sentence join process is closed after the first file is finished.
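
The suspected failure mode can be reproduced in isolation. What follows is a minimal Java sketch, not PDFExtract's actual code: the class `BatchStreamSketch`, the helper `openJoinerPipe`, and the file names are hypothetical, and a `BufferedWriter` over a `StringWriter` stands in for the pipe to the external sentence join process (a real implementation would presumably wrap `Process.getOutputStream()`). It only illustrates how closing a shared stream after the first file makes every later write fail with "Stream closed", and how closing it once after the whole batch avoids that.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchStreamSketch {

    // Hypothetical stand-in for the pipe to the sentence join process.
    static BufferedWriter openJoinerPipe() {
        return new BufferedWriter(new StringWriter());
    }

    // Suspected buggy pattern: one shared pipe, closed after each file.
    static List<String> processClosingPerFile(List<String> files) {
        List<String> log = new ArrayList<>();
        BufferedWriter pipe = openJoinerPipe();
        for (String file : files) {
            try {
                pipe.write("line from " + file);
                pipe.newLine();
                pipe.close(); // the shared pipe is gone after the first file
                log.add(file + ": ok");
            } catch (IOException e) {
                // BufferedWriter reports "Stream closed" when written to after close()
                log.add(file + ": failed, " + e.getMessage());
            }
        }
        return log;
    }

    // Alternative: keep the pipe open for the whole batch, close it once at the end.
    static List<String> processKeepingOpen(List<String> files) {
        List<String> log = new ArrayList<>();
        try (BufferedWriter pipe = openJoinerPipe()) {
            for (String file : files) {
                pipe.write("line from " + file);
                pipe.newLine();
                pipe.flush(); // push this file's lines out without closing the pipe
                log.add(file + ": ok");
            }
        } catch (IOException e) {
            log.add("batch failed, " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("one.pdf", "two.pdf"); // hypothetical names
        System.out.println(processClosingPerFile(files));
        System.out.println(processKeepingOpen(files));
    }
}
```

In the first variant only the first file is handled and the second one fails with "Stream closed", which matches the symptom described above. Whether PDFExtract closes the stream in this way is an assumption based on the error message alone.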

My config file is: pdfextract.json.txt

The sentence join models were downloaded from http://data.statmt.org/paracrawl/sentence-join/

The PDFs that I tested with are: one and two

ramoelee commented 3 years ago

Hi @zuny26, please update the source code and reinstall PDFExtract.jar to resolve the issue.

Thanks

zuny26 commented 3 years ago

Hi @ramoelee, thank you, the issue is now solved for the batch file use case. However, if this function is used: https://github.com/bitextor/pdf-extract/blob/425afb6fe2d42fa908240ae8784c631aac555e0e/src/pdfextract/PDFExtract.java#L356 the issue still persists. It would be nice to fix this function as well, because it is what we use for our Python wrapper and our C++ wrapper (currently in development).

ramoelee commented 3 years ago

Hi @zuny26, please update the source code and reinstall PDFExtract.jar to resolve the issue.

Thanks

zuny26 commented 3 years ago

Yes, it seems to be working now, thank you! The only remaining problem I see is that when verbose mode is activated, PDFExtract prints a lot of lines that just say "null".

ramoelee commented 3 years ago

Hi @zuny26, please update the source code and reinstall PDFExtract.jar to resolve the issue.

Thanks