lpla closed 4 years ago
Hi @lpla, as a workaround, please follow the instructions here. I will look for the root cause and fix it once I can reproduce the error message above.
Hi. I didn't use Bitextor for this example. I only ran this command with the PDF I mentioned:
java -jar target/PDFExtract-2.0.jar -I ~/forcada16j.pdf -O test
with the attached JSON config file (compressed given GitHub format restrictions) and the data I downloaded from statmt, as mentioned in the OP.
Hi, I still cannot reproduce it, even when using the command below:
java -jar PDFExtract-2.0.jar -I "/home/ramoslee/work/pdfExtract/testing/forcada16j.pdf" -O test
with the attached JSON config file; the attached result was returned.
The result you attached is not HTML, as I was getting; it is plain text. Also, as mentioned, I am using the penultimate master commit, checked out with git checkout 56f327a26e6b1bf4ad137d2c4c86c6e0c5402448
You may use the command below to get the HTML result:
-O <output_file> specifies the path to the output HTML file after extraction
java -jar PDFExtract-2.0.jar -I "/home/ramoslee/work/pdfExtract/testing/forcada16j.pdf" -O test.html
Result: html_result.zip
Sorry, I was opening your output file with an editor that actually interpreted the HTML content. You were right.
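For anyone hitting the same confusion: one way to check what an output file really contains, regardless of how an editor renders it, is to inspect its raw bytes. The snippet below is only a minimal sketch. `looks_like_html` is a hypothetical helper (not part of PDFExtract), and it assumes the output was written to `test.html` as in the command above.

```shell
# Hypothetical helper, not part of PDFExtract: succeeds if the file's first
# 1 KiB contains an <html> tag or an HTML doctype (case-insensitive).
looks_like_html() {
  head -c 1024 "$1" | grep -qiE '<!doctype html|<html'
}

# Usage (assumes the output path used in the command above):
#   looks_like_html test.html && echo "HTML" || echo "not HTML"
```

This only looks at the start of the file, so it is a quick sanity check rather than a validation of the whole document.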
Which version or commit of kenlm are you using?
I used the current version on git (commit 209ceb61df6a76d844af36c53293aa4695b85984). Could you try re-downloading setup.sh and running the command below to reinstall PDFExtract-2.0.jar?
sudo bash setup.sh
I was talking about the kenlm version, which is the only part that is not installed with setup.sh
Sorry, I tested with the current KenLM version and hit the same error as you. As a workaround:
I downloaded the sentence-join model from http://data.statmt.org/paracrawl/sentence-join/en/ and tried to run it with a simple PDF that I had working without this model (https://www.dlsi.ua.es//~mlf/docum/forcada16j.pdf) and the default config file (PDFExtract.json). Using the code at the commit before the #54 fix, I got this error:
With the #54 fix, only this non-specific warning was shown in the output: