bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

java.lang.Exception: This binary file contains trie with quantization and array-compressed pointers. #55

Closed lpla closed 4 years ago

lpla commented 4 years ago

I downloaded the sentence-join model from http://data.statmt.org/paracrawl/sentence-join/en/ and tried to run it with a simple PDF that I had already got working without this model (https://www.dlsi.ua.es//~mlf/docum/forcada16j.pdf) and the default config file (PDFExtract.json). Using the code at the commit before the #54 fix, I got this error:

java.lang.Exception: This binary file contains trie with quantization and array-compressed pointers.

        at pdfextract.SentenceJoin.start(SentenceJoin.java:110)
        at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1706)
        at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1130)
        at pdfextract.PDFExtract.Extract(PDFExtract.java:391)

With the #54 fix, only this non-specific warning was shown in the output:

<warnings>
<warning>
<method>sentenceJoin</method>
<details>
        <message><![CDATA[Fail loading model for language: en]]></message>
        <suggestion><![CDATA[Please verify the "sentencejoin_model" value of language {en} in configuration file.]]></suggestion>
</details>
</warning>
</warnings>

ramoelee commented 4 years ago

Hi @lpla, as a workaround, please follow the instructions here. I will find the root cause and fix it once I can reproduce the error message above.

lpla commented 4 years ago

Hi. I didn't use Bitextor for this example. I only ran this command with the PDF I mentioned: java -jar target/PDFExtract-2.0.jar -I ~/forcada16j.pdf -O test

with the attached JSON config file (compressed because of GitHub's format restrictions) and the data I downloaded from statmt, as mentioned in the OP.

PDFExtract.zip

ramoelee commented 4 years ago

Hi, I still cannot reproduce it, even when using the command below.

java -jar PDFExtract-2.0.jar -I "/home/ramoslee/work/pdfExtract/testing/forcada16j.pdf" -O test

with the attached JSON config file, and the attached result was returned.

pdfExtract.zip

lpla commented 4 years ago

The result you attached is not HTML, as I was getting; it is plain text. Also, as mentioned, I am using the penultimate master commit, checked out with git checkout 56f327a26e6b1bf4ad137d2c4c86c6e0c5402448

ramoelee commented 4 years ago

You may use the command below to get the HTML result:

-O <output_file> specifies the path to the output HTML file after extraction

java -jar PDFExtract-2.0.jar -I "/home/ramoslee/work/pdfExtract/testing/forcada16j.pdf" -O test.html

Result: html_result.zip

lpla commented 4 years ago

Sorry, I was opening your output file with an editor that actually interpreted the HTML content. You were right.

Which version or commit of kenlm are you using?

ramoelee commented 4 years ago

I used the current version on git (commit 209ceb61df6a76d844af36c53293aa4695b85984). Could you try re-downloading setup.sh and running the command below to reinstall PDFExtract-2.0.jar?

sudo bash setup.sh

lpla commented 4 years ago

I was asking about the kenlm version, which is the only part that is not installed by setup.sh.

ramoelee commented 4 years ago

Sorry, I tested with the current version of KenLM and hit the same error as you. As a workaround:

  1. You may use Moses instead of KenLM.
  2. Or you may use KenLM, but you have to redirect stderr to "/dev/null" in sentence-join.py, as in the attachment below: sentence-join.zip
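
For anyone who cannot open the attachment, the second workaround can be sketched roughly as follows. This is only an illustration, assuming sentence-join.py launches the language-model tool as a child process; the helper name run_quietly and the stand-in command are hypothetical, and the actual change in sentence-join.zip may differ.

```python
import subprocess
import sys


def run_quietly(cmd, input_text):
    """Run cmd, feed it input_text on stdin, and return its stdout.

    stderr is redirected to the null device (/dev/null on Linux), so
    anything the tool prints there never reaches the caller.
    """
    completed = subprocess.run(
        cmd,
        input=input_text,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # discard diagnostics written to stderr
        universal_newlines=True,    # exchange str instead of bytes
        check=True,                 # raise if the tool exits with an error
    )
    return completed.stdout


if __name__ == "__main__":
    # Hypothetical stand-in for the real tool: it copies stdin to stdout
    # and writes one diagnostic line to stderr.
    noisy_tool = [
        sys.executable, "-c",
        "import sys; sys.stderr.write('loading model\\n'); "
        "sys.stdout.write(sys.stdin.read())",
    ]
    print(run_quietly(noisy_tool, "hello\n"), end="")
```

Note that discarding stderr only hides the messages the tool writes there; it also hides genuine errors, so it is best treated as a temporary measure.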