bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

Show warning if "sentencejoin_model" path or a used file is missing #54

Closed lpla closed 4 years ago

lpla commented 4 years ago

If you provide valid paths for "sentence_join" and "kenlm_path" in PDFExtract.json or as arguments, like:

"script" : {
        "sentence_join" : "/home/lpla/sentence-join/sentence-join.py",
        "kenlm_path" : "/home/lpla/kenlm/bin"
},

But an invalid path for "sentencejoin_model", like:

"name" : "en",
"config" : {
        "sentencejoin_model" : "/home/usr/models/toy-model",
        "join_words" : [
        ],
        "absolute_eof" : [
        ],
        "normalize" : [
        ],
        "repair" : [
        ]
}

(I don't have any 'usr' user in /home/)

KenLM crashes because nothing checks that the '*.forward.binlm' file exists, with this error:

java.lang.Exception: /home/lpla/kenlm/util/file.cc:76 in int util::OpenReadOrThrow(const char*) threw ErrnoException because `-1 == (ret = open(name, 00))'.
No such file or directory while opening /home/usr/models/toy-model.forward.binlm

    at pdfextract.SentenceJoin.start(SentenceJoin.java:110)
    at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1706)
    at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1130)
    at pdfextract.PDFExtract.Extract(PDFExtract.java:285)
    at Main.main(Main.java:81)
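
A check along these lines, run before KenLM is started, could turn the crash into a readable message. This is only a sketch in Python, not the actual PDFExtract or sentence-join code: the function name `missing_model_files` and the `MODEL_SUFFIXES` constant are hypothetical, and the only suffix assumed is the `.forward.binlm` one visible in the error above.

```python
import os

# Suffix taken from the error message above. Any other model files the
# tool needs would have to be added here (assumption: the list can grow).
MODEL_SUFFIXES = (".forward.binlm",)


def missing_model_files(model_prefix, suffixes=MODEL_SUFFIXES):
    """Return the expected model files that do not exist for this prefix."""
    candidates = [model_prefix + suffix for suffix in suffixes]
    return [path for path in candidates if not os.path.isfile(path)]
```

With the configuration above, `missing_model_files("/home/usr/models/toy-model")` would list the `.forward.binlm` path if it is absent, so the caller can report it instead of starting KenLM.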

Furthermore, if the KenLM path (the "kenlm_path" setting shown in the first example above) is also invalid, a different error is thrown:

java.lang.Exception: Traceback (most recent call last):
  File "/home/lpla/sentence-join/sentence-join.py", line 231, in <module>
    kenlm_forward = KenLM([kenlm_query,"-b","-n",args.model + ".forward.binlm"])
  File "/home/lpla/sentence-join/sentence-join.py", line 29, in __init__
    self.proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/home/usr/kenlm/bin/query': '/home/usr/kenlm/bin/query'

    at pdfextract.SentenceJoin.start(SentenceJoin.java:110)
    at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1706)
    at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1130)
    at pdfextract.PDFExtract.Extract(PDFExtract.java:285)
    at Main.main(Main.java:81)

So another check should be made before calling the KenLM tools.
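
As a rough illustration of that second check, the sketch below verifies that the `query` binary exists under the configured KenLM directory before anything tries to launch it. The function name `find_kenlm_query` is hypothetical. The binary name `query` is taken from the traceback above.

```python
import os


def find_kenlm_query(kenlm_path):
    """Return the path to the KenLM `query` binary, or None if unusable."""
    query = os.path.join(kenlm_path, "query")
    # The binary must exist as a regular file and be executable.
    if os.path.isfile(query) and os.access(query, os.X_OK):
        return query
    return None
```

A caller could skip sentence joining and emit a warning when this returns None, rather than letting `subprocess.Popen` raise `FileNotFoundError`.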

However, if the "sentence_join" setting is invalid, neither of these errors appears. Instead, a warning is written to the output file:

<warnings>
<warning>
<method>sentenceJoin</method>
<detail><![CDATA[No model for language: en]]></detail>
</warning>
</warnings>

This warning should also be shown in the other cases, with a more accurate detail explaining why sentenceJoin is not running.
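
To make the request concrete, here is a minimal sketch of how such a warning could be built with a specific reason in the detail. It only mirrors the XML layout shown above. The function name `sentence_join_warning` and the example message wording are assumptions, not what PDFExtract actually emits.

```python
def sentence_join_warning(detail):
    """Build a <warnings> block in the layout shown above."""
    # "]]>" would end the CDATA section early, so split it across two sections.
    safe = detail.replace("]]>", "]]]]><![CDATA[>")
    return (
        "<warnings>\n"
        "<warning>\n"
        "<method>sentenceJoin</method>\n"
        f"<detail><![CDATA[{safe}]]></detail>\n"
        "</warning>\n"
        "</warnings>"
    )
```

For instance, `sentence_join_warning("Model file not found: /home/usr/models/toy-model.forward.binlm")` would name the missing file, which is more useful than "No model for language: en".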

ramoelee commented 4 years ago

Hi @lpla, the source code has been modified to show a more accurate warning message. Please reinstall PDFExtract.