bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

Branch poppler-rewrite marks all sentences as lang="en" if protobuf is not found #26

Closed lpla closed 4 years ago

lpla commented 4 years ago

After the fixes for #22 and #25, if I run the uploaded runnable-jar/PDFExtract.jar from the poppler-rewrite branch on a 'modern' OS (>16.04), I no longer get an error or warning as in #22. Instead, all sentences are marked as lang="en".

For example, using the same example PDF shown in #25:

java -jar runnable-jar/PDFExtract.jar -I ~/forcada16j.pdf -O test

This is the output on a 16.04 machine with libprotobuf.so.9:

<html>
<head>
<defaultLang abbr="en" />
<languages>
<language abbr="en" percent="96.062996" />
<language abbr="fy" percent="1.5748031" />
<language abbr="la" percent="0.78740156" />
<language abbr="da" percent="0.78740156" />
<language abbr="ca" percent="0.78740156" />
</languages>
</head>
<body>
<div id="page1" class="page">
<p id="page1p1" lang="en" fontname="LGPJEB+NimbusRomNo9L-Regu">
Baltic J. Modern Computing, Vol. 4 (2016), No. 2, pp. 152-164
</p>
...
<p id="page1p4" lang="ca" fontname="LGPJEB+NimbusRomNo9L-Regu">
Departament de Llenguatges i Sistemes Inform`atics, Universitat d'Alacant, E-03071 Alacant, Spain {mlf,mespla,japerez}@ua.es
</p>
....

And this is the output on a non-16.04 machine:

<html>
<head>
<defaultLang abbr="en" />
<languages>
</languages>
</head>
<body>
<div id="page1" class="page">
<p id="page1p1" lang="en" fontname="LGPJEB+NimbusRomNo9L-Regu">
Baltic J. Modern Computing, Vol. 4 (2016), No. 2, pp. 152-164
</p>
...
<p id="page1p4" lang="en" fontname="LGPJEB+NimbusRomNo9L-Regu">
Departament de Llenguatges i Sistemes Inform`atics, Universitat d'Alacant, E-03071 Alacant, Spain {mlf,mespla,japerez}@ua.es
</p>
...

I think a warning should be shown. Alternatively, sentences should not be marked as English at all when language detection is not running because of dependency issues or runtime errors.
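A minimal Java sketch of what that could look like, assuming language detection can be wrapped behind a plain function. Every name here (SafeLanguageTagger, detect, openParagraph) is hypothetical and not taken from the PDFExtract code base. The idea is to warn once when detection fails and then omit the lang attribute instead of defaulting to "en".

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.logging.Logger;

/** Hypothetical wrapper; none of these names come from PDFExtract itself. */
public class SafeLanguageTagger {
    private static final Logger LOG =
            Logger.getLogger(SafeLanguageTagger.class.getName());

    // Assumed shape of the detector: text in, language code out; may throw.
    private final Function<String, String> detector;
    private boolean warned = false;

    public SafeLanguageTagger(Function<String, String> detector) {
        this.detector = detector;
    }

    /** Returns the detected language, or empty if detection is unavailable. */
    public Optional<String> detect(String text) {
        try {
            return Optional.ofNullable(detector.apply(text));
        } catch (RuntimeException | LinkageError e) {
            // LinkageError covers UnsatisfiedLinkError, which the JVM raises
            // when a native dependency cannot be loaded.
            if (!warned) {
                LOG.warning("Language detection unavailable (" + e
                        + "); lang attributes will be omitted");
                warned = true;
            }
            return Optional.empty();
        }
    }

    /** Builds the opening <p> tag, leaving out lang when detection failed. */
    public String openParagraph(String id, String text) {
        String lang = detect(text)
                .map(code -> " lang=\"" + code + "\"")
                .orElse("");
        return "<p id=\"" + id + "\"" + lang + ">";
    }
}
```

With a working detector the tag keeps its lang attribute. With a detector that throws, the output has no lang attribute and a single warning is logged, so a missing dependency is visible rather than silently reported as English.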

dionwiggins commented 4 years ago

What is a non-16.04 machine? Can you define a little more? The docs state that it supports Ubuntu >= 16.04 or CentOS >= 7 or Debian >= 9.

lpla commented 4 years ago

My "non-16.04" machine runs Ubuntu 18.04, but this is reproducible on any machine without libprotobuf.so.9.
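To check whether a given machine is affected, something like the following sketch could be used. It only scans a list of directories for a file named libprotobuf.so.9 (or a versioned variant such as libprotobuf.so.9.0.1). The class name NativeLibCheck and the directory list are assumptions for illustration; the real dynamic loader also consults LD_LIBRARY_PATH and the ldconfig cache, which this sketch ignores.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper; not part of PDFExtract. */
public class NativeLibCheck {

    /** True if any directory holds libName itself or a versioned libName.* file. */
    public static boolean isPresent(String libName, List<Path> dirs) {
        for (Path dir : dirs) {
            if (!Files.isDirectory(dir)) {
                continue; // skip directories that do not exist on this system
            }
            try (Stream<Path> entries = Files.list(dir)) {
                boolean found = entries
                        .map(p -> p.getFileName().toString())
                        .anyMatch(n -> n.equals(libName) || n.startsWith(libName + "."));
                if (found) {
                    return true;
                }
            } catch (IOException | UncheckedIOException e) {
                // Unreadable directory: treat as "not found here" and keep looking.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed search locations; adjust for the distribution in use.
        List<Path> dirs = Arrays.asList(
                Paths.get("/usr/lib"),
                Paths.get("/usr/lib/x86_64-linux-gnu"),
                Paths.get("/usr/local/lib"));
        System.out.println("libprotobuf.so.9 present: "
                + isPresent("libprotobuf.so.9", dirs));
    }
}
```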

dionwiggins commented 4 years ago

This has been addressed for all compatible OSes. Closing.