bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0
Exception in thread "main" java.lang.UnsatisfiedLinkError: /tmp/native-forcld3-350533629840224/libforcld3.so: libprotobuf.so.9: cannot open shared object file: No such file or directory #22

Closed: lpla closed this issue 4 years ago

lpla commented 4 years ago

Error when running compiled .jar:

Exception in thread "main" java.lang.UnsatisfiedLinkError: /tmp/native-forcld3-350533629840224/libforcld3.so: libprotobuf.so.9: cannot open shared object file: No such file or directory

Source code is also needed for this.

lpla commented 4 years ago

After analyzing the cld3-java code (https://github.com/xondre09/cld3-Java) and the way it is used in the poppler-rewrite branch, it looks like the cld3 .so library bundled in the poppler-rewrite .jar (needed by cld3-java) was compiled against protobuf 2.6.1, the version shipped in the Ubuntu 16.04 apt packages. Computers without this exact library version installed cannot run the poppler-rewrite code without recompiling the cld3-java C++ dependency. Among the distros supported as of today, only Ubuntu 16.04 includes protobuf 2.6/libprotobuf9 in its packages.

That means that including a .jar in the repository is now completely impractical unless you use Ubuntu 16.04 and you have libprotobuf9 installed.
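
As a rough illustration of the failure described above, the following Java sketch pulls the name of the missing shared library out of an UnsatisfiedLinkError message so that a wrapper could print a clearer hint. The class name NativeDependencyCheck and its methods are hypothetical and not part of pdf-extract or cld3-java, and the regular expression assumes the Linux loader's "cannot open shared object file" wording.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of pdf-extract or cld3-java.
final class NativeDependencyCheck {
    // Assumption: the loader reports "<name>.so[.N]: cannot open shared object file".
    private static final Pattern MISSING_LIB =
            Pattern.compile("([^\\s:/]+\\.so(?:\\.\\d+)*): cannot open shared object file");

    // Returns the shared library named in the message, if the message has that shape.
    static Optional<String> missingLibrary(String message) {
        if (message == null) {
            return Optional.empty();
        }
        Matcher m = MISSING_LIB.matcher(message);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Loads a native library by absolute path and rethrows with a more direct hint.
    static void loadOrExplain(String absolutePath) {
        try {
            System.load(absolutePath);
        } catch (UnsatisfiedLinkError e) {
            String lib = missingLibrary(e.getMessage()).orElse("an unknown dependency");
            throw new IllegalStateException(
                    "Native library requires " + lib
                            + "; install it or rebuild the native code for this platform", e);
        }
    }

    public static void main(String[] args) {
        // Message copied from the error reported in this issue.
        String reported = "/tmp/native-forcld3-350533629840224/libforcld3.so: "
                + "libprotobuf.so.9: cannot open shared object file: No such file or directory";
        System.out.println(missingLibrary(reported).orElse("unknown"));
    }
}
```

This only makes the error easier to read; it does not remove the need to build the native library against the protobuf version available on each platform.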

We need an automated process to build pdf-extract .jar file without using a complete IDE like Eclipse (as 'explained' here https://github.com/bitextor/pdf-extract/blob/master/INSTALL.md).

I suggest Ant, Maven (mvn) or Gradle.

Otherwise, this code cannot be used in bitextor or python-pdfextract.

dionwiggins commented 4 years ago

Thanks, Leo, Mui is working through this. She found that it needs to be compiled per platform as noted and is preparing instructions and details. I will make sure she checks out the tools above. CLD3 is quite tricky to get working, so it may be worth packaging in a Docker container as a service at some point.

dionwiggins commented 4 years ago

A quick update: Mui is working on this. CLD3 is proving quite difficult to deal with. She also looked at CLD2 as an option, as the difference in accuracy is minimal. She did note that CLD3 is about 10 times faster than CLD2, which some may find of interest. Hopefully, she can resolve it in the next day or so.

mlforcada commented 4 years ago

Did you test accuracy in all 30 Paracrawl languages (EU 24 + 3 co-official in Spain + 2 Norwegian standards + Icelandic)? I seem to remember people in Alacant found that some language pairs are told apart much better by CLD3 than by CLD2.

lpla commented 4 years ago

Marta (@mbanon) is performing a comparison between some popular language detectors: https://github.com/paracrawl/pipeline_evaluation_data/tree/master/lang_ident

This test still lacks per-language precision and performance results, but regarding overall runtime performance, CLD2 is much faster than CLD3: https://github.com/paracrawl/pipeline_evaluation_data/blob/master/lang_ident/profiling.txt

@mlforcada is right. UAlacant specifically tested Galician precision in CLD2 vs CLD3 some months ago, and CLD3 detected Galician better. So CLD3 probably works better than CLD2 for smaller languages, at a performance cost.
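
Since the speed reports in this thread disagree, timing both detectors on the same machine and the same texts is one way to settle it. The sketch below is only an illustration under stated assumptions: DetectorTiming, nanosPerCall and the two placeholder lambdas are hypothetical, and real CLD2 or CLD3 bindings would have to be plugged in as Function<String, String> values.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical timing harness; the detectors used in main are placeholders.
final class DetectorTiming {
    // Average wall-clock nanoseconds per detector call over `rounds` passes of `texts`.
    static double nanosPerCall(Function<String, String> detector, List<String> texts, int rounds) {
        if (texts.isEmpty() || rounds <= 0) {
            throw new IllegalArgumentException("need at least one text and one round");
        }
        // One untimed warm-up pass so JIT compilation is less likely to dominate.
        for (String text : texts) {
            detector.apply(text);
        }
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String text : texts) {
                detector.apply(text);
            }
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / ((long) rounds * texts.size());
    }

    public static void main(String[] args) {
        List<String> texts = List.of("Bos días a todos", "Good morning everyone", "Buenos días a todos");
        // Placeholders standing in for real language detector bindings.
        Function<String, String> detectorA = text -> "und";
        Function<String, String> detectorB = text -> text.isEmpty() ? "und" : "xx";
        System.out.println("A ns/call: " + nanosPerCall(detectorA, texts, 1000));
        System.out.println("B ns/call: " + nanosPerCall(detectorB, texts, 1000));
    }
}
```

A simple loop like this gives only a coarse comparison; a dedicated benchmarking tool and a larger, representative corpus would be needed for figures worth publishing.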

dionwiggins commented 4 years ago

Hi,

My team informed me that CLD3 was many times faster than CLD2. I will validate this information and come back with more clarity when my team is back in the office.

dionwiggins commented 4 years ago

Mui has been testing the installer script on Fedora, Ubuntu, Red Hat and CentOS. It is now looking good and will be released in the next few days.

kpu commented 4 years ago

And the code can currently be found in branch. . .

dionwiggins commented 4 years ago

Resolved with the update yesterday

Now has a full installer that makes OS-specific adjustments.

https://github.com/bitextor/pdf-extract/issues/25