Closed lpla closed 4 years ago
After analyzing the cld3-java code (https://github.com/xondre09/cld3-Java) and the way it is used in the poppler-rewrite branch, it looks like the cld3 .so library included in the poppler-rewrite .jar (needed for cld3-java) is compiled against protobuf 2.6.1 (the version included in the apt packages in Ubuntu 16.04). Computers without this exact library version installed won't be able to run the poppler-rewrite code without recompiling the cld3-java C++ dependency. Among the distros supported as of today, only Ubuntu 16.04 includes protobuf 2.6/libprotobuf9 in its packages.
That means that including a .jar in the repository is now completely impractical unless you use Ubuntu 16.04 and have libprotobuf9 installed.
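To make this failure easier to diagnose, the missing shared object can be pulled out of the UnsatisfiedLinkError message. The following is only a sketch: the class and method names (LinkErrorHint, missingLibrary) are hypothetical and not part of pdf-extract or cld3-java, and the regular expression assumes the Linux dynamic-loader wording seen in this issue's error message.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of pdf-extract or cld3-java.
final class LinkErrorHint {

    // Assumes the Linux loader wording "<name>.so[.N]: cannot open shared object file".
    private static final Pattern MISSING_SO = Pattern.compile(
            "([^\\s:/]+\\.so(?:\\.[0-9]+)*): cannot open shared object file");

    // Returns the name of the shared object the loader could not find, if the message names one.
    static Optional<String> missingLibrary(String message) {
        if (message == null) {
            return Optional.empty();
        }
        Matcher m = MISSING_SO.matcher(message);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Message text taken from the error reported in this issue.
        String message = "/tmp/native-forcld3-350533629840224/libforcld3.so: "
                + "libprotobuf.so.9: cannot open shared object file: "
                + "No such file or directory";
        Optional<String> missing = missingLibrary(message);
        if (missing.isPresent()) {
            System.err.println("Native dependency not found: " + missing.get()
                    + ". Install a matching package or rebuild the native library.");
        } else {
            System.err.println("Native library failed to load: " + message);
        }
    }
}
```

In real use, the same method could be called on e.getMessage() inside a catch block for UnsatisfiedLinkError around whatever call first loads the native library.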
We need an automated process to build the pdf-extract .jar file without using a complete IDE like Eclipse (as 'explained' here: https://github.com/bitextor/pdf-extract/blob/master/INSTALL.md). I suggest ant, mvn or gradle.
Otherwise, this code cannot be used in bitextor or python-pdfextract.
Thanks, Leo, Mui is working through this. She found that it needs to be compiled per platform as noted and is preparing instructions and details. I will make sure she checks out the tools above. CLD3 is quite tricky to get working, so it may be worth packaging in a Docker container as a service at some point.
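As a rough illustration of what "compiled per platform" can mean on the Java side, a loader might pick a native build by operating system and architecture. This is a minimal sketch under stated assumptions: the NativePlatform class and the tag naming scheme (for example "linux-x86_64") are hypothetical and are not how pdf-extract or cld3-java actually organise their binaries. It also does not address the protobuf version mismatch described above, which depends on the distribution rather than on OS and architecture.

```java
import java.util.Locale;

// Hypothetical helper; the tag naming scheme is an assumption for illustration.
final class NativePlatform {

    // Maps JVM os.name / os.arch values to a directory-style tag such as "linux-x86_64".
    static String platformTag(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String osPart;
        if (os.contains("linux")) {
            osPart = "linux";
        } else if (os.contains("mac") || os.contains("darwin")) {
            // Checked before "win" because "darwin" contains "win".
            osPart = "macos";
        } else if (os.contains("win")) {
            osPart = "windows";
        } else {
            throw new IllegalArgumentException("Unsupported OS: " + osName);
        }

        String archPart;
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            archPart = "x86_64";
        } else if (arch.equals("aarch64") || arch.equals("arm64")) {
            archPart = "aarch64";
        } else {
            throw new IllegalArgumentException("Unsupported architecture: " + osArch);
        }
        return osPart + "-" + archPart;
    }

    public static void main(String[] args) {
        // Prints the tag for the machine running this JVM.
        System.out.println(platformTag(
                System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```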
A quick update. Mui is working on this. CLD3 is proving quite difficult to deal with. She also looked at CLD2 as an option, as the difference in accuracy is minimal. What she did note is that CLD3 is about 10 times faster than CLD2, which some may find of interest. Hopefully, she can resolve this in the next day or so.
Did you test accuracy in all 30 Paracrawl languages (EU 24 + 3 official in Spain + 2 Norwegians + Icelandic)? I seem to remember people in Alacant found that some language pairs are told apart much better by CLD3 than CLD2.
Marta (@mbanon) is performing a comparison between some popular language detectors: https://github.com/paracrawl/pipeline_evaluation_data/tree/master/lang_ident
This test still lacks per-language precision and performance results, but regarding general performance, CLD2 is much faster than CLD3: https://github.com/paracrawl/pipeline_evaluation_data/blob/master/lang_ident/profiling.txt
@mlforcada is right. UAlacant tested Galician precision specifically in CLD2 vs CLD3 some months ago, and CLD3 detected Galician better. So CLD3 probably works better than CLD2 for smaller languages, at a performance cost.
Hi,
My team informed me that CLD3 was many times faster than CLD2. I will validate this information and come back with more clarity when my team is back in the office.
Regards,
Dion Wiggins Founder and CTO Omniscien Technologies
From: Leopoldo Pla
Sent: Tuesday, February 18, 2020 9:34 PM
Subject: Re: [bitextor/pdf-extract] Exception in thread "main" java.lang.UnsatisfiedLinkError: /tmp/native-forcld3-350533629840224/libforcld3.so: libprotobuf.so.9: cannot open shared object file: No such file or directory (#22)
Mui has been testing the installer script on Fedora, Ubuntu, Red Hat and CentOS. It is now looking good and will be released in the next few days.
And the code can currently be found in branch. . .
Resolved with the update yesterday
Now has a full installer that makes OS-specific adjustments.
Error when running compiled .jar:
Source code is also needed for this.