Closed joewiz closed 8 years ago
Okay, so I would recommend:
1. Update Tika to 1.12.
2. Remove whatever POI version eXist has and replace it with POI 3.13. You will need at least:
http://search.maven.org/remotecontent?filepath=org/apache/poi/poi/3.13/poi-3.13.jar
http://search.maven.org/remotecontent?filepath=org/apache/poi/poi-ooxml/3.13/poi-ooxml-3.13.jar
http://search.maven.org/remotecontent?filepath=org/apache/poi/poi-scratchpad/3.13/poi-scratchpad-3.13.jar
If you don't know where to put those, just drop them in lib/user, and make sure you have removed any old Tika and/or POI versions.
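To check that no old Tika or POI versions are left behind, a small script can list artifacts that appear more than once. This is only a sketch: the function name `find_duplicate_jars` is made up for illustration, and it assumes jars are named `<artifact>-<version>.jar` with the version starting at the first `-<digit>`, which will not hold for every jar.

```python
import re
import sys
from collections import defaultdict
from pathlib import Path

# Splits "poi-ooxml-3.12-beta1.jar" into artifact "poi-ooxml" and version "3.12-beta1".
# Assumption: the version starts at the first "-<digit>"; unversioned jars are skipped.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_jars(lib_dirs):
    """Return {artifact: sorted [(version, path), ...]} for artifacts found in more than one jar."""
    seen = defaultdict(set)
    for lib_dir in lib_dirs:
        for jar in Path(lib_dir).rglob("*.jar"):
            match = JAR_NAME.match(jar.name)
            if match:
                seen[match["artifact"]].add((match["version"], str(jar)))
    # Keep only artifacts that occur in several files (different versions or plain copies).
    return {
        artifact: sorted(entries)
        for artifact, entries in seen.items()
        if len(entries) > 1
    }


if __name__ == "__main__":
    # Pass the directories to scan as arguments, e.g. the eXist lib and extensions folders.
    for artifact, entries in find_duplicate_jars(sys.argv[1:]).items():
        print(artifact)
        for version, path in entries:
            print(f"  {version}  {path}")
```

Any artifact it prints is a candidate for manual pruning; the script itself deletes nothing.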
How do I know this you might wonder? I took a look at the POM file for Tika 1.12 which describes the dependencies that are needed. You can see that here: https://repo1.maven.org/maven2/org/apache/tika/tika-parsers/1.12/tika-parsers-1.12.pom
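Reading the POM by eye works, but the same lookup can be scripted. The sketch below parses a POM and lists its declared dependencies, resolving `${...}` placeholders against the POM's own `<properties>`. The helper name `list_dependencies` and the `SAMPLE_POM` text are hand-written for illustration and are not copied from the real Tika POM; parent POMs and `dependencyManagement` are not handled.

```python
import re
import xml.etree.ElementTree as ET

NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hand-written, abbreviated sample for illustration only (not the real Tika POM).
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <properties><poi.version>3.13</poi.version></properties>
  <dependencies>
    <dependency><groupId>org.apache.poi</groupId><artifactId>poi</artifactId><version>${poi.version}</version></dependency>
    <dependency><groupId>org.apache.poi</groupId><artifactId>poi-ooxml</artifactId><version>${poi.version}</version></dependency>
  </dependencies>
</project>
"""


def list_dependencies(pom_xml):
    """Return (groupId, artifactId, version) tuples declared directly in a POM."""
    root = ET.fromstring(pom_xml)
    # Map property names (namespace stripped) to their values.
    props = {
        child.tag.split("}", 1)[-1]: (child.text or "").strip()
        for child in root.findall("m:properties/*", NS)
    }

    def resolve(text):
        # Replace ${name} with the property value; leave unknown placeholders untouched.
        return re.sub(r"\$\{([^}]+)\}", lambda m: props.get(m.group(1), m.group(0)), text or "")

    deps = []
    for dep in root.findall("m:dependencies/m:dependency", NS):
        deps.append(tuple(
            resolve(dep.findtext(f"m:{field}", default="", namespaces=NS)).strip()
            for field in ("groupId", "artifactId", "version")
        ))
    return deps


if __name__ == "__main__":
    for group, artifact, version in list_dependencies(SAMPLE_POM):
        print(f"{group}:{artifact}:{version}")
```

To use it on the real file, download the POM from the URL above and pass its text to `list_dependencies`.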
Sadly eXist's build process has no real concept of dependencies and so we get these issues. I keep offering to rewrite the build process, I just need to find the time!
Let me know if that works. If so I will send a PR...
On 10 March 2016 at 09:20, Joe Wicentowski notifications@github.com wrote:
Using the current develop branch, contentextraction functions fail with a java.lang.NoClassDefFoundError error.
Steps to reproduce:
1. Upload the attached test files:
   - test.docx https://github.com/eXist-db/exist/files/167258/test.docx
   - test.xlsx https://github.com/eXist-db/exist/files/167259/test.xlsx
2. Submit the following queries in eXide:
   let $binary := util:binary-doc('/db/test.xlsx')
   return contentextraction:get-metadata-and-content($binary)

   let $binary := util:binary-doc('/db/test.docx')
   return contentextraction:get-metadata-and-content($binary)
Resulting error from exist.log:
2016-03-10 09:13:39,318 [eXistThread-53] ERROR (XQueryServlet.java [process]:550) - Could not initialize class org.apache.poi.openxml4j.opc.internal.marshallers.ZipPackagePropertiesMarshaller
java.lang.NoClassDefFoundError: Could not initialize class org.apache.poi.openxml4j.opc.internal.marshallers.ZipPackagePropertiesMarshaller
    at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:161) ~[poi-ooxml-3.12-beta1.jar:3.12-beta1]
    at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:141) ~[poi-ooxml-3.12-beta1.jar:3.12-beta1]
    at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:37) ~[poi-ooxml-3.12-beta1.jar:3.12-beta1]
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:105) ~[poi-ooxml-3.12-beta1.jar:3.12-beta1]
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:224) ~[poi-ooxml-3.12-beta1.jar:3.12-beta1]
    at org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:208) ~[tika-parsers-1.8.jar:1.8]
    at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:145) ~[tika-parsers-1.8.jar:1.8]
    at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88) ~[tika-parsers-1.8.jar:1.8]
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61) ~[tika-core-1.8.jar:1.8]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) ~[tika-core-1.8.jar:1.8]
    at org.exist.contentextraction.ContentExtraction.extractContentAndMetadata(ContentExtraction.java:56) ~[exist-contentextraction.jar:?]
    ...
Reference:
http://stackoverflow.com/questions/28145857/nosuchmethoderror-initialization-failure-while-reading-excel-xlsx-file-using-a
Note: I tried and failed at the following workarounds: (1) upgrade Tika to 1.12 and (2) downgrade POI to 3.9-20121203. My guess is that there are duplicate jars or conflicting classpath issues, but this is beyond my skill level.
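One way to test the duplicate-jar guess is to look for every jar that ships the class named in the error. A "Could not initialize class" message means the class was found but its static initialization failed earlier, so more than one hit, or a hit in an unexpected jar, would be worth investigating. The sketch below is an assumption-laden helper (the name `jars_containing` is made up); it treats jars as zip archives and searches them for the class entry.

```python
import sys
import zipfile
from pathlib import Path


def jars_containing(class_name, lib_dirs):
    """Return sorted paths of jars under lib_dirs that contain the fully qualified class."""
    # Inside a jar, a.b.C is stored as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for lib_dir in lib_dirs:
        for jar in Path(lib_dir).rglob("*.jar"):
            try:
                with zipfile.ZipFile(jar) as archive:
                    if entry in archive.namelist():
                        hits.append(str(jar))
            except zipfile.BadZipFile:
                continue  # not a readable jar; skip it
    return sorted(hits)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python find_class.py <fully.qualified.ClassName> <dir> [<dir> ...]
    for path in jars_containing(sys.argv[1], sys.argv[2:]):
        print(path)
```

For this issue the class to search for would be org.apache.poi.openxml4j.opc.internal.marshallers.ZipPackagePropertiesMarshaller, and the directories would be wherever the local eXist installation keeps its jars.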
— Reply to this email directly or view it on GitHub https://github.com/eXist-db/exist/issues/937.
Adam Retter
skype: adam.retter tweet: adamretter http://www.adamretter.org.uk
@adamretter Thank you! A couple of questions based on my earlier attempt:
When I updated to 1.12, I modified my local copy of https://github.com/eXist-db/exist/blob/develop/extensions/contentextraction/ivy.xml#L5 and changed:
<dependency org="org.apache.tika" name="tika-parsers" rev="1.8" conf="*->*,!sources,!javadoc">
from line 5 to:
<dependency org="org.apache.tika" name="tika-parsers" rev="1.12" conf="*->*,!sources,!javadoc">
When I then rebuilt eXist with build.sh rebuild, I noticed 10-15 new jars in extensions/contentextraction/lib, some of which eXist already had other copies/versions of, e.g. lucene 4.0 jars. eXist wouldn't start with these duplicates, and I wasn't successful at trimming everything out that looked to be a duplicate. In previous upgrades @dizzzz appears to have gone through the list of dependencies for the version in question and added <exclude> directives to ivy.xml.
I can make this change to ivy.xml's dependency/@rev attribute and rebuild to update to Tika 1.12, but I'm worried about the additional dependencies and the conflicts - leading to the startup failure again. Is there a way to do as you suggested but to sidestep the issue of conflicts? I'll be happy to test a procedure to see if it works.
I can revise the exclude list for known JAR files. As @joewiz said, I have done that over and over again; it's not really a problem.
In essence, this kind of issue can pop up with any XAR file that deploys JAR files.
I'd propose to have Ivy do the dependencies; I have excluded some jar files because they were large and I could not think of any use for them (why include database drivers :-) )
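For readers unfamiliar with the exclude list mentioned here: in Ivy, an `<exclude org="..." module="..."/>` child of a `<dependency>` stops that transitive module from being pulled in. The sketch below adds such children to the tika-parsers dependency programmatically. The helper name `add_excludes` is made up, and the excluded module (`org.apache.lucene` / `lucene-core`) is only an illustrative choice based on the Lucene duplicates mentioned above, not the actual exclude list used in eXist.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for extensions/contentextraction/ivy.xml (structure assumed for illustration).
SAMPLE_IVY = """
<ivy-module version="2.0">
  <info organisation="org.exist" module="contentextraction"/>
  <dependencies>
    <dependency org="org.apache.tika" name="tika-parsers" rev="1.12" conf="*->*,!sources,!javadoc"/>
  </dependencies>
</ivy-module>
"""


def add_excludes(ivy_xml, dep_name, excludes):
    """Return ivy_xml with one <exclude org=... module=.../> per (org, module) pair on the named dependency."""
    root = ET.fromstring(ivy_xml)
    for dep in root.iter("dependency"):
        if dep.get("name") == dep_name:
            # Skip pairs that are already excluded so the function can be re-run safely.
            present = {(e.get("org"), e.get("module")) for e in dep.findall("exclude")}
            for org, module in excludes:
                if (org, module) not in present:
                    ET.SubElement(dep, "exclude", {"org": org, "module": module})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical exclusion, chosen only to show the shape of the directive.
    print(add_excludes(SAMPLE_IVY, "tika-parsers", [("org.apache.lucene", "lucene-core")]))
```

In practice the list would be edited by hand in ivy.xml; the point of the sketch is the structure of the directive, not the automation.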
@joewiz Yeah so that is Ivy, which tries to do dependency management for you.
Unfortunately most of eXist is not Ivy aware, and where we do have Ivy we have multiple Ivy scripts that are unaware of each other, so you will always have the potential for version conflicts and duplicates. Apart from you manually pruning it, there is nothing to be done easily without replacing the build process of eXist.
@dizzzz That won't work, as eXist is composed of optional modules. To use Ivy successfully we would need a single Ivy context which is aware of all dependencies in eXist (including optional modules). At the moment we have several independent Ivy contexts, hence the potential for duplicates etc.
As far as I am aware, a single Ivy context cannot handle optional modules and their dependencies. So I don't see Ivy as a solution...
We aren't looking for a perfect solution; Joe just wants to have it working again :-)
@dizzzz Yes I understand. I was just commenting on the larger issue ;-)
Using the current develop branch, contentextraction functions fail with a java.lang.NoClassDefFoundError error.
Steps to reproduce:
Resulting error from exist.log:
Reference:
http://stackoverflow.com/questions/28145857/nosuchmethoderror-initialization-failure-while-reading-excel-xlsx-file-using-a
Note: I tried and failed at the following workarounds: (1) upgrade Tika to 1.12 and (2) downgrade POI to 3.9-20121203. My guess is that there are duplicate jars or conflicting classpath issues, but this is beyond my skill level.