contentextraction (tika) library problem with compressed files (xlsx, docx)

joewiz commented 8 years ago

Using the current develop branch, contentextraction functions fail with a java.lang.NoClassDefFoundError error.

Steps to reproduce:

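1. Upload the attached test files:
   - test.docx https://github.com/eXist-db/exist/files/167258/test.docx
   - test.xlsx https://github.com/eXist-db/exist/files/167259/test.xlsx
2. Submit the following queries in eXide: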

let $binary := util:binary-doc('/db/test.xlsx')
return
    contentextraction:get-metadata-and-content($binary)

let $binary := util:binary-doc('/db/test.docx')
return
    contentextraction:get-metadata-and-content($binary)

Resulting error from exist.log:

2016-03-10 09:13:39,318 [eXistThread-53] ERROR (XQueryServlet.java [process]:550) - Could not initialize class org.apache.poi.openxml4j.opc.internal.marshallers.ZipPackagePropertiesMarshaller 
java.lang.NoClassDefFoundError: Could not initialize class org.apache.poi.openxml4j.opc.internal.marshallers.ZipPackagePropertiesMarshaller
    at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:161) ~[poi-ooxml-3.12-beta1.jar:3.12-beta1]
    at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:141) ~[poi-ooxml-3.12-beta1.jar:3.12-beta1]
    at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:37) ~[poi-ooxml-3.12-beta1.jar:3.12-beta1]
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:105) ~[poi-ooxml-3.12-beta1.jar:3.12-beta1]
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:224) ~[poi-ooxml-3.12-beta1.jar:3.12-beta1]
    at org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:208) ~[tika-parsers-1.8.jar:1.8]
    at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:145) ~[tika-parsers-1.8.jar:1.8]
    at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88) ~[tika-parsers-1.8.jar:1.8]
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61) ~[tika-core-1.8.jar:1.8]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) ~[tika-core-1.8.jar:1.8]
    at org.exist.contentextraction.ContentExtraction.extractContentAndMetadata(ContentExtraction.java:56) ~[exist-contentextraction.jar:?]
...

Reference:

http://stackoverflow.com/questions/28145857/nosuchmethoderror-initialization-failure-while-reading-excel-xlsx-file-using-a

Note: I tried and failed at the following workarounds: (1) upgrade Tika to 1.12 and (2) downgrade POI to 3.9-20121203. My guess is that there are duplicate jars or conflicting classpath issues, but this is beyond my skill level.

adamretter commented 8 years ago

Okay, so I would recommend:

Update Tika to 1.12.

Remove whatever POI version eXist has and replace it with POI 3.13; you will need at least:

http://search.maven.org/remotecontent?filepath=org/apache/poi/poi/3.13/poi-3.13.jar
http://search.maven.org/remotecontent?filepath=org/apache/poi/poi-ooxml/3.13/poi-ooxml-3.13.jar
http://search.maven.org/remotecontent?filepath=org/apache/poi/poi-scratchpad/3.13/poi-scratchpad-3.13.jar

If you don't know where to put those, just drop them in lib/user, and make sure you have removed any old Tika and/or POI versions.

How do I know this, you might wonder? I took a look at the POM file for Tika 1.12, which describes the dependencies that are needed. You can see that here: https://repo1.maven.org/maven2/org/apache/tika/tika-parsers/1.12/tika-parsers-1.12.pom
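
As a rough sketch, this is what the three POI jars linked above might look like if they were declared to Ivy instead of being dropped into lib/user by hand. The coordinates are taken from the Maven URLs above; putting them in an ivy.xml such as the one under extensions/contentextraction is an assumption, not something confirmed to work with eXist's build.

```xml
<!-- Hypothetical additions to an ivy.xml <dependencies> block. -->
<!-- Coordinates mirror the three POI 3.13 jars linked above; the placement is an assumption. -->
<dependency org="org.apache.poi" name="poi" rev="3.13"/>
<dependency org="org.apache.poi" name="poi-ooxml" rev="3.13"/>
<dependency org="org.apache.poi" name="poi-scratchpad" rev="3.13"/>
```

Whichever route is used, the point is the same: only one POI version should end up on the classpath.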

Sadly eXist's build process has no real concept of dependencies and so we get these issues. I keep offering to rewrite the build process, I just need to find the time!

Let me know if that works. If so I will send a PR...

joewiz commented 8 years ago

@adamretter Thank you! A couple of questions based on my earlier attempt:

When I updated to 1.12, I modified my local copy of https://github.com/eXist-db/exist/blob/develop/extensions/contentextraction/ivy.xml#L5, changing line 5 from:

<dependency org="org.apache.tika" name="tika-parsers" rev="1.8" conf="*->*,!sources,!javadoc">

to:

<dependency org="org.apache.tika" name="tika-parsers" rev="1.12" conf="*->*,!sources,!javadoc">

When I then rebuilt eXist with build.sh rebuild, I noticed 10-15 new jars in extensions/contentextraction/lib, some of which eXist already had other copies/versions of, e.g. lucene 4.0 jars. eXist wouldn't start with these duplicates, and I wasn't successful at trimming everything out that looked to be a duplicate. In previous upgrades @dizzzz appears to have gone through the list of dependencies for the version in question and added <exclude> directives to ivy.xml.

I can make this change to ivy.xml's dependency/@rev attribute and rebuild to update to Tika 1.12, but I'm worried about the additional dependencies and the conflicts - leading to the startup failure again. Is there a way to do as you suggested but to sidestep the issue of conflicts? I'll be happy to test a procedure to see if it works.
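
For illustration, an <exclude> directive of the kind mentioned above might look roughly like this. The excluded organisation is only an example based on the duplicate lucene jars; the real list would have to be worked out against the jars eXist already ships.

```xml
<!-- Sketch only: the exclude below is illustrative, not a tested exclude list. -->
<dependency org="org.apache.tika" name="tika-parsers" rev="1.12" conf="*->*,!sources,!javadoc">
    <!-- Skip transitive Lucene artifacts, since eXist already has its own Lucene jars. -->
    <exclude org="org.apache.lucene"/>
</dependency>
```

Each further duplicate would need its own <exclude> entry. Ivy also accepts a module attribute on <exclude> to narrow the match to a single artifact.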

dizzzz commented 8 years ago

I can revise the exclude list for known JAR files. As @joewiz said, I have done that over and over again; it's not really a problem.

In essence, this kind of issue can pop up with any XAR file that deploys JAR files...

dizzzz commented 8 years ago

I'd propose to have Ivy handle the dependencies; I have excluded some jar files because they were large and I could not think of any use for them (why include database drivers :-) )

adamretter commented 8 years ago

@joewiz Yeah so that is Ivy, which tries to do dependency management for you.

Unfortunately most of eXist is not Ivy aware, and where we do have Ivy we have multiple Ivy scripts that are unaware of each other, so you will always have the potential for version conflicts and duplicates. Apart from you manually pruning it, there is nothing to be done easily without replacing the build process of eXist.

adamretter commented 8 years ago

@dizzzz That won't work, as eXist is composed of optional modules. To use Ivy successfully we would need a single Ivy context which is aware of all dependencies in eXist (including optional modules). At the moment we have several independent Ivy contexts, hence the potential for duplicates etc.

As far as I am aware, a single Ivy context cannot handle optional modules and their dependencies. So I don't see Ivy as a solution...

dizzzz commented 8 years ago

We aren't looking for a perfect solution; he just wants to have it working again :-)

adamretter commented 8 years ago

@dizzzz Yes I understand. I was just commenting on the larger issue ;-)
