Open lschult2 opened 3 years ago
I've confirmed the issue is still present with eXist-db 5.3.1
@lschult2 I have added your test.xlsx to a unit test in https://github.com/eXist-db/exist/pull/4168
I spent some time looking into your issue. Unfortunately, it isn't a simple one to trace or understand. There seems to be some unhappy interaction between Tika and eXist-db's CachingFilterInputStream, which feeds it. I will see if I can find some more time to take a deeper look soon...
I've confirmed the issue is still present with eXist-db 6.0.1, but the "Caused by" is different:
2022-05-31 23:23:20,066 [qtp43546754-42] ERROR (ContentFunctions.java [eval]:173) - Problem with content extraction library: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@168d0b70
org.exist.contentextraction.ContentExtractionException: Problem with content extraction library: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@168d0b70
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@168d0b70
Caused by: java.util.zip.ZipException: invalid code -- missing end-of-block
Caused by: java.util.zip.DataFormatException: invalid code -- missing end-of-block
What is the problem
Content extraction for even the simplest of XLSX files fails.
This returns an error (exist.log).
contentextraction:get-metadata-and-content($binary) does work if the binary is a PDF file, but not when it is an XLSX file.
What did you expect
I expected it to return HTML tables of the contents of each sheet in the XLSX file. Tika via the command line does work, and shows what the output should be:
java -jar tika-app-1.23.jar file:///tmp/test.xlsx
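For comparison, the failing call inside eXist-db can be sketched roughly as the query below. It assumes the file has been stored as a binary resource at /db/test.xlsx and that the content extraction module is installed. The import location is the one commonly used for that module and is an assumption here, not something taken from the log.

```xquery
xquery version "3.1";

(: Assumption: the content extraction module is installed and available
   under this namespace and Java class location. :)
import module namespace contentextraction = "http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

(: Read the stored XLSX as xs:base64Binary; the path is the one used in this report. :)
let $binary := util:binary-doc("/db/test.xlsx")
return
    (: Works when $binary is a PDF; for the XLSX it raises the error reported in this issue. :)
    contentextraction:get-metadata-and-content($binary)
```

Running the same query against a stored PDF should return the extracted content, which makes it a quick way to check whether the failure is specific to XLSX input.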
Describe how to reproduce or add a test
Load this test.xlsx file into /db/test.xlsx: https://github.com/eXist-db/exist/files/167259/test.xlsx
Context information