eXist-db / exist

eXist Native XML Database and Application Platform
https://exist-db.org
GNU Lesser General Public License v2.1

contentextraction:get-metadata-and-content() for an XLSX file fails #3835

Open lschult2 opened 3 years ago

lschult2 commented 3 years ago

What is the problem

Content extraction for even the simplest of XLSX files fails.

xquery version "3.1";
let $binary := util:binary-doc('/db/test.xlsx')
return
    contentextraction:get-metadata-and-content($binary)

This returns an error:

exerr:ERROR Problem with content extraction library: Error creating OOXML extractor [at line 5, column 5]

exist.log

2021-04-21 14:35:52,264 [qtp353927430-35] ERROR (ContentFunctions.java [eval]:168) - Problem with content extraction library: Error creating OOXML extractor 
org.exist.contentextraction.ContentExtractionException: Problem with content extraction library: Error creating OOXML extractor
    at org.exist.contentextraction.ContentExtraction.extractContentAndMetadata(ContentExtraction.java:58) ~[exist-contentextraction-5.2.0.jar:5.2.0]
    ...
Caused by: org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:221) ~[tika-parsers-1.23.jar:1.23]
    ...
Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Can't read content types part !
    at org.apache.poi.openxml4j.opc.internal.ContentTypeManager.<init>(ContentTypeManager.java:106) ~[poi-ooxml-4.1.1.jar:4.1.1]
    ...
Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: XML document structures must start and end within the same entity.
    at org.apache.poi.openxml4j.opc.internal.ContentTypeManager.parseContentTypesFile(ContentTypeManager.java:418) ~[poi-ooxml-4.1.1.jar:4.1.1]
    ...

contentextraction:get-metadata-and-content($binary) does work if the binary is a PDF file, but not when it is an XLSX file.
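
The innermost exceptions in the trace complain about the package's content types part. An XLSX file is a ZIP archive whose `[Content_Types].xml` entry must be well-formed XML, so one way to rule out a damaged upload is to check that entry with nothing but the JDK. The following is a minimal sketch and not part of eXist-db, Tika, or POI; the class name `OoxmlPackageCheck` and the method `hasReadableContentTypes` are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper: not an eXist-db, Tika, or POI class.
class OoxmlPackageCheck {

    /**
     * Returns true if the stream is a ZIP archive containing a well-formed
     * "[Content_Types].xml" entry; false if the entry is missing or unreadable.
     */
    static boolean hasReadableContentTypes(InputStream in) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("[Content_Types].xml".equals(entry.getName())) {
                    // Copy the entry first so the XML parser cannot close the ZIP stream.
                    ByteArrayOutputStream copy = new ByteArrayOutputStream();
                    byte[] buf = new byte[4096];
                    int n;
                    while ((n = zip.read(buf)) != -1) {
                        copy.write(buf, 0, n);
                    }
                    // parse() throws SAXException if the XML is not well-formed.
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .parse(new ByteArrayInputStream(copy.toByteArray()));
                    return true;
                }
            }
            return false; // no content types part found
        } catch (Exception e) {
            // Covers ZIP decoding errors, malformed XML, and I/O failures alike.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(hasReadableContentTypes(in));
        }
    }
}
```

With Java 11 or later this can be run directly against a copy of the file, for example `java OoxmlPackageCheck.java /tmp/test.xlsx`. If the check passes for the file on disk while extraction inside eXist-db still fails, the problem is more likely in how the bytes reach the parser than in the file itself.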

What did you expect

I expected it to return an HTML table of the contents of each sheet in the XLSX file. Running Tika from the command line does work and shows what the output should be:

java -jar tika-app-1.23.jar file:///tmp/test.xlsx

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2016-03-10T14:09:51Z"/>
<meta name="extended-properties:AppVersion" content="15.0300"/>
<meta name="dc:creator" content="Microsoft Office User"/>
<meta name="extended-properties:Company" content=""/>
<meta name="dcterms:created" content="2016-03-10T14:09:44Z"/>
<meta name="dcterms:modified" content="2016-03-10T14:09:51Z"/>
<meta name="Last-Modified" content="2016-03-10T14:09:51Z"/>
<meta name="Last-Save-Date" content="2016-03-10T14:09:51Z"/>
<meta name="protected" content="false"/>
<meta name="meta:save-date" content="2016-03-10T14:09:51Z"/>
<meta name="Application-Name" content="Microsoft Macintosh Excel"/>
<meta name="modified" content="2016-03-10T14:09:51Z"/>
<meta name="Content-Length" content="21664"/>
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="creator" content="Microsoft Office User"/>
<meta name="meta:author" content="Microsoft Office User"/>
<meta name="meta:creation-date" content="2016-03-10T14:09:44Z"/>
<meta name="extended-properties:Application" content="Microsoft Macintosh Excel"/>
<meta name="meta:last-author" content="Microsoft Office User"/>
<meta name="Creation-Date" content="2016-03-10T14:09:44Z"/>
<meta name="resourceName" content="test.xlsx"/>
<meta name="Last-Author" content="Microsoft Office User"/>
<meta name="Application-Version" content="15.0300"/>
<meta name="Author" content="Microsoft Office User"/>
<meta name="publisher" content=""/>
<meta name="dc:publisher" content=""/>
<title/>
</head>
<body><div><h1>Sheet1</h1>
<table><tbody><tr>  <td>blah</td></tr>
</tbody></table>
</div>
<div class="embedded" id="/docProps/thumbnail.jpeg"/></body></html>

Describe how to reproduce or add a test

Load this test.xlsx file into /db/test.xlsx: https://github.com/eXist-db/exist/files/167259/test.xlsx

xquery version "3.1";
let $binary := util:binary-doc('/db/test.xlsx')
return
    contentextraction:get-metadata-and-content($binary)

Context information

lschult2 commented 2 years ago

I've confirmed the issue is still present with eXist-db 5.3.1.

adamretter commented 2 years ago

@lschult2 I have added your test.xlsx to a unit test in https://github.com/eXist-db/exist/pull/4168

I spent some time looking into your issue. Unfortunately, it isn't a simple one to trace or understand. There seems to be some unhappy interaction between Tika and eXist-db's CachingFilterInputStream, which feeds it. I will see if I can find some more time to take a deeper look soon...
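
One way to probe a suspected stream interaction like this is to exercise a stream the way a content detector typically does: mark it, peek at a prefix, reset, then read everything and compare against the known-good bytes. The sketch below is only an illustration under that assumption. It does not use or describe the actual CachingFilterInputStream API, and `MarkResetCheck`, `drain`, and `survivesPeek` are hypothetical names; in `main`, a plain `BufferedInputStream` merely stands in for whatever wrapper is under suspicion.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical diagnostic helper: not an eXist-db or Tika class.
class MarkResetCheck {

    /** Reads the stream to its end and returns every byte it delivered. */
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /**
     * Marks the stream, peeks at up to peekLength bytes, resets, then drains it.
     * Returns true only if the drained bytes equal the expected bytes.
     */
    static boolean survivesPeek(InputStream in, byte[] expected, int peekLength) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        in.mark(peekLength);
        byte[] peek = new byte[peekLength];
        int read = 0;
        while (read < peekLength) {
            int n = in.read(peek, read, peekLength - read);
            if (n == -1) {
                break; // stream is shorter than the requested peek
            }
            read += n;
        }
        in.reset();
        return Arrays.equals(expected, drain(in));
    }

    public static void main(String[] args) throws IOException {
        byte[] expected = Files.readAllBytes(Paths.get(args[0]));
        // BufferedInputStream is a stand-in for the wrapper under suspicion.
        InputStream candidate = new BufferedInputStream(new ByteArrayInputStream(expected));
        System.out.println(survivesPeek(candidate, expected, 8192));
    }
}
```

A wrapper that loses or repeats bytes after `reset()` would make this check return false, which would be consistent with a downstream parser seeing a truncated or corrupted ZIP entry.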

lschult2 commented 2 years ago

I've confirmed the issue is still present with eXist-db 6.0.1, but the "Caused by" chain is different:

2022-05-31 23:23:20,066 [qtp43546754-42] ERROR (ContentFunctions.java [eval]:173) - Problem with content extraction library: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@168d0b70 
org.exist.contentextraction.ContentExtractionException: Problem with content extraction library: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@168d0b70

Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@168d0b70

Caused by: java.util.zip.ZipException: invalid code -- missing end-of-block

Caused by: java.util.zip.DataFormatException: invalid code -- missing end-of-block