Norconex / importer

Norconex Importer is a Java library and command-line application for parsing and extracting content out of files of any format (HTML, PDF, Word, etc.) as plain text. In addition, it lets you manipulate the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Importing MS Office documents fails if MIME type is wrong #67

Closed · ronjakoi closed this 6 years ago

ronjakoi commented 6 years ago

I need to crawl an intranet with a lot of file attachments. Many of them are Microsoft Office documents. Unfortunately, they are fairly consistently served with the wrong Content-Type: .pptx files arrive as application/vnd.ms-powerpoint, even though (as far as I can tell) the correct type would be something like application/vnd.openxmlformats-officedocument.presentationml.presentation.

So this throws a com.norconex.importer.parser.DocumentParserException, ultimately caused by org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF).

See attachment for full stack trace (edited for privacy).

My question is: is there a way that Importer could catch the OfficeXmlFileException from Apache POI and try a different Office file format before giving up?

Or is there another way I could approach this problem?

ronjakoi commented 6 years ago

I should mention that fixing the intranet is not a feasible solution, as it is legacy software scheduled for replacement within a year or so.

essiembre commented 6 years ago

A possible solution: in case it is the extension throwing it off, change it before you import. For instance, if you are using the HTTP Collector, you can use the GenericURLNormalizer to perform a search and replace on the extension of faulty documents.

Otherwise, a sample document would be useful to help reproduce this. Is there one you can share?

ronjakoi commented 6 years ago

There is no extension on the URI itself; however, the response headers from the intranet do contain a line like this:

content-disposition: inline; filename="something.pptx"

Unfortunately I cannot share a document.

essiembre commented 6 years ago

How about giving it one (an extension)? Sometimes it can make a difference.

ronjakoi commented 6 years ago

I cannot modify the intranet, as it is legacy software, and changing the URIs would be a lot of work and would break a lot of things.

essiembre commented 6 years ago

I meant that with GenericURLNormalizer you can perform a search and replace on the URLs to give the problematic ones an extension and see if that helps.
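
For example, something along these lines (just a sketch; the match pattern here is made up, and you would adapt it to your actual problematic URLs):

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <replacements>
    <replace>
      <!-- Hypothetical pattern: appends ".pptx" to extension-less attachment URLs -->
      <match>(/attachments/[^./]+)$</match>
      <replacement>$1.pptx</replacement>
    </replace>
  </replacements>
</urlNormalizer>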

ronjakoi commented 6 years ago

But if I normalize the URL to something that it's not, then the reference will be wrong, correct?

Also, I don't see anything in GenericURLNormalizer that would allow me to add file extensions based on MIME type.

essiembre commented 6 years ago

Modifying the URL is just meant to help with troubleshooting.

One other thing you can try is to download a faulty file and run the Importer on it directly. If you downloaded the Importer on its own, you will have an "importer.sh" or "importer.bat" script. See if you get better results that way. This could rule out (or confirm) whether the issue is the invalid HTTP Content-Type or something else.

ronjakoi commented 6 years ago

Running importer.sh on the file directly does import it successfully. I haven't been able to find a solution to work the file extension into my URL, though.

I tried troubleshooting by just having the one URL in my seed list and adding the extension with:

<replace>
    <match>$</match>
    <replacement>\.pptx</replacement>
</replace>

But of course all that gets me is INFO [CrawlerEventManager] REJECTED_NOTFOUND: https://intranet.mysite.fi/path/to/file.pptx, because the file is at https://intranet.mysite.fi/path/to/file.

So then I tried something like this:

<importer>
    <preParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger" >
            <script><![CDATA[
            if(metadata["content-disposition"]) {
                var found = metadata["content-disposition"][0].match('filename="(.*)"');
                if (found && found[1]) {
                    var filename = found[1];
                    metadata.addString("url_original", reference);
                    reference = reference.concat("/", encodeURIComponent(filename));
                }
            }
            ]]>
            </script>
        </tagger>
    </preParseHandlers>

    <postParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger" >
            <script><![CDATA[
            if(metadata["url_original"]) {
                reference = metadata["url_original"][0];
            }
            ]]>
            </script>
        </tagger>
    </postParseHandlers>
</importer>

Essentially the idea is to detect in pre-parsing whether there is a content-disposition header, take the original filename from it, and stick that on the end of the reference. I hoped the file extension on the reference would then trigger Importer to use the correct parser. In post-parsing I would switch back to the original URL.

However, editing the reference doesn't seem to affect what Importer thinks it is. I still get essentially the same error log:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://intranet.mysite.fi/path/to/file
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://intranet.mysite.fi/path/to/file
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://intranet.mysite.fi/path/to/file
WARN  [Importer] Could not import https://intranet.mysite.fi/path/to/file
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4fa598c8
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:151)
    at com.norconex.importer.Importer.parseDocument(Importer.java:422)
    at com.norconex.importer.Importer.importDocument(Importer.java:318)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:271)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:358)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:521)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:407)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:789)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4fa598c8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:410)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:148)
    ... 14 more
Caused by: org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:152)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:302)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:122)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 19 more
INFO  [CrawlerEventManager]           REJECTED_IMPORT: https://intranet.mysite.fi/path/to/file (com.norconex.importer.response.ImporterResponse@3278d502)

Any ideas? The only workaround I can still think of is to build a lookup table of Office file extensions and their correct MIME types, and then use ScriptTagger in pre-parsing to replace the Content-Type metadata according to the extension in the content-disposition: inline; filename="foo.bar" header.

essiembre commented 6 years ago

You can't share your documents publicly, but can you send one to me directly via email? You will find my address on my profile.

ronjakoi commented 6 years ago

I tried this:

<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger" >
    <script><![CDATA[
    if(metadata["content-disposition"]) {
        // https://blogs.msdn.microsoft.com/vsofficedeveloper/2008/05/08/office-2007-file-format-mime-types-for-http-content-streaming-2/
        var office_mimetypes = {
            "doc": "application/msword",
            "dot": "application/msword",
            "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "dotx": "application/vnd.openxmlformats-officedocument.wordprocessingml.template",
            "docm": "application/vnd.ms-word.document.macroEnabled.12",
            "dotm": "application/vnd.ms-word.template.macroEnabled.12",
            "xls": "application/vnd.ms-excel",
            "xlt": "application/vnd.ms-excel",
            "xla": "application/vnd.ms-excel",
            "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "xltx": "application/vnd.openxmlformats-officedocument.spreadsheetml.template",
            "xlsm": "application/vnd.ms-excel.sheet.macroEnabled.12",
            "xltm": "application/vnd.ms-excel.template.macroEnabled.12",
            "xlam": "application/vnd.ms-excel.addin.macroEnabled.12",
            "xlsb": "application/vnd.ms-excel.sheet.binary.macroEnabled.12",
            "ppt": "application/vnd.ms-powerpoint",
            "pot": "application/vnd.ms-powerpoint",
            "pps": "application/vnd.ms-powerpoint",
            "ppa": "application/vnd.ms-powerpoint",
            "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
            "potx": "application/vnd.openxmlformats-officedocument.presentationml.template",
            "ppsx": "application/vnd.openxmlformats-officedocument.presentationml.slideshow",
            "ppam": "application/vnd.ms-powerpoint.addin.macroEnabled.12",
            "pptm": "application/vnd.ms-powerpoint.presentation.macroEnabled.12",
            "potm": "application/vnd.ms-powerpoint.presentation.macroEnabled.12",
            "ppsm": "application/vnd.ms-powerpoint.slideshow.macroEnabled.12"
        };

        var found = metadata["content-disposition"][0].match('filename="(.*)"');
        if (found && found[1]) {
            var filename = found[1];
            var splitname = filename.split(".");
            var ext = splitname[splitname.length - 1];
            if(splitname.length > 1 && ext in office_mimetypes) {
                metadata["Content-Type"][0] = office_mimetypes[ext];
            }
        }
    }
    ]]>
    </script>
</tagger>

I still get the same stack trace. If I use DebugTagger, it does show the modified Content-Type, but I guess by the time Importer runs any taggers, even pre-parse ones, it has already decided which parser to call.

I will try and ask someone about sharing a file, but I doubt that will fix the problem. As far as I can tell, there isn't anything wrong with these files; the intranet just lies about their MIME type. Either Importer or Tika then blindly trusts the type provided by the web server, picks a parser based on it, and the parser fails because the file is not what it was prepared to receive. And it doesn't look like I can use any Norconex-provided scripting interfaces to influence that.

ronjakoi commented 6 years ago

In Importer.java, lines 191–201:

public ImporterResponse importDocument(final InputStream input, 
        ContentType contentType, String charEncoding,
        Properties metadata, String reference) {        
    try {
        return doImportDocument(input, contentType, 
                charEncoding, metadata, reference);
    } catch (ImporterException e) {
        LOG.warn("Could not import " + reference, e);
        return new ImporterResponse(reference, new ImporterStatus(e));
    }
}

The contentType variable is fed directly from the HTTP Collector. Then in doImportDocument() there's this from line 227:

        //--- Content Type ---
        ContentType safeContentType = contentType;
        if (safeContentType == null 
                || StringUtils.isBlank(safeContentType.toString())) {
            try {
                safeContentType = 
                        contentTypeDetector.detect(content, reference);
            } catch (IOException e) {
                LOG.error("Could not detect content type. Defaulting to "
                        + "\"application/octet-stream\".", e);
                safeContentType = 
                        ContentType.valueOf("application/octet-stream");
            }
        }

If contentType is null or an empty string, contentTypeDetector is used to find out the type. This is the reason the command-line tool importer.sh works on my test file: on the command line there is no contentType, so it is auto-detected.

Lines 266–268:

            ImporterDocument document = 
                    new ImporterDocument(reference, content, meta);
            document.setContentType(safeContentType);

The same untouched content type still follows along when constructing the document object and is then fed to the private method importDocument() on line 271:

            ImporterStatus filterStatus = importDocument(document, nestedDocs);

importDocument() executes the pre-parse handlers, but as far as I can tell those never touch the document's content type. Finally the document object is fed to parseDocument(), still carrying the original content type from the HTTP header (line 318):

        parseDocument(document, nestedDocs);

Lines 399–405 in parseDocument():

    private void parseDocument(
            ImporterDocument doc, List<ImporterDocument> embeddedDocs)
            throws IOException, ImporterException {

        IDocumentParserFactory factory = importerConfig.getParserFactory();
        IDocumentParser parser = 
                factory.getParser(doc.getReference(), doc.getContentType());

So the only thing Importer uses to choose a parser seems to be the Content-Type header from the HTTP server, and when that is incorrect (but non-null and non-empty), no auto-detection is done: the parser throws an exception and the import fails.

Suggestion: please wrap the call to parseDocument() in a try-catch; if parsing fails, run contentTypeDetector.detect() on the document to see whether it yields a different content type, then run parseDocument() again. Only if that also fails should the import be considered unsuccessful.
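
Something like this, perhaps, reusing the names from the code quoted above (purely illustrative; I have not tried to compile it, and I am guessing at accessors such as document.getContent()):

try {
    parseDocument(document, nestedDocs);
} catch (DocumentParserException e) {
    // Fallback idea: ignore the HTTP header, sniff the actual type from
    // the content, and retry once if the detector disagrees.
    ContentType detected = contentTypeDetector.detect(
            document.getContent(), document.getReference());
    if (detected != null && !detected.equals(document.getContentType())) {
        document.setContentType(detected);
        parseDocument(document, nestedDocs); // second and final attempt
    } else {
        throw e;
    }
}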

Another solution could be to take the content type from the modified metadata object after pre-parsing, in case the user has changed metadata["content-type"] with a tagger (as I attempted above).
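
That could be as simple as something like this just before parseDocument() is called (again only an illustration; I am assuming an accessor like document.getMetadata() exists):

// Honor a Content-Type that a pre-parse tagger may have rewritten.
String metaType = document.getMetadata().getString("Content-Type");
if (StringUtils.isNotBlank(metaType)) {
    document.setContentType(ContentType.valueOf(metaType));
}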

Unfortunately I am not good enough at Java, or familiar enough with the codebase, to make a patch myself; I can just read Java a little bit.

ronjakoi commented 6 years ago

What is the status of this issue, please?

essiembre commented 6 years ago

I was hoping to get my hands on a document to reproduce this with, but when I have a chance I will see whether your suggestion impacts anything else. Thanks for taking the time to troubleshoot deeper on your side. If you ever get the hang of it, you could also submit a pull request. :-)

essiembre commented 6 years ago

You know what, I think the solution may have been under our noses all this time. :-)

In your HTTP Collector crawler, you can set detectContentType (or even detectCharset) on your document fetcher if you do not want to trust what's coming from the web server.
Basically, you should have this in your crawler config:

<documentFetcher detectContentType="true" detectCharset="true"/>

That will perform the same type of detection the Importer does when no content type is provided.

Please give it a try and confirm.

ronjakoi commented 6 years ago

Thank you very much; I can't believe I didn't see that option! This completely solves my problem. It seems to increase CPU usage quite a bit, but that is a small price to pay :+1: