Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or a filesystem and store it in data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

com.norconex.importer.parser.DocumentParserException: Unable to extract content from PDF #467

Closed by bockensm 6 years ago

bockensm commented 6 years ago

I'm using the Norconex HTTP Collector (v2.8.0) and am having issues extracting content from PDFs.

Here's a gist of the error: https://gist.github.com/mbockenstedt/4f521a44f21221671c64e62fc0db2123

(I'm unsure how to expand the "34 more" at the bottom. Maybe there's a clue buried in there that would be helpful.)

The URL of the file is in the first line. I contacted the PDFBox mailing list, and they told me they were able to extract the document's contents, so I thought I'd come here and ask for advice.

I have seen this problem when using the versions of PDFBox and Tika that come with 2.8.0, as well as when I manually dropped the 2.0.8 JARs into my lib directory. The Tika JARs all indicate they're version 1.16.

essiembre commented 6 years ago

Can you attach your config? Is it possible you are using pre-parse handlers meant for text on PDFs? Pre-parse handlers run on the raw document content, so a text-oriented handler can mangle a PDF's binary stream before it ever reaches the parser. If so, add the <restrictTo ...> tag so those handlers only deal with text files. For instance, for a "tagger", it could look like this:

<tagger ...>
  ...
  <restrictTo field="document.contentType">text/html</restrictTo>
</tagger>
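
Since the value of <restrictTo> is a regular expression matched against the field, one restriction can also cover several content types at once. A minimal sketch, assuming you want to allow every text type (the text/.* pattern is illustrative, not from the original thread):

<tagger ...>
  ...
  <!-- Restricts this tagger to any text/* content type. -->
  <restrictTo field="document.contentType">text/.*</restrictTo>
</tagger>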
bockensm commented 6 years ago

@essiembre It looks like adding <restrictTo> for the text/html content type was the missing piece. I added it under <preParseHandlers> and the crawler now runs with no warnings.

This issue has plagued me for weeks, so I am relieved that it is behind me. Thank you very much for your help!

The config we're using follows.

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="svee_skiffmed_com-collector">
  <progressDir>/opt/collectors/svee_skiffmed_com/progress</progressDir>
  <logsDir>/opt/collectors/svee_skiffmed_com/logs</logsDir>

  <crawlers>
    <crawler id="svee_skiffmed_com">
      <startURLs stayOnDomain="true">
        <url>https://www.skiffmed.com/</url>
      </startURLs>

      <userAgent>[redacted]</userAgent>
      <maxDepth>10</maxDepth>
      <numThreads>20</numThreads>
      <delay default="0" />
      <sitemapResolverFactory ignore="true" />
      <workDir>/opt/collectors/svee_skiffmed_com/work</workDir>
      <orphansStrategy>DELETE</orphansStrategy>

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" caseSensitive="false" onMatch="exclude">https://www\.skiffmed\.com/sitevizenterprise/website/.*</filter>

        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" caseSensitive="false" onMatch="exclude">https://www\.skiffmed\.com/account/.*</filter>
      </referenceFilters>

      <!-- Document importing -->
      <importer>
        <preParseHandlers>
          <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true" caseSensitive="false">
            <stripBetween>
              <start><![CDATA[<!--crawler_ignore-->]]></start>
              <end><![CDATA[<!--/crawler_ignore-->]]></end>
            </stripBetween>
          </transformer>
        </preParseHandlers>

        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields, make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer>

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <indexName>svee_skiffmed_com</indexName>
        <nodes>[redacted]</nodes>
        <typeName>doc</typeName>
        <queueDir>/opt/collectors/svee_skiffmed_com/queue</queueDir>
      </committer>

      <crawlerListeners>
        <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
          <statusCodes>403,404</statusCodes>
          <outputDir>/opt/collectors/svee_skiffmed_com/broken</outputDir>
        </listener>
      </crawlerListeners>
    </crawler>
  </crawlers>
</httpcollector>
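
For reference, a minimal sketch of the corrected pre-parse handler, assuming the fix mirrors essiembre's tagger example; the <restrictTo> line is the only change from the config above:

<preParseHandlers>
  <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true" caseSensitive="false">
    <!-- Run this text-oriented transformer only on HTML pages so PDFs
         and other binaries reach the parser untouched. -->
    <restrictTo field="document.contentType">text/html</restrictTo>
    <stripBetween>
      <start><![CDATA[<!--crawler_ignore-->]]></start>
      <end><![CDATA[<!--/crawler_ignore-->]]></end>
    </stripBetween>
  </transformer>
</preParseHandlers>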