Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or a filesystem and store it in data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

com.norconex.importer.parser.DocumentParserException: Unable to extract content from PDF #467

Closed by bockensm 6 years ago

bockensm commented 6 years ago

I'm using the Norconex HTTP Collector (v2.8.0) and am having issues extracting content from PDFs.

Here's a gist of the error: https://gist.github.com/mbockenstedt/4f521a44f21221671c64e62fc0db2123

(I'm unsure how to expand the "34 more" at the bottom. Maybe there's a clue buried in there that would be helpful.)

The URL of the file is in the first line. I contacted the PDFBox mailing list, and they told me they were able to extract the document's contents, so I thought I'd come here and ask for advice.

I have seen this problem when using the versions of PDFBox and Tika that come with 2.8.0, as well as when I manually dropped the 2.0.8 JARs into my lib directory. The Tika JARs all indicate they're version 1.16.

essiembre commented 6 years ago

Can you attach your config? Is it possible you are using pre-parse handlers meant for text on PDFs? Pre-parse handlers run on the raw document content, so a text-oriented handler can mangle a PDF's binary stream before it ever reaches the parser. If so, add the <restrictTo ...> tag so those handlers only deal with text files. For instance, for a "tagger", it could look like this:

<tagger ...>
  ...
  <restrictTo field="document.contentType">text/html</restrictTo>
</tagger>
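
Since the value of <restrictTo> is a regular expression matched against the field, one restriction can also cover several content types at once. A minimal sketch, assuming you want to allow every text type (the text/.* pattern is illustrative, not from the original thread):

<tagger ...>
  ...
  <!-- Restricts this tagger to any text/* content type. -->
  <restrictTo field="document.contentType">text/.*</restrictTo>
</tagger>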
bockensm commented 6 years ago

@essiembre It looks like adding <restrictTo> for the text/html content type was the missing piece. I added it under <preParseHandlers> and the crawler now runs with no warnings.

This issue has plagued me for weeks, so I am relieved that it is behind me. Thank you very much for your help!

The config we're using follows.

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="svee_skiffmed_com-collector">
  <progressDir>/opt/collectors/svee_skiffmed_com/progress</progressDir>
  <logsDir>/opt/collectors/svee_skiffmed_com/logs</logsDir>

  <crawlers>
    <crawler id="svee_skiffmed_com">
      <startURLs stayOnDomain="true">
        <url>https://www.skiffmed.com/</url>
      </startURLs>

      <userAgent>[redacted]</userAgent>
      <maxDepth>10</maxDepth>
      <numThreads>20</numThreads>
      <delay default="0" />
      <sitemapResolverFactory ignore="true" />
      <workDir>/opt/collectors/svee_skiffmed_com/work</workDir>
      <orphansStrategy>DELETE</orphansStrategy>

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" caseSensitive="false" onMatch="exclude">https://www\.skiffmed\.com/sitevizenterprise/website/.*</filter>

        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" caseSensitive="false" onMatch="exclude">https://www\.skiffmed\.com/account/.*</filter>
      </referenceFilters>

      <!-- Document importing -->
      <importer>
        <preParseHandlers>
          <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true" caseSensitive="false">
            <stripBetween>
              <start><![CDATA[<!--crawler_ignore-->]]></start>
              <end><![CDATA[<!--/crawler_ignore-->]]></end>
            </stripBetween>
          </transformer>
        </preParseHandlers>

        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields, make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer>

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <indexName>svee_skiffmed_com</indexName>
        <nodes>[redacted]</nodes>
        <typeName>doc</typeName>
        <queueDir>/opt/collectors/svee_skiffmed_com/queue</queueDir>
      </committer>

      <crawlerListeners>
        <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
          <statusCodes>403,404</statusCodes>
          <outputDir>/opt/collectors/svee_skiffmed_com/broken</outputDir>
        </listener>
      </crawlerListeners>
    </crawler>
  </crawlers>
</httpcollector>
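
For reference, a minimal sketch of the corrected pre-parse handler, assuming the fix mirrors essiembre's tagger example; the <restrictTo> line is the only change from the config above:

<preParseHandlers>
  <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true" caseSensitive="false">
    <!-- Run this text-oriented transformer only on HTML pages so PDFs
         and other binaries reach the parser untouched. -->
    <restrictTo field="document.contentType">text/html</restrictTo>
    <stripBetween>
      <start><![CDATA[<!--crawler_ignore-->]]></start>
      <end><![CDATA[<!--/crawler_ignore-->]]></end>
    </stripBetween>
  </transformer>
</preParseHandlers>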