Closed: bockensm closed this issue 6 years ago.
Can you attach your config? Is it possible you are applying pre-parse handlers meant for text to PDFs? If so, make sure you add the <restrictTo ...> tag so those handlers only deal with text files. For instance, for a "tagger", it could look like this:
<tagger ...>
  ...
  <restrictTo field="document.contentType">text/html</restrictTo>
</tagger>
@essiembre
It looks like adding <restrictTo> for the text/html content type was the missing piece. I added it to the transformer under <preParseHandlers> and the crawler now runs with no warnings.
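In case it helps anyone else, here is roughly what the fixed handler looks like; the only line that is new relative to the config below is the <restrictTo> one:

<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true" caseSensitive="false">
  <!-- Only run this pre-parse handler on HTML documents, not PDFs -->
  <restrictTo field="document.contentType">text/html</restrictTo>
  <stripBetween>
    <start><![CDATA[<!--crawler_ignore-->]]></start>
    <end><![CDATA[<!--/crawler_ignore-->]]></end>
  </stripBetween>
</transformer>

Without the restriction, the transformer runs on every downloaded document, including PDFs, which is presumably where the warnings were coming from.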
This issue has plagued me for weeks, so I am relieved that it is behind me. Thank you very much for your help!
The config we're using follows.
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="svee_skiffmed_com-collector">
  <progressDir>/opt/collectors/svee_skiffmed_com/progress</progressDir>
  <logsDir>/opt/collectors/svee_skiffmed_com/logs</logsDir>
  <crawlers>
    <crawler id="svee_skiffmed_com">
      <startURLs stayOnDomain="true">
        <url>https://www.skiffmed.com/</url>
      </startURLs>
      <userAgent>[redacted]</userAgent>
      <maxDepth>10</maxDepth>
      <numThreads>20</numThreads>
      <delay default="0" />
      <sitemapResolverFactory ignore="true" />
      <workDir>/opt/collectors/svee_skiffmed_com/work</workDir>
      <orphansStrategy>DELETE</orphansStrategy>
      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" caseSensitive="false" onMatch="exclude">https://www.skiffmed.com/sitevizenterprise/website/.*</filter>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" caseSensitive="false" onMatch="exclude">https://www.skiffmed.com/account/.*</filter>
      </referenceFilters>
      <!-- Document importing -->
      <importer>
        <preParseHandlers>
          <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true" caseSensitive="false">
            <stripBetween>
              <start><![CDATA[<!--crawler_ignore-->]]></start>
              <end><![CDATA[<!--/crawler_ignore-->]]></end>
            </stripBetween>
          </transformer>
        </preParseHandlers>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields, make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer>
      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <indexName>svee_skiffmed_com</indexName>
        <nodes>[redacted]</nodes>
        <typeName>doc</typeName>
        <queueDir>/opt/collectors/svee_skiffmed_com/queue</queueDir>
      </committer>
      <crawlerListeners>
        <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
          <statusCodes>403,404</statusCodes>
          <outputDir>/opt/collectors/svee_skiffmed_com/broken</outputDir>
        </listener>
      </crawlerListeners>
    </crawler>
  </crawlers>
</httpcollector>
I'm using the Norconex HTTP Collector (v2.8.0) and am having issues extracting content from PDFs.
Here's a gist of the error: https://gist.github.com/mbockenstedt/4f521a44f21221671c64e62fc0db2123
(I'm unsure how to expand the "... 34 more" at the bottom of the stack trace. Maybe there's a clue buried in there that would be helpful.)
The URL to the file is in the first line. I contacted the PDFBox mailing list, and they told me they were able to extract the contents of the document, so I thought I'd come here and ask for advice.
I have seen this problem with the versions of PDFBox and Tika that ship with 2.8.0, as well as after manually dropping the PDFBox 2.0.8 JARs into my lib directory. The Tika JARs all indicate they're version 1.16.