Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from websites or filesystems and send it to various data repositories, such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

No successful imports in 2.8.0 (NullPointerException and more) #456

Closed. ronjakoi closed this issue 6 years ago.

ronjakoi commented 6 years ago

I just tried to upgrade from 2.7.1 to 2.8.0 in my test environment. I didn't touch my configuration (which works in 2.7.1) at all.

It looks like I don't get any successful imports. I get two kinds of errors. The first one is:

intranet-en: 2018-01-17 14:14:33 INFO -            REJECTED_ERROR: https://intranet.mydomain.fi/REDACTED (java.lang.NullPointerException)
intranet-en: 2018-01-17 14:14:33 ERROR - intranet-en: Could not process document: https://intranet.mydomain.fi/REDACTED (null)
java.lang.NullPointerException
        at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
        at java.util.regex.Matcher.reset(Matcher.java:309)
        at java.util.regex.Matcher.<init>(Matcher.java:229)
        at java.util.regex.Pattern.matcher(Pattern.java:1093)
        at com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver.getMatchingMinFrequency(GenericRecrawlableResolver.java:218)
        at com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver.isRecrawlable(GenericRecrawlableResolver.java:189)
        at com.norconex.collector.http.pipeline.importer.RecrawlableResolverStage.executeStage(RecrawlableResolverStage.java:60)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

The second one is (this is a different URL):

intranet-en: 2018-01-17 14:14:45 INFO -           REJECTED_IMPORT: https://intranet.mydomain.fi/REDACTED (com.norconex.importer.response.ImporterResponse@2155f4cc)

I get absolutely no DOCUMENT_IMPORTED or DOCUMENT_COMMITTED_ADD events in my log.

The first error mentions GenericRecrawlableResolver. This is my relevant configuration:

        <recrawlableResolver
         class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
         sitemapSupport="first" >
            <minFrequency applyTo="contentType" value="monthly">application/pdf</minFrequency>
            <minFrequency applyTo="contentType" value="monthly">application/(.*powerpoint|.*presentationml).*</minFrequency>
        </recrawlableResolver>
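For context, the `Matcher.getTextLength` frame at the top of the stack trace is the signature of calling `Pattern.matcher(null)`: in Java 8, the `Matcher` constructor calls `reset()`, which dereferences the input. This suggests `GenericRecrawlableResolver` matched a `minFrequency` pattern against a document whose cached content type was null. A minimal sketch reproducing just that exception (the class and method names here are illustrative, not Norconex code):

```java
import java.util.regex.Pattern;

public class NullMatchDemo {
    // Returns true if matching a pattern against a null input throws a
    // NullPointerException, mirroring the Matcher.getTextLength frame
    // in the stack trace above.
    static boolean matchThrowsNpe() {
        // Same kind of pattern as the minFrequency config entries.
        Pattern contentTypePattern = Pattern.compile("application/pdf");
        try {
            contentTypePattern.matcher(null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(matchThrowsNpe()
                ? "NullPointerException, as in the stack trace"
                : "no exception");
    }
}
```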
essiembre commented 6 years ago

When you upgraded, did you overwrite existing folders/files, or did you start with a clean install? If you did the former, you likely have conflicting versions of JARs in the "lib" folder. It is best to install this new release in a new folder and copy over your configs (and reinstall/copy whatever committer you use).

If that's not your situation, can you share a URL causing this problem?

ronjakoi commented 6 years ago

My Ansible deployment does indeed delete the existing collector directory beforehand, so there are no old JARs in the lib folder.

I got it to work by moving my work directories aside so that the collector could start from scratch. Somehow, existing workdirs from 2.7.1 cause these "Could not process document" errors.
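For anyone hitting the same problem, the workaround amounts to moving the old work directory out of the way before starting the 2.8.0 collector, so it rebuilds its crawl store from scratch. A minimal sketch; the paths below are examples only, substitute the workdir path from your own collector config:

```shell
# Demo in a temporary directory: simulate a 2.7.1 workdir, then move it
# aside (rather than deleting it, so it can be restored if needed).
ROOT="$(mktemp -d)"
mkdir -p "$ROOT/workdir"

mv "$ROOT/workdir" "$ROOT/workdir-2.7.1-backup"
echo "moved workdir to $ROOT/workdir-2.7.1-backup"
```

Renaming rather than deleting keeps the old crawl store around in case you need to roll back to 2.7.1.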

essiembre commented 6 years ago

Glad you have it working.