Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

HttpImporterPipeline fails to run stage HttpMetadataChecksumStage(false) #66

Closed by leonardsaers 9 years ago

leonardsaers commented 9 years ago

I am seeing strange behaviour: a page is added for indexing when it is new, but deleted if it has been crawled before.

The expected behaviour is to skip indexing if the page is unmodified, or to re-index it if it has changed since the last crawl.

I get the following two log messages, which I believe are written when the decision to wrongfully delete the page is made:

DEBUG [Pipeline] Unsuccessful stage execution: com.norconex.collector.http.pipeline.importer.HttpMetadataChecksumStage@538638d9
INFO  [CrawlerEventManager]           REJECTED_IMPORT: http://regler.uu.se/Listsida/?kategoriId=99 (Subject: none)

Uncommenting addStage(new HttpMetadataChecksumStage(false)); in HttpImporterPipeline solves the problem, but re-indexing of modified pages might then be missed.

    public HttpImporterPipeline(boolean isKeepDownloads) {
        addStage(new DelayResolverStage());

        // When HTTP headers are fetched (HTTP "HEAD") before document:
        addStage(new HttpMetadataFetcherStage());
        addStage(new HttpMetadataFiltersHEADStage());
        addStage(new HttpMetadataChecksumStage(true));

        // HTTP "GET" and onward:
        addStage(new DocumentFetcherStage());
        if (isKeepDownloads) {
            addStage(new SaveDocumentStage());
        }
        addStage(new RobotsMetaCreateStage());
        addStage(new LinkExtractorStage());
        addStage(new RobotsMetaNoIndexStage());
        addStage(new HttpMetadataFiltersGETStage());
        //addStage(new HttpMetadataChecksumStage(false));
        addStage(new DocumentFiltersStage());
        addStage(new DocumentPreProcessingStage());        
        addStage(new ImportModuleStage());        
    }
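For context, importer pipeline stages such as the ones above typically return a boolean, and returning false aborts the pipeline for that document. A minimal, hypothetical sketch of how a metadata checksum stage could decide whether to continue (this is illustrative only, not actual Norconex internals; the class and method names are invented):

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical, simplified stand-in for a checksum stage: compare the
// current checksum against the one stored from the previous crawl and
// reject (return false) when the document is unchanged.
class MetadataChecksumStage {
    private final Map<String, String> previousChecksums; // keyed by URL

    MetadataChecksumStage(Map<String, String> previousChecksums) {
        this.previousChecksums = previousChecksums;
    }

    boolean execute(String url, String currentChecksum) {
        String previous = previousChecksums.get(url);
        if (Objects.equals(previous, currentChecksum)) {
            return false; // unmodified: stop the pipeline, skip re-import
        }
        previousChecksums.put(url, currentChecksum);
        return true; // new or modified: continue importing
    }
}
```

Under this model, a stage that wrongly reports "unmodified" would cause exactly the REJECTED_IMPORT seen in the log above.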
essiembre commented 9 years ago

I have not witnessed this exact behavior, but while trying to reproduce it I reached a point where a document was detected as "unmodified" on a subsequent run, yet as new or modified when I ran it one more time (it should always have been unmodified in my case). So something is definitely suspicious here, and it may be related to your issue as well.

I never got any deletions sent to the committer upon getting REJECTED_IMPORT, though. A copy of your config might help reproduce that issue.

In the meantime, the <metadataChecksummer> tag is missing a "disable" flag, but leaving sourceField blank achieves the same effect:

      <metadataChecksummer 
              class="com.norconex.collector.http.checksum.impl.HttpMetadataChecksummer"
              sourceField="" />
essiembre commented 9 years ago

Fix is in 2.1.0-SNAPSHOT.

leonardsaers commented 9 years ago

I still have the same behaviour using 2.1.0-SNAPSHOT. I will try to dig deeper into the exact problem.

essiembre commented 9 years ago

After upgrading, did you start fresh (deleting all generated files)? Also, did you do a clean install or an overwrite? If an overwrite, watch for duplicate jars in the lib folder.

essiembre commented 9 years ago

FYI, a new snapshot was just released that may address this issue.

leonardsaers commented 9 years ago

Nice, I will try it out, and if the problem persists I will examine it more deeply.

leonardsaers commented 9 years ago

The latest snapshot works fine. However, I had to upgrade Java to version 8.

I get the following exception when running the latest SNAPSHOT on OpenJDK 7:

java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:247)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:208)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:170)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:351)
    at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:301)
    at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:171)
    at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:116)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:69)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
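This NoSuchMethodError arises because Java 8 changed the return type of `ConcurrentHashMap.keySet()` from `Set<K>` to the new `ConcurrentHashMap.KeySetView<K,V>`. Bytecode compiled against JDK 8 records that new method descriptor, which does not exist in a Java 7 runtime, even though the same source compiles on both. A minimal sketch of the incompatibility and the portable workaround (the class name here is illustrative, not Norconex code):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class KeySetCompat {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        map.put("a", 1);

        // When compiled on JDK 8, this call is recorded in bytecode as
        // returning ConcurrentHashMap$KeySetView, a type Java 7 lacks:
        // Set<String> keys = map.keySet(); // NoSuchMethodError on JRE 7

        // Portable form: call through the Map interface so the bytecode
        // references Map.keySet()Ljava/util/Set;, which exists in both.
        Set<String> keys = ((Map<String, Integer>) map).keySet();
        System.out.println(keys.contains("a")); // prints "true"
    }
}
```

Compiling with `-source 1.7 -target 1.7` alone does not avoid this; the compile-time classpath must expose the Java 7 signature (e.g. via a Java 7 bootclasspath) or the code must use the Map-interface form above.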
essiembre commented 9 years ago

Not good. We want to support Java versions at least one release behind the current one. I'll track this as a bug in a new issue.

essiembre commented 9 years ago

Norconex HTTP Collector 2.1.0 was released. Closing.