Closed leonardsaers closed 9 years ago
I have not witnessed this exact behavior, but in trying to reproduce, I got to a point where a document was detected as "unmodified" on a subsequent run, but new or modified when I run it one more time (when it should always have been unmodified in my case). So there is definitely something suspicious here, and it may be related to your issue as well.
I never got any deletions sent to the committer upon getting REJECTED_IMPORT though. So maybe a copy of your config could help reproduce that issue.
In the meantime, we are missing a "disable" flag on the <metadataChecksummer> tag, but leaving sourceField blank will achieve the same effect:
<metadataChecksummer
    class="com.norconex.collector.http.checksum.impl.HttpMetadataChecksummer"
    sourceField="" />
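For context, here is a sketch of where that workaround would sit in a full configuration. The surrounding element layout follows the usual Norconex HTTP Collector 2.x structure, and the id values are placeholders, not taken from the reporter's actual config:

```xml
<httpcollector id="my-collector">
  <crawlers>
    <crawler id="my-crawler">
      <!-- Leaving sourceField empty effectively disables the metadata
           checksum until a dedicated "disable" flag is available. -->
      <metadataChecksummer
          class="com.norconex.collector.http.checksum.impl.HttpMetadataChecksummer"
          sourceField="" />
    </crawler>
  </crawlers>
</httpcollector>
```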
Fix is in 2.1.0-SNAPSHOT.
I still have the same behaviour using 2.1.0-SNAPSHOT. I will try to dig deeper into the exact problem.
After upgrading, did you start fresh (deleted all generated files)? Also, did you do a clean install or an overwrite? If an overwrite, watch for duplicate jars in the lib folder.
Nice, I will try it out, and if the problem persists I will examine it more deeply.
The latest snapshot works fine. However, I had to upgrade Java to version 8.
I get the following exception when using the latest SNAPSHOT and OpenJDK 7:
java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:247)
at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:208)
at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:170)
at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:351)
at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:301)
at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:171)
at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:116)
at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:69)
at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
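This NoSuchMethodError is a known Java 7/8 compatibility trap, sketched below under the assumption that the library was compiled with JDK 8: Java 8 gave ConcurrentHashMap.keySet() a covariant KeySetView return type, so a call compiled against JDK 8 is bound to a method descriptor that does not exist on a Java 7 runtime. Declaring the variable as the Map interface binds the call to Map.keySet() instead, which is identical on both versions:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class KeySetDemo {
    public static void main(String[] args) {
        // Compiled with JDK 8, this call is bound to the JDK-8-only
        // ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
        // descriptor and throws NoSuchMethodError on a Java 7 JVM.
        ConcurrentHashMap<String, Integer> direct = new ConcurrentHashMap<>();
        direct.put("a", 1);
        Set<String> risky = direct.keySet();

        // Binding through the Map interface compiles to Map.keySet(),
        // which has the same descriptor on Java 7 and Java 8.
        Map<String, Integer> safe = new ConcurrentHashMap<>();
        safe.put("a", 1);
        Set<String> portable = safe.keySet();

        System.out.println(risky.equals(portable)); // both contain only "a"
    }
}
```

Building with JDK 7, or with JDK 8 plus `-source 1.7 -target 1.7` and a Java 7 `-bootclasspath`, avoids the problem entirely; `-source/-target` alone does not, because the compiler still links against the JDK 8 class library.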
Not good. We want to support Java versions at least one release behind the current one. I'll mark this as a bug in a new issue.
Norconex HTTP Collector 2.1.0 was released. Closing.
I have a strange behaviour where a page is added for indexing if it is new, but deleted if it has been crawled before.
The expected behaviour is to skip indexing if the page is unmodified, or index it if it has changed since the last crawl.
I get the following two log messages, which I believe are written when the crawler wrongly decides to delete the page:
Changing HttpImporterPipeline to uncomment addStage(new HttpMetadataChecksumStage(false)); solves the problem, but reindexing of changed pages might then be missed.