Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or a filesystem and store it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

MongoDB E11000 duplicate key error in MongoCrawlDataStore.processedToCached() #524

Closed: rustyx closed this issue 5 years ago

rustyx commented 6 years ago

I'm using MongoCrawlDataStore.

Whenever I restart a previously finished job (start/resume doesn't make a difference), I randomly get this exception:

com.mongodb.MongoBulkWriteException: Bulk write operation error on server localhost:27017. Write errors: [BulkWriteError{index=0, code=11000, message='E11000 duplicate key error collection: test.cached index: _id_ dup key: { : ObjectId('5ba0bc25468d62893fc3e700') }', details={ }}].

Example stack trace (there are other variants, all inside MongoCrawlDataStore.processedToCached):

com.mongodb.MongoBulkWriteException: Bulk write operation error on server localhost:27017. Write errors: [BulkWriteError{index=0, code=11000, message='E11000 duplicate key error collection: test.cached index: _id_ dup key: { : ObjectId('5ba0bc25468d62893fc3e700') }', details={ }}]. 
    at com.mongodb.connection.BulkWriteBatchCombiner.getError(BulkWriteBatchCombiner.java:176)
    at com.mongodb.connection.BulkWriteBatchCombiner.throwOnError(BulkWriteBatchCombiner.java:205)
    at com.mongodb.connection.BulkWriteBatchCombiner.getResult(BulkWriteBatchCombiner.java:146)
    at com.mongodb.operation.BulkWriteBatch.getResult(BulkWriteBatch.java:219)
    at com.mongodb.operation.MixedBulkWriteOperation.executeBulkWriteBatch(MixedBulkWriteOperation.java:265)
    at com.mongodb.operation.MixedBulkWriteOperation.access$700(MixedBulkWriteOperation.java:65)
    at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:198)
    at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:189)
    at com.mongodb.operation.OperationHelper.withReleasableConnection(OperationHelper.java:433)
    at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:189)
    at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:65)
    at com.mongodb.Mongo$3.execute(Mongo.java:837)
    at com.mongodb.MongoCollectionImpl.executeInsertMany(MongoCollectionImpl.java:552)
    at com.mongodb.MongoCollectionImpl.insertMany(MongoCollectionImpl.java:524)
    at com.mongodb.MongoCollectionImpl.insertMany(MongoCollectionImpl.java:519)
    at com.norconex.collector.core.data.store.impl.mongo.MongoCrawlDataStore.processedToCached(MongoCrawlDataStore.java:305)
    at com.norconex.collector.core.data.store.impl.mongo.MongoCrawlDataStore.<init>(MongoCrawlDataStore.java:152)
    at com.norconex.collector.core.data.store.impl.mongo.MongoCrawlDataStore.<init>(MongoCrawlDataStore.java:106)
    at com.norconex.collector.core.data.store.impl.mongo.AbstractMongoCrawlDataStoreFactory.createCrawlDataStore(AbstractMongoCrawlDataStoreFactory.java:153)
    at com.norconex.collector.core.crawler.AbstractCrawler.createCrawlDataStore(AbstractCrawler.java:243)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:204)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
    at com.norconex.jef4.job.group.AsyncJobGroup.runJob(AsyncJobGroup.java:119)
    at com.norconex.jef4.job.group.AsyncJobGroup.access$000(AsyncJobGroup.java:44)
    at com.norconex.jef4.job.group.AsyncJobGroup$1.run(AsyncJobGroup.java:86)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
essiembre commented 6 years ago

Do you crawl a huge number of documents? And does this error affect the processing of the affected documents?

Can you share your config so I can try to reproduce it?

rustyx commented 6 years ago

We just have a couple of hundred pages, but the site is in two languages (controlled by a cookie), so we have two crawlers working on the same site in parallel, one per language. Maybe that's what's causing the issue. It certainly affects processing, because the crawler just dies.

Here's my attempt at reproducing it; it doesn't fail every time, but sometimes it does:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($filter = "${core}.filter.impl")

  <progressDir>./temp/progress</progressDir>
  <logsDir>./temp/logs</logsDir>
  <parseErrorsSaveDir>./temp/parse-errors</parseErrorsSaveDir>

  <crawlerDefaults>
    <workDir>./temp</workDir>
    <numThreads>4</numThreads>
    <delay class="${http}.delay.impl.GenericDelayResolver" default="0" />
    <robotsTxt ignore="true" />
    <recrawlableResolver class="${http}.recrawl.impl.GenericRecrawlableResolver">
      <minFrequency applyTo="reference" value="1">.*</minFrequency>
    </recrawlableResolver>
    <crawlDataStoreFactory class="${http}.data.store.impl.mongo.MongoCrawlDataStoreFactory">
      <dbname>test</dbname>
    </crawlDataStoreFactory>
  </crawlerDefaults>

  <crawlers>

    <crawler id="test-en">
      <startURLs>
        <url>https://rustyx.org/temp/en/home</url>
      </startURLs>
      <referenceFilters>
        <filter class="${filter}.RegexReferenceFilter">https://rustyx.org/temp/.*</filter>
        <filter class="${filter}.RegexReferenceFilter" onMatch="exclude">https://rustyx.org/temp/nl/.*</filter>
      </referenceFilters>
      <urlNormalizer class="${http}.url.impl.GenericURLNormalizer">
        <replacements>
          <replace>
            <match>^(https://rustyx.org/temp/)(home.*)</match>
            <replacement>$1en/$2</replacement>
          </replace>
        </replacements>
      </urlNormalizer>
    </crawler>

    <crawler id="test-nl">
      <startURLs>
        <url>https://rustyx.org/temp/nl/home</url>
      </startURLs>
      <referenceFilters>
        <filter class="${filter}.RegexReferenceFilter">https://rustyx.org/temp/.*</filter>
        <filter class="${filter}.RegexReferenceFilter" onMatch="exclude">https://rustyx.org/temp/en/.*</filter>
      </referenceFilters>
      <urlNormalizer class="${http}.url.impl.GenericURLNormalizer">
        <replacements>
          <replace>
            <match>^(https://rustyx.org/temp/)(home.*)</match>
            <replacement>$1nl/$2</replacement>
          </replace>
        </replacements>
      </urlNormalizer>
    </crawler>

  </crawlers>

</httpcollector>
rustyx commented 6 years ago

Even with a single crawler, it sometimes fails, this time in MongoCrawlDataStore.queue():

com.mongodb.MongoWriteException: E11000 duplicate key error collection: test.references index: reference_1 dup key: { : "https://rustyx.org/temp/en/home" }
    at com.mongodb.MongoCollectionImpl.executeSingleWriteRequest(MongoCollectionImpl.java:1033)
    at com.mongodb.MongoCollectionImpl.executeUpdate(MongoCollectionImpl.java:1015)
    at com.mongodb.MongoCollectionImpl.updateOne(MongoCollectionImpl.java:635)
    at com.norconex.collector.core.data.store.impl.mongo.MongoCrawlDataStore.queue(MongoCrawlDataStore.java:186)
    at com.norconex.collector.core.pipeline.queue.QueueReferenceStage.execute(QueueReferenceStage.java:57)
    at com.norconex.collector.core.pipeline.queue.QueueReferenceStage.execute(QueueReferenceStage.java:1)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeQueuePipeline(HttpCrawler.java:280)
    at com.norconex.collector.http.crawler.HttpCrawler.beforeFinalizeDocumentProcessing(HttpCrawler.java:432)
    at com.norconex.collector.core.crawler.AbstractCrawler.finalizeDocumentProcessing(AbstractCrawler.java:640)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:557)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:820)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
essiembre commented 6 years ago

Are you not using a Committer as well? One thing I noticed is that both your crawlers point to the same Mongo database. Make sure they cannot overwrite each other by moving the Mongo config under each crawler, each with a different database name. To be safe, I would do the same with the workDir.
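
For example, something like this (the test_en/test_nl database names and per-crawler work directories are only suggestions; adjust them to your setup):

<crawler id="test-en">
  <workDir>./temp/test-en</workDir>
  <crawlDataStoreFactory class="${http}.data.store.impl.mongo.MongoCrawlDataStoreFactory">
    <dbname>test_en</dbname>
  </crawlDataStoreFactory>
  <!-- rest of the test-en crawler config unchanged -->
</crawler>

<crawler id="test-nl">
  <workDir>./temp/test-nl</workDir>
  <crawlDataStoreFactory class="${http}.data.store.impl.mongo.MongoCrawlDataStoreFactory">
    <dbname>test_nl</dbname>
  </crawlDataStoreFactory>
  <!-- rest of the test-nl crawler config unchanged -->
</crawler>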

rustyx commented 5 years ago

Yes, I'm using the Elasticsearch Committer, but that's irrelevant to this issue: it is reproducible without any committer.

It seems to be a known MongoDB issue, SERVER-14322: when two clients concurrently upsert a document with the same unique key, one of them can get a duplicate key error, as if upsert were false. It is not considered a bug (I disagree with that), and the suggested "solution" is to add a retry.
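
The retry workaround would look roughly like this (a sketch against the plain MongoDB Java driver rather than the actual Norconex code; the collection name, field names, and values are placeholders based on the error above):

import com.mongodb.MongoClient;
import com.mongodb.MongoWriteException;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class UpsertRetryExample {

    private static final int DUPLICATE_KEY_ERROR = 11000;

    // Upserts a reference, retrying once when a concurrent upsert on the
    // same unique key loses the race and fails with E11000 (SERVER-14322).
    static void upsertReference(MongoCollection<Document> col, String reference) {
        for (int attempt = 0; ; attempt++) {
            try {
                col.updateOne(
                        Filters.eq("reference", reference),
                        // "stage"/"QUEUED" are placeholder fields, not Norconex's actual schema
                        Updates.set("stage", "QUEUED"),
                        new UpdateOptions().upsert(true));
                return;
            } catch (MongoWriteException e) {
                boolean duplicateKey = e.getError().getCode() == DUPLICATE_KEY_ERROR;
                if (!duplicateKey || attempt >= 1) {
                    throw e; // a different error, or the retry also failed
                }
                // On retry the document now exists, so the upsert takes the
                // update path and no conflicting insert is attempted.
            }
        }
    }

    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        try {
            MongoCollection<Document> col =
                    client.getDatabase("test").getCollection("references");
            upsertReference(col, "https://rustyx.org/temp/en/home");
        } finally {
            client.close();
        }
    }
}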