Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Using with MongoDB cluster #589

Closed FcrbPeter closed 5 years ago

FcrbPeter commented 5 years ago

Hi,

I am using MongoDB as the datastore for the crawler. I have two questions about this.

  1. The crawler randomly throws a duplicate key error. I am wondering if it is because there are a lot of crawling threads (around 100). Is there any workaround?

The detailed exception:

com.mongodb.MongoWriteException: E11000 duplicate key error collection: webcrawler-ec.references index: reference_1 dup key: { : "THE URL" }
    at com.mongodb.MongoCollectionImpl.executeSingleWriteRequest(MongoCollectionImpl.java:558)
    at com.mongodb.MongoCollectionImpl.update(MongoCollectionImpl.java:542)
    at com.mongodb.MongoCollectionImpl.updateOne(MongoCollectionImpl.java:381)
    at com.norconex.collector.core.data.store.impl.mongo.MongoCrawlDataStore.queue(MongoCrawlDataStore.java:186)
    at com.norconex.collector.core.pipeline.queue.QueueReferenceStage.execute(QueueReferenceStage.java:57)
    at com.norconex.collector.core.pipeline.queue.QueueReferenceStage.execute(QueueReferenceStage.java:29)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.queueURL(LinkExtractorStage.java:159)
    at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:90)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:820)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
  2. I am using about 6 crawlers with around 100 threads each (delay: 300), so there would be about 600 connections with high I/O traffic. What if I use a MongoDB cluster? Are there any pros to using a MongoDB cluster?

Thanks

essiembre commented 5 years ago

Are you using the same DB for all 6 of your crawlers? Each needs a distinct DB to avoid the duplicate key error.
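
A minimal sketch of what a per-crawler data store configuration might look like, assuming the 2.x MongoDB crawl data store factory (the class name, host, and database names below are illustrative; adapt them to your setup):

```xml
<!-- Inside one crawler's configuration; repeat per crawler. -->
<crawlDataStoreFactory
    class="com.norconex.collector.http.data.store.impl.mongo.MongoCrawlDataStoreFactory">
  <host>mongo-0.mongo</host>
  <port>27017</port>
  <!-- Give every crawler its own database so their unique
       "reference" indexes never collide across crawlers. -->
  <dbname>webcrawler-a</dbname>
</crawlDataStoreFactory>
```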

FcrbPeter commented 5 years ago

Thanks for replying.

I am using one MongoDB server, but it is separated into 6 databases, one to store each collector's data.

Is that OK?

essiembre commented 5 years ago

Yes, that should be fine, although that is quite a lot of threads. What version of the HTTP Collector are you using? The snapshot version has a MongoDB fix. I am not sure it resolves this issue, but it is worth a try.

The solution should be thread safe, but maybe you found a spot where it is not. Can you provide a way to reproduce the error at will?

You could reduce the threads to see if it still occurs. Otherwise, I would check if there is any negative impact. If it fails to add a URL that was already added because another thread just did it, chances are it won't affect anything. Other than getting these exceptions in your log, are you experiencing a loss of data or other unexpected failures?
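
For reference, thread count and delay are set per crawler in the XML configuration; a minimal sketch with illustrative values (the delay is in milliseconds):

```xml
<crawler id="Crawler A">
  <!-- Fewer threads and a longer delay reduce the number of
       concurrent writes hitting MongoDB at any one time. -->
  <numThreads>10</numThreads>
  <delay default="300" />
</crawler>
```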

FcrbPeter commented 5 years ago

I am using version 2.8.1. Should I switch to the current snapshot version?

I have tried reducing the threads. The error still occurs, but less frequently than before.

In the meantime, exceptions started occurring on the connection to MongoDB, and the MongoDB server went down. The crawlers can reconnect to MongoDB after it restarts, but they decided to stop due to the "An error occured that could compromise the stability of the crawler" message.

Here is some reference information about the crawlers:

There are 6 crawlers, separated into 6 programs with 2 sets of configs.

Config A - more threads
Crawler A: one crawler with 20 threads and 200 delay
Crawler B: one crawler with 20 threads and 200 delay
Crawler C: one crawler with 3 threads and 1200 delay
Crawler D: one crawler with 50 threads and 200 delay
Crawler E: one crawler with 50 threads and 200 delay
Crawler F: one crawler with 50 threads and 200 delay

Config B - fewer threads
Crawler A: one crawler with 10 threads and 300 delay
Crawler B: one crawler with 10 threads and 300 delay
Crawler C: one crawler with 3 threads and 1200 delay
Crawler D: one crawler with 20 threads and 200 delay
Crawler E: one crawler with 10 threads and 200 delay
Crawler F: one crawler with 30 threads and 200 delay

Here is the log from when MongoDB went down:

WARN  [LinkExtractorStage] Could not queue extracted URL "{{ THE URL }}".
com.mongodb.MongoSocketReadException: Exception receiving message
        at com.mongodb.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:463)
        at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:214)
        at com.mongodb.connection.UsageTrackingInternalConnection.receiveMessage(UsageTrackingInternalConnection.java:96)
        at com.mongodb.connection.DefaultConnectionPool$PooledConnection.receiveMessage(DefaultConnectionPool.java:438)
        at com.mongodb.connection.CommandProtocol.execute(CommandProtocol.java:105)
        at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159)
        at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:289)
        at com.mongodb.connection.DefaultServerConnection.command(DefaultServerConnection.java:176)
        at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:216)
        at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:207)
        at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:113)
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:715)
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:709)
        at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:433)
        at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:406)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:709)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:81)
        at com.mongodb.Mongo.execute(Mongo.java:810)
        at com.mongodb.Mongo$2.execute(Mongo.java:797)
        at com.mongodb.FindIterableImpl$FindOperationIterable.first(FindIterableImpl.java:273)
        at com.mongodb.FindIterableImpl.first(FindIterableImpl.java:205)
        at com.norconex.collector.core.data.store.impl.mongo.MongoCrawlDataStore.isStage(MongoCrawlDataStore.java:279)
        at com.norconex.collector.core.data.store.impl.mongo.MongoCrawlDataStore.isActive(MongoCrawlDataStore.java:214)
        at com.norconex.collector.core.pipeline.queue.QueueReferenceStage.execute(QueueReferenceStage.java:50)
        at com.norconex.collector.core.pipeline.queue.QueueReferenceStage.execute(QueueReferenceStage.java:29)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.queueURL(LinkExtractorStage.java:159)
        at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:90)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:820)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:209)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at com.mongodb.connection.SocketStream.read(SocketStream.java:84)
        at com.mongodb.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:474)
        at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:210)
        ... 36 more
INFO  [SLF4JLogger] Exception in monitor thread while connecting to server mongo-0.mongo:27017
com.mongodb.MongoSocketException: mongo-0.mongo: Name or service not known
        at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:188)
        at com.mongodb.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:59)
        at com.mongodb.connection.SocketStream.open(SocketStream.java:57)
        at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:107)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:125)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: mongo-0.mongo: Name or service not known
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
        at java.net.InetAddress.getAllByName(InetAddress.java:1192)
        at java.net.InetAddress.getAllByName(InetAddress.java:1126)
        at java.net.InetAddress.getByName(InetAddress.java:1076)
        at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:186)
        ... 5 more
INFO  [SLF4JLogger] Exception in monitor thread while connecting to server mongo-0.mongo:27017
com.mongodb.MongoSocketException: mongo-0.mongo
        at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:188)
        at com.mongodb.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:59)
        at com.mongodb.connection.SocketStream.open(SocketStream.java:57)
        at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:107)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:111)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: mongo-0.mongo
        at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
        at java.net.InetAddress.getAllByName(InetAddress.java:1192)
        at java.net.InetAddress.getAllByName(InetAddress.java:1126)
        at java.net.InetAddress.getByName(InetAddress.java:1076)
        at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:186)
        ... 5 more
FATAL [AbstractCrawler$ProcessReferencesRunnable] webcrawler: An error occured that could compromise the stability of the crawler. Stopping excution to avoid further issues...

essiembre commented 5 years ago

It is worth trying the latest snapshot, yes.

For MongoDB not responding, it seems like your installation cannot keep up. Short of boosting your MongoDB setup or being more "gentle" on it with a different crawler configuration, you should be able to use resume as the launch action instead of start to have the crawl pick up where it stopped once your MongoDB is back online.
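
For example, assuming the standard 2.x launch script and an illustrative config path:

```sh
# Normal fresh start:
collector-http.sh -a start -c /path/to/your-config.xml

# After MongoDB is back online, resume the stopped crawl:
collector-http.sh -a resume -c /path/to/your-config.xml
```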

FcrbPeter commented 5 years ago

Thanks for replying and the suggestions.

I have one more question, about connecting to a MongoDB Replica Set. Is that currently possible? If not, may I file a feature request for it?

essiembre commented 5 years ago

I am not sure about Replica Sets. I invite you to give it a try. If it does not work, feature requests are always welcome. Pull requests even more. ;-)

FcrbPeter commented 5 years ago

I found that it works with a Replica Set and even a Shard Cluster. Thanks!
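
For anyone landing here later: a sharded cluster is normally addressed through a single mongos router, which likely explains why the plain host/port configuration works; a sketch under that assumption (class name and host are illustrative):

```xml
<crawlDataStoreFactory
    class="com.norconex.collector.http.data.store.impl.mongo.MongoCrawlDataStoreFactory">
  <!-- mongos is the standard single entry point to a shard
       cluster, so one host/port can reach the whole cluster. -->
  <host>mongos.example.internal</host>
  <port>27017</port>
  <dbname>webcrawler-a</dbname>
</crawlDataStoreFactory>
```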