Closed FcrbPeter closed 5 years ago
Hi,
I am using MongoDB as the datastore for the crawler. I have two questions about it.
I randomly get a "duplicate key error", and I am wondering if it is because there are lots of crawling threads (around 100). Is there any workaround? The detailed exception:
Thanks
Are you using the same DB for all your 6 crawlers? Each needs a distinct DB to avoid the duplicate key error.
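For reference, here is a minimal sketch of the relevant data store section of a 2.x crawler config, with one distinct dbname per crawler. The class and field names are from memory of the 2.x docs and the host/database values are placeholders, so verify against your version:

<crawlDataStoreFactory class="com.norconex.collector.http.data.store.impl.mongo.MongoCrawlDataStoreFactory">
  <!-- The same MongoDB server can be shared by every crawler... -->
  <host>mongo-0.mongo</host>
  <port>27017</port>
  <!-- ...but each crawler needs its own database name (crawlerA, crawlerB, ...). -->
  <dbname>crawlerA</dbname>
</crawlDataStoreFactory>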
Thanks for replying.
I am using one MongoDB server, separated into 6 databases, so each collector stores its data in its own database.
Is that OK?
Yes, that should be fine then, although that is quite a lot of threads. What version of HTTP Collector are you using? The snapshot version has a MongoDB fix. I am not sure it resolves this issue, but it is worth a try.
The solution should be thread safe, but maybe you found a spot where it is not. Can you provide a way to reproduce the error at will?
You could reduce the threads to see if it still occurs. Otherwise, I would check if there is any negative impact. If it fails to add a URL that was already added because another thread just did it, chances are it won't affect anything. Other than getting these exceptions in your log, are you experiencing a loss of data or other unexpected failures?
I am using version 2.8.1. Should I change the version to the current snapshot version?
I have tried reducing the threads. The error still occurs, but less often than before.
In the meantime, exceptions started occurring on the connection to MongoDB, and the MongoDB server went dead. The crawlers can reconnect to MongoDB after it restarts, but they decide to stop because of the "An error occured that could compromise the stability of the crawler" message.
Some details about the crawlers:
There are 6 crawlers, separated into 6 programs with 2 sets of configs.
Config A - more threads
Crawler A: 20 threads, 200 ms delay
Crawler B: 20 threads, 200 ms delay
Crawler C: 3 threads, 1200 ms delay
Crawler D: 50 threads, 200 ms delay
Crawler E: 50 threads, 200 ms delay
Crawler F: 50 threads, 200 ms delay
Config B - fewer threads
Crawler A: 10 threads, 300 ms delay
Crawler B: 10 threads, 300 ms delay
Crawler C: 3 threads, 1200 ms delay
Crawler D: 20 threads, 200 ms delay
Crawler E: 10 threads, 200 ms delay
Crawler F: 30 threads, 200 ms delay
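For context, those two knobs map onto the crawler settings below in a 2.x config. This is a sketch using the Config B, Crawler A values, not my actual file; the delay is in milliseconds:

<crawler id="crawlerA">
  <numThreads>10</numThreads>
  <delay default="300" />
  ...
</crawler>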
WARN [LinkExtractorStage] Could not queue extracted URL "{{ THE URL }}".
com.mongodb.MongoSocketReadException: Exception receiving message
at com.mongodb.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:463)
at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:214)
at com.mongodb.connection.UsageTrackingInternalConnection.receiveMessage(UsageTrackingInternalConnection.java:96)
at com.mongodb.connection.DefaultConnectionPool$PooledConnection.receiveMessage(DefaultConnectionPool.java:438)
at com.mongodb.connection.CommandProtocol.execute(CommandProtocol.java:105)
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159)
at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:289)
at com.mongodb.connection.DefaultServerConnection.command(DefaultServerConnection.java:176)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:216)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:207)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:113)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:715)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:709)
at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:433)
at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:406)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:709)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:81)
at com.mongodb.Mongo.execute(Mongo.java:810)
at com.mongodb.Mongo$2.execute(Mongo.java:797)
at com.mongodb.FindIterableImpl$FindOperationIterable.first(FindIterableImpl.java:273)
at com.mongodb.FindIterableImpl.first(FindIterableImpl.java:205)
at com.norconex.collector.core.data.store.impl.mongo.MongoCrawlDataStore.isStage(MongoCrawlDataStore.java:279)
at com.norconex.collector.core.data.store.impl.mongo.MongoCrawlDataStore.isActive(MongoCrawlDataStore.java:214)
at com.norconex.collector.core.pipeline.queue.QueueReferenceStage.execute(QueueReferenceStage.java:50)
at com.norconex.collector.core.pipeline.queue.QueueReferenceStage.execute(QueueReferenceStage.java:29)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.queueURL(LinkExtractorStage.java:159)
at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:90)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:820)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:209)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at com.mongodb.connection.SocketStream.read(SocketStream.java:84)
at com.mongodb.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:474)
at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:210)
... 36 more
INFO [SLF4JLogger] Exception in monitor thread while connecting to server mongo-0.mongo:27017
com.mongodb.MongoSocketException: mongo-0.mongo: Name or service not known
at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:188)
at com.mongodb.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:59)
at com.mongodb.connection.SocketStream.open(SocketStream.java:57)
at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:107)
at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:125)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: mongo-0.mongo: Name or service not known
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:186)
... 5 more
INFO [SLF4JLogger] Exception in monitor thread while connecting to server mongo-0.mongo:27017
com.mongodb.MongoSocketException: mongo-0.mongo
at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:188)
at com.mongodb.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:59)
at com.mongodb.connection.SocketStream.open(SocketStream.java:57)
at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:107)
at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: mongo-0.mongo
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:186)
... 5 more
FATAL [AbstractCrawler$ProcessReferencesRunnable] webcrawler: An error occured that could compromise the stability of the crawler. Stopping excution to avoid further issues...
It is worth trying the latest snapshot, yes.
For MongoDB not responding, it seems like your installation cannot keep up. Short of boosting your MongoDB setup or being more "gentle" on it with a different crawler configuration, you should be able to use "resume" as the launch action instead of "start" to have it pick up where it stopped once your MongoDB is back online.
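For example, with the stock launch script (the config path here is just a placeholder):

collector-http.sh -a resume -c /path/to/your-config.xml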
Thanks for replying and for the suggestions.
I have one more question, about connecting to a MongoDB Replica Set. Is that currently possible? If not, may I make a feature request for it?
I am not sure about Replica Sets. I invite you to give it a try. If it does not work, feature requests are always welcome. Pull requests even more. ;-)
I found that it works with a Replica Set and even a Sharded Cluster. Thanks!
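In case it helps others: the standard MongoDB connection string for a replica set simply lists the members and names the set. The hosts, database, and set name below are placeholders:

mongodb://mongo-0.mongo:27017,mongo-1.mongo:27017,mongo-2.mongo:27017/crawlerA?replicaSet=rs0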