ConnectTimeoutException => crawling is stopped?

liar666 commented 7 years ago

Hi, I've been running a crawler for days now. Apparently a timeout occurred on one page, and the crawler has now been stopped for more than 2 hours...

Is that the expected/normal behaviour? Isn't the problematic URL supposed to be put back in the queue so that the crawler can continue its job?

For more information, you'll find an extract of the log below:

  Oct 20, 2016 12:02:16 PM ....
  INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: http://www.freepatentsonline.com/2529443.html
  ERROR [GenericDocumentFetcher] Cannot fetch document: http://www.freepatentsonline.com/2061740.html (Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out)
  INFO [CrawlerEventManager] REJECTED_ERROR: http://www.freepatentsonline.com/2061740.html
  ERROR [AbstractCrawler] Freepatentsonline: Could not process document: http://www.freepatentsonline.com/2061740.html (org.apache.http.conn.ConnectTimeoutException: Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out)
  com.norconex.collector.core.CollectorException: org.apache.http.conn.ConnectTimeoutException: Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out
      at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:171)
      at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:42)
      at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
      at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
      at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
      at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:300)
      at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:488)
      at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:378)
      at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:736)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
  Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out
      at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:150)
      at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
      at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
      at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
      at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
      at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
      at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
      at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
      at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
      at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
      at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
      at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:110)
      ... 11 more
  Caused by: java.net.SocketTimeoutException: connect timed out
      at java.net.PlainSocketImpl.socketConnect(Native Method)
      at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
      at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
      at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
      at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      at java.net.Socket.connect(Socket.java:589)
      at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:74)
      at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)
      ... 22 more

It's now: Oct 20, 2016 14:16:56 PM ....

essiembre commented 7 years ago

It probably hangs because the thread with the stalled connection keeps waiting for it to return. I recommend you try explicitly configuring timeout settings on the HTTP connections being made. There are a handful of timeout-related configuration options in GenericHttpClientFactory. Here they are (to put in your crawler config):

  <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
      <connectionTimeout>(milliseconds)</connectionTimeout>
      <socketTimeout>(milliseconds)</socketTimeout>
      <connectionRequestTimeout>(milliseconds)</connectionRequestTimeout>
      <maxConnectionIdleTime>(milliseconds)</maxConnectionIdleTime>
      <maxConnectionInactiveTime>(milliseconds)</maxConnectionInactiveTime>
  </httpClientFactory>
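
For illustration only, here is the same block with the placeholders filled in. The numbers are arbitrary assumptions chosen to show the format, not recommended or default values; tune them to the site being crawled.

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
    <!-- All values are in milliseconds and are illustrative assumptions. -->
    <connectionTimeout>30000</connectionTimeout>
    <socketTimeout>30000</socketTimeout>
    <connectionRequestTimeout>30000</connectionRequestTimeout>
    <maxConnectionIdleTime>10000</maxConnectionIdleTime>
    <maxConnectionInactiveTime>10000</maxConnectionInactiveTime>
</httpClientFactory>
```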

Please confirm whether that makes a difference.
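
To see in isolation why a thread can sit on a stalled connection, here is a minimal standalone sketch using only `java.net.Socket`. It is not Norconex or Apache HttpClient code, and the class and method names (`ReadTimeoutDemo`, `readTimesOut`) are made up for the example. It connects to a local server that accepts the connection but never sends anything, then shows that a read gives up only because a read timeout was set. The assumption here is that a socket timeout setting plays the same role for the crawler's connections.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {

    /**
     * Connects to a local server that never sends data and attempts one read.
     * Returns true if the read was abandoned with a SocketTimeoutException.
     */
    static boolean readTimesOut(int connectTimeoutMs, int readTimeoutMs) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port; the server never writes to clients.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            // Bounds how long establishing the connection may take.
            client.connect(
                    new InetSocketAddress(loopback, silentServer.getLocalPort()),
                    connectTimeoutMs);
            // Bounds how long a single read may block; 0 would mean "wait forever".
            client.setSoTimeout(readTimeoutMs);
            try {
                client.getInputStream().read();
                return false; // data or end-of-stream arrived before the timeout
            } catch (SocketTimeoutException e) {
                return true;  // the read was abandoned after readTimeoutMs
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(1000, 200));
    }
}
```

With a read timeout of 0 the same `read()` call would block indefinitely, which is the kind of stall a worker thread cannot recover from on its own.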

essiembre commented 7 years ago

Closing due to lack of feedback.
