Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

ConnectTimeoutException => crawling is stopped? #305

Closed liar666 closed 7 years ago

liar666 commented 7 years ago

Hi, I've been running a crawler for days now. Apparently, a timeout occurred on one page and the crawler has been stopped for more than 2 hours...

Is that the expected/normal behaviour? Isn't the problematic URL supposed to be put back in the queue so the crawler can continue its job?

For more information, you'll find an extract of the log below:

  Oct 20, 2016 12:02:16 PM ....
  INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: http://www.freepatentsonline.com/2529443.html
  ERROR [GenericDocumentFetcher] Cannot fetch document: http://www.freepatentsonline.com/2061740.html (Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out)
  INFO [CrawlerEventManager] REJECTED_ERROR: http://www.freepatentsonline.com/2061740.html
  ERROR [AbstractCrawler] Freepatentsonline: Could not process document: http://www.freepatentsonline.com/2061740.html (org.apache.http.conn.ConnectTimeoutException: Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out)
  com.norconex.collector.core.CollectorException: org.apache.http.conn.ConnectTimeoutException: Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out
      at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:171)
      at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:42)
      at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
      at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
      at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
      at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:300)
      at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:488)
      at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:378)
      at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:736)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
  Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to www.freepatentsonline.com:80 [www.freepatentsonline.com/144.202.252.20] failed: connect timed out
      at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:150)
      at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
      at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
      at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
      at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
      at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
      at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
      at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
      at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
      at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
      at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
      at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:110)
      ... 11 more
  Caused by: java.net.SocketTimeoutException: connect timed out
      at java.net.PlainSocketImpl.socketConnect(Native Method)
      at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
      at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
      at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
      at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      at java.net.Socket.connect(Socket.java:589)
      at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:74)
      at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)
      ... 22 more

It's now: Oct 20, 2016 14:16:56 PM ....

essiembre commented 7 years ago

It probably hangs because the thread with the stalled connection keeps waiting for it to return. I recommend you try explicitly configuring the timeout settings on the HTTP connections being made. There are a handful of timeout-related configuration options in GenericHttpClientFactory. Here they are (to put in your crawler config):

  <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
      <connectionTimeout>(milliseconds)</connectionTimeout>
      <socketTimeout>(milliseconds)</socketTimeout>
      <connectionRequestTimeout>(milliseconds)</connectionRequestTimeout>
      <maxConnectionIdleTime>(milliseconds)</maxConnectionIdleTime>
      <maxConnectionInactiveTime>(milliseconds)</maxConnectionInactiveTime>
  </httpClientFactory>
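
For illustration only, here is the same block with the placeholders filled in. The numbers are arbitrary assumptions chosen for the sketch, not values recommended in this thread or defaults of the collector, so tune them to the site being crawled. The comments on the first three options assume they behave like the Apache HttpClient settings of the same name.

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
    <!-- Assumed value: give up if a connection cannot be established within 30 seconds. -->
    <connectionTimeout>30000</connectionTimeout>
    <!-- Assumed value: give up if an open connection delivers no data for 30 seconds. -->
    <socketTimeout>30000</socketTimeout>
    <!-- Assumed value: give up waiting for a pooled connection after 30 seconds. -->
    <connectionRequestTimeout>30000</connectionRequestTimeout>
    <!-- Assumed values: illustrative only. -->
    <maxConnectionIdleTime>10000</maxConnectionIdleTime>
    <maxConnectionInactiveTime>10000</maxConnectionInactiveTime>
</httpClientFactory>
```

With finite values like these, a stalled connection should fail with an error instead of blocking its crawler thread indefinitely.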

Please confirm whether that makes a difference.

essiembre commented 7 years ago

Closing due to lack of feedback.