Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

`Read timed out` in some channels #118

Closed: schwipee closed this issue 9 years ago

schwipee commented 9 years ago

Hey, I have an issue with read timeouts. For some channels it works perfectly, but for others it does not. It only happens from time to time. There is no request in the log file for that time, so the crawler did not reach the URL. I really don't know where the problem is.

XXXXReplication: 2015-06-09 01:10:46 ERROR - Cannot fetch document: XXXXXXXXX/?SearchTerm=* (Read timed out)
XXXXReplication: 2015-06-09 01:10:46 ERROR - XXXXReplication: Could not process document: XXXXXXXXX/?SearchTerm=* (java.net.SocketTimeoutException: Read timed out)
com.norconex.collector.core.CollectorException: java.net.SocketTimeoutException: Read timed out
    at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:148)
    at com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DocumentFetcherStage.executeStage(HttpImporterPipeline.java:147)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)

Best regards

essiembre commented 9 years ago

There could be multiple causes. Maybe it is a sign that you cannot connect properly. Have you tried to connect via a browser running on the same machine where the collector runs? If that works, is your browser going through a proxy? See if you can connect to the URLs you are trying to crawl with command-line tools like wget or curl.

The URL fragment you pasted seems to be pointing to a search engine. Is it possible the search takes too long to return? If so, you may want to try increasing the timeout values by configuring GenericHttpClientFactory accordingly.
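
For example, a minimal sketch of such a timeout increase (the 60-second values are purely illustrative):

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <!-- Maximum time allowed to establish a connection (milliseconds). -->
  <connectionTimeout>60000</connectionTimeout>
  <!-- Maximum inactivity between two data packets on an open connection (milliseconds). -->
  <socketTimeout>60000</socketTimeout>
</httpClientFactory>
```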

If all these tests work and it only fails with the HTTP Collector, please paste the URL so we can try to reproduce.

schwipee commented 9 years ago

Hey essiembre, thanks for your help! After we increased the timeout, no read timeouts occur anymore. But we have another issue with timeouts: "Timeout waiting for connection from pool". It looks like this:

```
com.norconex.collector.core.CollectorException: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:148)
    at com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DocumentFetcherStage.executeStage(HttpImporterPipeline.java:147)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:94)
    ... 11 more
```

Do you know where it comes from? (Apache connection timeouts?)

Best regards

essiembre commented 9 years ago

One of the reasons may be an HTTP connection leak, but this is the first time it's being reported. How many threads are you using? Do you mind sharing your config so we can try to reproduce?

schwipee commented 9 years ago

Hey Pascal, we had 60 threads and 100 maxConnections. Our config looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
   Copyright 2010-2014 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->

<httpcollector id="Replication">
  #set($workdir = "/opt/crawler/output")
  <!-- Decide where to store generated files. -->
  <progressDir>$workdir/progress</progressDir>
  <logsDir>$workdir/logs</logsDir>

  <crawlerDefaults>    
    <!-- Filter BEFORE download with RobotsTxt rules. Classes must
     implement *.robot.IRobotsTxtProvider.  Default implementation
     is the following.
    -->
    <robotsTxt ignore="true" class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider"/>

    <!-- Establish whether to follow a page's URLs or to index a given page
     based on in-page meta tag robot information. Classes must implement 
     com.norconex.collector.http.robot.IRobotsMetaProvider.  
     Default implementation is the following.
    -->
    <robotsMeta ignore="true" class="com.norconex.collector.http.robot.impl.StandardRobotsMetaProvider" />
    <userAgent>Crawler</userAgent>
    <numThreads>60</numThreads>

    <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
    <maxDepth>3</maxDepth>

    <!-- Be as nice as you can to sites you crawl. -->
    <delay default="10" />

    <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
        <maxConnections>100</maxConnections>
        <connectionTimeout>300000</connectionTimeout>
    </httpClientFactory>    
  </crawlerDefaults>

  <crawlers>
    <crawler id="Replication">

      <!-- Where the crawler default directory to generate files is. -->
      <workDir>$workdir</workDir>

      <!-- Requires at least one start URL. -->
      <startURLs>
        #parse("/opt/crawler/conf.local/replicationgroups/Replication-starturls.xml")
      </startURLs>

      <!-- At a minimum make sure you stay on your domain. -->
      <referenceFilters>
        #parse("/opt/crawler/conf/shared-crawler-filters.xml")      
      </referenceFilters>   
      <linkExtractors>      
        <extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor">
            <!-- Which tags and attributes hold the URLs to extract -->
            <tags>
                <tag name="script" attribute="src" />
                <tag name="a" attribute="href" />
                <tag name="frame" attribute="src" />
                <tag name="iframe" attribute="src" />
                <tag name="img" attribute="src" />
                <tag name="meta" attribute="http-equiv" />
                <tag name="link" attribute="href" />
            </tags>
        </extractor>
      </linkExtractors>
    </crawler>
  </crawlers>
</httpcollector>

Best Peer

essiembre commented 9 years ago

That's a lot of threads! :-) I am curious to know what kind of machine you are using.

After some digging, it appears that sometimes the HTTP client is unable to detect that the server side closed the socket (usually unexpectedly). In that case the connection becomes stale and remains in the pool without the socket being closed (and can't be used by another request).

I suspect this is what's happening to you here. Since you are performing an aggressive crawl, it is very possible the client does not have time to realize a connection's socket is dead before making it available to another request again... so it opens a new connection instead, and you eventually run out.

Starting with 4.4, Apache HttpClient added a feature to validate whether a connection is stale before leasing it again after a certain period of inactivity. By default it does not perform that check. I will mark this ticket as a feature request to make this a configurable setting in the next release (along with other new features added in the most recent version of HttpClient).

In the meantime, I recommend you try setting a <socketTimeout> value so it will close an open socket after too long no matter what.
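
For instance, a sketch of what that could look like inside the crawler's httpClientFactory (the 30-second value is just an example):

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <!-- Give up on a request if the socket stays silent longer than this (milliseconds). -->
  <socketTimeout>30000</socketTimeout>
</httpClientFactory>
```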

All this being said, the above may not resolve your issue entirely because you may be doing too much in too little time. Potentially firing up to 60 requests every 10 milliseconds means that, if it takes a few seconds before a socket is detected as closed, you will soon have many more requests than you have free connections available (your existing timeouts being 30 seconds).

But at the same time, if you make your timeouts too low, you can get several connection errors. So I would recommend making the maximum number of connections much higher if you want to keep the same crawl aggressiveness. Otherwise, or in addition, increase the delay a bit; see GenericDelayResolver.
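
As an illustration, a gentler crawl could be configured with a larger crawler-level delay; the class path below is assumed from the 2.x documentation and the 100 ms value is only an example:

```xml
<!-- Wait at least 100 ms between requests to the same site. -->
<delay default="100" class="com.norconex.collector.http.delay.impl.GenericDelayResolver" />
```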

I like that you are taking advantage of configuration fragments (#parse). I noticed you are also including a config fragment for your start URLs. You may be interested to know you can use a <urlsFile> tag instead of including a config fragment (or in addition to it). That tag allows you to specify a flat file with one URL per line and no need to XML-escape (i.e., a seed file). It is usually easier to maintain when you have a long list.
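
For example, a sketch of that approach (the seed file path is hypothetical):

```xml
<startURLs>
  <!-- Plain text file, one URL per line, no XML escaping required. -->
  <urlsFile>/opt/crawler/conf.local/replicationgroups/Replication-seeds.txt</urlsFile>
</startURLs>
```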

It is nice to see you push the limits of the HTTP Collector. Please keep us informed of your benchmarks.

OkkeKlein commented 9 years ago

Ok. It appears I'm running into this problem as well (java.net.SocketTimeoutException: Read timed out / Timeout waiting for connection from pool). I am only crawling with 4 threads, using the code from 2 July.

```xml
  <connectionTimeout>60000</connectionTimeout>
  <socketTimeout>60000</socketTimeout>
  <connectionRequestTimeout>30000</connectionRequestTimeout>
```

schwipee commented 9 years ago

Hey Pascal, we get almost the same number of exceptions with both 150 and 850 maxConnections. Do you have another idea?

Best

OkkeKlein commented 9 years ago

@schwipee you might want to try lowering <connectionTimeout>300000</connectionTimeout> to <connectionTimeout>30000</connectionTimeout>. Five minutes is a long time to wait for a connection.

Also try setting <connectionRequestTimeout>0</connectionRequestTimeout> so it will wait for a connection from the manager indefinitely.
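
Put together, the suggested changes would look something like this (values as discussed above):

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <!-- 30 seconds to establish a connection instead of 5 minutes. -->
  <connectionTimeout>30000</connectionTimeout>
  <!-- 0 = wait indefinitely for a connection from the pool manager. -->
  <connectionRequestTimeout>0</connectionRequestTimeout>
</httpClientFactory>
```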

OkkeKlein commented 9 years ago

Tested it myself and there was some improvement, but not enough:

```xml
  <connectionTimeout>30000</connectionTimeout>
  <socketTimeout>30000</socketTimeout>
  <connectionRequestTimeout>0</connectionRequestTimeout>
```

30 seconds of socketTimeout should be more than enough, but I am still getting read timeouts.

OkkeKlein commented 9 years ago

A new test with all timeouts at 60000 ms gives much better results. I am also not getting the connection pool exceptions anymore, just read timeouts.

Why the value has to be this large puzzles me, as the website should respond much quicker.

essiembre commented 9 years ago

The bottleneck is the number of connections available in your pool at any given time vs. the number of threads asking for one. The connections do not become available/released fast enough when you request too many, too quickly. I think it is a math issue at this point. Even with the upcoming release enabling the new stale-connection check introduced in a recent version of HttpClient, it may not solve it entirely. The only way I can prevent these exceptions 100% is to have each thread wait until a connection becomes available again as a default behavior. That is not ideal, as it may hide a problem with your config. Not sure what's best here.

Have you tried putting an insanely high number for <maxConnections>? If you put something like 10000000 (or whatever the OS maximum is), maybe it will never reach that maximum before connections start expiring naturally after whatever timeout you provide. Something worth trying.

As for site response time, maybe the site does not respond as fast when you issue 60 requests at once, or maybe it is the local process that can't handle as much data simultaneously and causes the delays (CPU, bandwidth, etc.).

OkkeKlein commented 9 years ago

I only run the crawl with 4 threads, so 30-second timeouts should not happen, let alone 60 seconds. But setting all timeouts to 60 seconds at least resolved my problem for the most part.

essiembre commented 9 years ago

Can one of you share the site(s) you are crawling and possibly the minimum config required to reproduce? I cannot reproduce with 10 threads and a 10-millisecond delay (using default values for the various configurable timeouts).

essiembre commented 9 years ago

You can try the latest snapshot. GenericHttpClientFactory now has these three new configuration options (from the latest Apache HttpClient library):

<maxConnectionsPerRoute>...</maxConnectionsPerRoute>
<maxConnectionIdleTime>(milliseconds)</maxConnectionIdleTime>
<maxConnectionInactiveTime>(milliseconds)</maxConnectionInactiveTime>

maxConnectionsPerRoute seems to indicate how many connections can be made to the same target; the default used to be 2 and is now 20. maxConnectionIdleTime will evict connections that have been idle in the pool for longer than the time specified. maxConnectionInactiveTime tells Apache HttpClient to check whether a connection is stale after it has been inactive for the time you supply.

In addition, the default max connections was 20. It is now 200.
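
For reference, a sketch combining these new options with the existing ones (all values are illustrative only):

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <maxConnections>200</maxConnections>
  <maxConnectionsPerRoute>20</maxConnectionsPerRoute>
  <!-- Evict connections that sat idle in the pool longer than this (milliseconds). -->
  <maxConnectionIdleTime>10000</maxConnectionIdleTime>
  <!-- Re-validate a connection after this much inactivity before reusing it (milliseconds). -->
  <maxConnectionInactiveTime>5000</maxConnectionInactiveTime>
</httpClientFactory>
```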

Hopefully these changes and additional options will help to get rid of your timeout issues. If they persist, let me know and we can create a new feature request to retry failed connections a configurable number of times.

schwipee commented 9 years ago

Hey Pascal, thanks for pushing the new config options. After a few tests, we have no timeouts anymore. In addition, the new version/config is much faster (2 minutes instead of 30-45 minutes). For me the ticket can be closed! Best regards

OkkeKlein commented 9 years ago

@schwipee can you share some information about the settings used?

essiembre commented 9 years ago

@schwipee: Awesome! Thanks for trying and sharing the results with us. I am happy to close it.