There could be multiple causes. Maybe it is a sign that you cannot connect properly. Have you tried to connect via a browser running on the same machine where the collector runs? If that works, is your browser going through a proxy? See if you can connect to the URLs you are trying to crawl with command-line tools like `wget` or `curl`.
The URL fragment you pasted seems to be pointing to a search engine. Is it possible the search takes too long to return? If so, you may want to try increasing the timeout values by configuring GenericHttpClientFactory accordingly.
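For example, the timeouts can be raised in the crawler config along these lines (all values are in milliseconds and purely illustrative, not recommendations):

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <!-- Time allowed to establish a connection (illustrative value) -->
  <connectionTimeout>60000</connectionTimeout>
  <!-- Time allowed between data packets once connected (illustrative value) -->
  <socketTimeout>60000</socketTimeout>
  <!-- Time allowed to obtain a connection from the internal pool (illustrative value) -->
  <connectionRequestTimeout>60000</connectionRequestTimeout>
</httpClientFactory>
```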
If all these tests work and it only fails with the HTTP Collector, please paste the URL so we can try to reproduce.
Hey essiembre, thanks for your help! After we increased the timeout, no read timeouts occur anymore. But we have another issue with timeouts: "Timeout waiting for connection from pool". It looks like this:

```
com.norconex.collector.core.CollectorException: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
	at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:148)
	at com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DocumentFetcherStage.executeStage(HttpImporterPipeline.java:147)
	at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
	at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
	at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
	at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
	at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
	at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
	at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
	at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:94)
	... 11 more
```

Do you know where this comes from? (Apache connection timeouts?) Best regards
One of the reasons may be an HTTP connection leak, but it's the first time it's being reported. How many threads are you using? Do you mind sharing your config so we can try to reproduce?
Hey Pascal, we had 60 threads and 100 maxConnections. Our config looks like this:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!--
   Copyright 2010-2014 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<httpcollector id="Replication">

  #set($workdir = "/opt/crawler/output")

  <!-- Decide where to store generated files. -->
  <progressDir>$workdir/progress</progressDir>
  <logsDir>$workdir/logs</logsDir>

  <crawlerDefaults>

    <!-- Filter BEFORE download with RobotsTxt rules. Classes must
         implement *.robot.IRobotsTxtProvider. Default implementation
         is the following. -->
    <robotsTxt ignore="true" class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider"/>

    <!-- Establish whether to follow a page URLs or to index a given page
         based on in-page meta tag robot information. Classes must implement
         com.norconex.collector.http.robot.IRobotsMetaProvider.
         Default implementation is the following. -->
    <robotsMeta ignore="true" class="com.norconex.collector.http.robot.impl.StandardRobotsMetaProvider" />

    <userAgent>Crawler</userAgent>
    <numThreads>60</numThreads>

    <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
    <maxDepth>3</maxDepth>

    <!-- Be as nice as you can to sites you crawl. -->
    <delay default="10" />

    <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
      <maxConnections>100</maxConnections>
      <connectionTimeout>300000</connectionTimeout>
    </httpClientFactory>

  </crawlerDefaults>

  <crawlers>
    <crawler id="Replication">

      <!-- Where the crawler default directory to generate files is. -->
      <workDir>$workdir</workDir>

      <!-- Requires at least one start URL. -->
      <startURLs>
        #parse("/opt/crawler/conf.local/replicationgroups/Replication-starturls.xml")
      </startURLs>

      <!-- At a minimum make sure you stay on your domain. -->
      <referenceFilters>
        #parse("/opt/crawler/conf/shared-crawler-filters.xml")
      </referenceFilters>

      <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor">
          <!-- Which tags and attributes hold the URLs to extract -->
          <tags>
            <tag name="script" attribute="src" />
            <tag name="a"      attribute="href" />
            <tag name="frame"  attribute="src" />
            <tag name="iframe" attribute="src" />
            <tag name="img"    attribute="src" />
            <tag name="meta"   attribute="http-equiv" />
            <tag name="link"   attribute="href" />
          </tags>
        </extractor>
      </linkExtractors>

    </crawler>
  </crawlers>
</httpcollector>
```
Best Peer
That's a lot of threads! :-) I am curious to know what kind of machine you are using.
After some digging, it appears that sometimes the HTTP client is unable to detect that the server side closed the socket (usually unexpectedly). In that case the connection becomes stale and remains in the pool without the socket being closed (and it cannot be used by another request).
I suspect this is what's happening to you here. And since you are performing an aggressive crawl, it is very possible the client does not have time to realize the socket is dead on a connection before making it available to another request again... so it opens a new connection, and you eventually run out.
Starting with 4.4, Apache HttpClient added a feature so it can validate whether a connection has gone stale before leasing it again after a certain period of inactivity. By default it does not perform that check. I will mark this ticket as a feature request to make this a configurable setting in the next release (along with other new features added in the most recent version of HttpClient).
In the meantime, I recommend you try setting a `<socketTimeout>` value so sockets that stay open too long get closed, no matter what.
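Applied to the config posted above, that could mean something like the following in the existing httpClientFactory block (the 30000 ms value is only an example, not a recommendation):

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <maxConnections>100</maxConnections>
  <connectionTimeout>300000</connectionTimeout>
  <!-- Example value: give up on a socket after 30 seconds without data -->
  <socketTimeout>30000</socketTimeout>
</httpClientFactory>
```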
All this being said, the above may not resolve your issue entirely because you may be doing too much in too little time. I mean, with potentially up to 60 requests fired every 10 milliseconds, if it takes a few seconds before a socket connection is detected as closed, you will soon have many more requests than you have free connections available (your existing timeouts being 30 seconds).
But at the same time, if you make your timeouts too low you can get several connection errors. So I would recommend making the max number of connections much higher if you want to keep the same crawl aggressiveness. Otherwise, or in addition, increase the delay a bit. See GenericDelayResolver.
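As a rough sketch, that combination could look like this (the numbers are placeholders to illustrate the idea, not tuned recommendations):

```xml
<!-- Allow more simultaneous connections than threads so stale ones don't starve the pool -->
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <maxConnections>500</maxConnections>
  <connectionTimeout>300000</connectionTimeout>
</httpClientFactory>
<!-- A slightly larger delay between hits, using GenericDelayResolver's default attribute -->
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver" default="100" />
```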
I like that you are taking advantage of configuration fragments (#parse). I noticed you are also including a config fragment for your start URLs. You may be interested to know you can use a `<urlsFile>` tag instead of (or in addition to) a config fragment. That tag lets you specify a flat file with one URL per line and no need to XML-escape (i.e., a seed file). It is usually easier to maintain when you have a long list.
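A minimal sketch, assuming a hypothetical seed file path:

```xml
<startURLs>
  <!-- Hypothetical seed file: one URL per line, no XML escaping needed -->
  <urlsFile>/opt/crawler/conf.local/replicationgroups/Replication-seeds.txt</urlsFile>
</startURLs>
```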
It is nice to see you push the limits of the HTTP Collector. Please keep us informed of your benchmarks.
OK, it appears I'm running into this problem as well (java.net.SocketTimeoutException: Read timed out) along with "Timeout waiting for connection from pool". I am only crawling with 4 threads, using the code from July 2.
```xml
<socketTimeout>60000</socketTimeout>
<connectionRequestTimeout>30000</connectionRequestTimeout>
```
Hey Pascal, we have almost the same number of exceptions with both 150 and 850 maxConnections. Do you have another idea?
Best
@schwipee you might want to try lowering `<connectionTimeout>300000</connectionTimeout>` to `<connectionTimeout>30000</connectionTimeout>`. Five minutes is a long time to wait for a connection.
Also try setting `<connectionRequestTimeout>0</connectionRequestTimeout>` so it will wait indefinitely for a connection from the manager.
Tested it myself with `<connectionRequestTimeout>0</connectionRequestTimeout>` and there was some improvement, but not enough. A socketTimeout of 30 seconds should be more than enough, but I am still getting read timeouts.
A new test with all timeouts at 60000 ms gives much better results. I am also no longer getting the connection pool exceptions, just read timeouts.
Why the value has to be this large puzzles me, as the website should respond much quicker.
The bottleneck is the number of connections available in your pool at any given time versus the number of threads asking for one. The connections do not become available/released fast enough when you request too many, too quickly. I think it is a math issue at this point. Even with a future release enabling the new stale-connection check introduced in a recent version of HttpClient, it may not solve it entirely. The only way I can prevent these exceptions 100% of the time is to have each thread wait until a connection becomes available again as the default behavior. That is not ideal, as it may hide a problem with your config. Not sure what's best here.
Have you tried putting an insanely high number for `<maxConnections>`? If you put something like 10000000 (or whatever the OS maximum is), maybe it will never reach that maximum before connections start expiring naturally after whatever timeout you provide. Something worth trying.
As for site response time, maybe the site does not respond as fast when you issue 60 requests at once, or maybe it is the local process that can't handle as much data simultaneously and causes the delays (CPU, bandwidth, etc.).
I only run the crawl with 4 threads, so 30-second timeouts should not happen, let alone 60 seconds. But setting all timeouts to 60 seconds at least resolved my problem for the most part.
Can one of you share the site(s) you are crawling and possibly the minimum config required to reproduce? I cannot reproduce with 10 threads and 10 milliseconds delay (using default values for various configurable timeouts).
You can try the latest snapshot. GenericHttpClientFactory now has these three new configuration options (from the latest Apache HttpClient library):
<maxConnectionsPerRoute>...</maxConnectionsPerRoute>
<maxConnectionIdleTime>(milliseconds)</maxConnectionIdleTime>
<maxConnectionInactiveTime>(milliseconds)</maxConnectionInactiveTime>
`maxConnectionsPerRoute` seems to indicate how many connections can be made to the same target. The default used to be 2; it is now 20.
`maxConnectionIdleTime` will evict connections that have been idle in the pool for the time specified.
`maxConnectionInactiveTime` tells Apache HttpClient to check whether a connection has gone stale after it has been inactive for the time you supply.
In addition, the default max connections was 20. It is now 200.
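For illustration, the new options sit in the httpClientFactory section alongside the existing ones; the values below are placeholders rather than recommendations:

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <maxConnections>200</maxConnections>
  <!-- Up to 20 concurrent connections to the same host (the new default) -->
  <maxConnectionsPerRoute>20</maxConnectionsPerRoute>
  <!-- Evict connections that sit idle in the pool longer than 10 seconds (placeholder) -->
  <maxConnectionIdleTime>10000</maxConnectionIdleTime>
  <!-- Re-check a connection that has been inactive for more than 5 seconds (placeholder) -->
  <maxConnectionInactiveTime>5000</maxConnectionInactiveTime>
</httpClientFactory>
```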
Hopefully these changes and additional options will help to get rid of your timeout issues. If they persist, let me know and we can create a new feature request to retry failed connections a configurable number of times.
Hey Pascal, thanks for pushing the new config options. After a few tests, we have no timeouts anymore. In addition, the new version/configs are much faster (2 minutes instead of 30-45 minutes). For me the ticket can be closed! Best regards
@schwipee can you share some information about the settings used?
@schwipee: Awesome! Thanks for trying and sharing the results with us. I am happy to close it.
Hey, I have an issue with read timeouts. For some channels it works perfectly, but for others it does not. It only happens from time to time. I have no request in the log file for that time, so the crawler did not reach the URL. I really don't know where the problem is.
```
XXXXReplication: 2015-06-09 01:10:46 ERROR - Cannot fetch document: XXXXXXXXX/?SearchTerm=* (Read timed out)
XXXXReplication: 2015-06-09 01:10:46 ERROR - XXXXReplication: Could not process document: XXXXXXXXX/?SearchTerm=* (java.net.SocketTimeoutException: Read timed out)
com.norconex.collector.core.CollectorException: java.net.SocketTimeoutException: Read timed out
	at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:148)
	at com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DocumentFetcherStage.executeStage(HttpImporterPipeline.java:147)
	at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
```
Best regards