apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

org.apache.http.NoHttpResponseException: The target server failed to respond #405

Closed Laurent-Hervaud closed 4 years ago

Laurent-Hervaud commented 7 years ago

I have a lot of FETCH_ERROR statuses (about ten percent on one million French URLs). In debug I can see this error: org.apache.http.NoHttpResponseException: The target server failed to respond. Sometimes the fetch works after several retries. Here are some of the URLs:

http://www.serigraph-herault.fr/
http://www.courtage-mayenne.fr/
http://www.alur-diagnostics-sete.fr/

Is there something wrong with HttpProtocol.java and org.apache.http.impl.client.HttpClients? I am continuing my investigation.

jnioche commented 7 years ago

Could it be that you are at the limits of your bandwidth? How many fetch threads are you using?

Laurent-Hervaud commented 7 years ago

I thought that at first, but I get the same error with just one URL in the seed list.

jnioche commented 7 years ago

The 3 sites you mentioned earlier all point to the same server (193.252.138.58). Could it be that you got blacklisted by them?

Laurent-Hervaud commented 7 years ago

I know they are on the same server; it is the largest web hosting company for professionals in France. I ran several tests, locally and on AWS, to check for blacklisting. Everything works with a simple curl. I also tried with Nutch and it works, and I tried several different user agents.

jnioche commented 7 years ago

I can't reproduce the issue.

6493 [FetcherThread] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched http://www.serigraph-herault.fr/ with status 200 in msec 363
6521 [FetcherThread] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched http://www.alur-diagnostics-sete.fr/ with status 200 in msec 393
6530 [FetcherThread] INFO  c.d.s.b.FetcherBolt - [Fetcher #3] Fetched http://www.courtage-mayenne.fr/ with status 200 in msec 402
Laurent-Hervaud commented 7 years ago

I found the mistake in HttpProtocol.java: enabling automatic retries (i.e. commenting out disableAutomaticRetries()) makes the fetch succeed:

builder = HttpClients.custom()
        .setUserAgent(userAgent)
        .setConnectionManager(CONNECTION_MANAGER)
        .setConnectionManagerShared(true)
        .disableRedirectHandling();
// .disableAutomaticRetries();

Here is the resulting log:

12932 [Thread-48-fetch-executor[14 14]] INFO o.a.h.i.e.RetryExec - I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://www.serigraph-herault.fr:80: The target server failed to respond
12933 [Thread-48-fetch-executor[14 14]] INFO o.a.h.i.e.RetryExec - Retrying request to {}->http://www.serigraph-herault.fr:80
13300 [Thread-48-fetch-executor[14 14]] INFO c.d.s.b.SimpleFetcherBolt - [Fetcher #14] Fetched http://www.serigraph-herault.fr with status 200 in 371 after waiting 0

Why disable AutomaticRetries and RedirectHandling?

jnioche commented 7 years ago

I would not call that a mistake. Retrying the URL does not explain why it failed in the first place; as you pointed out initially, it worked after retrying.

Why disable AutomaticRetries and RedirectHandling?

  1. retries -> because we want to control politeness and be efficient: there is no point in retrying right away when it is likely to fail again, while we could be fetching from a different server instead
  2. redirects -> politeness again, and also the target URL could already be known and perhaps even fetched

You can set the fetch schedule for FETCH_ERROR to a low value so that the URL becomes eligible for re-fetching soon.
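For instance, a minimal sketch of what that could look like in crawler-conf.yaml (the key below is the one used by the default scheduler; the value of 30 minutes is only illustrative):

# re-fetch a page that returned a fetch error after 30 minutes (value in minutes)
fetchInterval.fetch.error: 30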

It would be interesting to know why it fails on the first attempt.

jnioche commented 7 years ago

Closing for now. Please reopen if necessary.

jnioche commented 7 years ago

Note for self: seeing the same problem with

The explanation can be found here: http://stackoverflow.com/questions/10558791/apache-httpclient-interim-error-nohttpresponseexception

The issue does not happen when specifying http.skip.robots=true. My interpretation is that the server closes the connection prematurely when we fetch the robots.txt, and when we query the main URL straight away on that connection we get this issue.

Setting a retry value at the protocol level is one possible solution, but as pointed out earlier such URLs get retried by StormCrawler later on anyway, with the politeness.
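For reference, a minimal sketch of what such a protocol-level retry could look like with the builder used in HttpProtocol (userAgent and CONNECTION_MANAGER are the existing fields from that class; the retry count of 2 is only an assumption, not what is actually configured):

import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;

// Retry a request up to 2 times on I/O errors such as NoHttpResponseException,
// instead of disabling automatic retries on the builder.
HttpClientBuilder builder = HttpClients.custom()
        .setUserAgent(userAgent)
        .setConnectionManager(CONNECTION_MANAGER)
        .setConnectionManagerShared(true)
        .disableRedirectHandling()
        .setRetryHandler(new DefaultHttpRequestRetryHandler(2, false));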

A better approach is suggested in http://stackoverflow.com/questions/10570672/get-nohttpresponseexception-for-load-testing/10680629, i.e. setting a lower validate-before-reuse time:

// re-validate pooled connections that have been idle for longer than this value
// (in milliseconds) before reusing them
connectionManager = new PoolingHttpClientConnectionManager();
connectionManager.setValidateAfterInactivity(connectionValidateLimit);

but even if we set a low value for setValidateAfterInactivity(), the check only kicks in when a connection has actually been inactive, so it would not help unless we also applied the politeness delay between the call to robots.txt and the fetching of the URL, for which I had opened (and closed) #343.

As a quick test, I added a Thread.sleep call of a few seconds to the HttpRobotRulesParser and the fetches were successful after that! I will reopen #343 but make the behaviour configurable. Ideally, the FetcherBolt could, if configured to be polite after querying robots, put the URL back into its queue and deal with another queue until the politeness delay has passed.
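To make the quick test concrete, a minimal sketch of the kind of pause that was added (the 5000 ms value and the exact placement inside HttpRobotRulesParser are assumptions for illustration only):

// Pause after fetching robots.txt so that the pooled connection is not reused
// immediately for the page request; the delay value is illustrative only.
try {
    Thread.sleep(5000L);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}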

abhishekransingh commented 6 years ago

Another way to retry: if you're using Spring, you can use @Retryable. Here is the code snippet:

@Retryable(maxAttemptsExpression = "#{${startup.job.max.try}}",
        value = {NoHttpResponseException.class},
        backoff = @Backoff(delayExpression = "#{${startup.job.delay}}",
                multiplierExpression = "#{${startup.job.multiplier}}"))
public void callHttpEndpoint() throws IOException {
    // Your code to call the HTTP REST endpoint here
}

ade90036 commented 5 years ago

@jnioche why does robots.txt have anything to do with this issue? Is the underlying httpclient trying too hard to be smart?

jnioche commented 4 years ago

Can't reproduce the problem; it was probably fixed by upgrading the version of httpclient.