apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0
883 stars, 261 forks

Replace HTTP protocol implementation #23

Closed: jakekdodd closed this issue 9 years ago

jakekdodd commented 9 years ago

Before doing anything with this, I wanted to check whether there is a reason HTTP 1.0 is hardcoded. Right now, line 167 of HttpResponse sets the protocol version to 1.0, regardless of whether http.useHttp11: true is set in the configuration:

StringBuffer reqStr = new StringBuffer("GET ");
if (http.useProxy()) {
    reqStr.append(url.getProtocol() + "://" + host + portString
            + path);
} else {
    reqStr.append(path);
}

reqStr.append(" HTTP/1.0\r\n");

Is this intentional?
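
For illustration, here is a minimal sketch of a request line that takes the protocol version from a flag instead of hardcoding it. The class RequestLineSketch, the method buildRequestLine, and all of its parameters are hypothetical names invented for this example. The useHttp11 flag is only assumed to mirror the http.useHttp11 setting; this is not the project's actual code.

```java
// Hypothetical helper, not part of the project: builds the GET request line
// and picks the protocol version from a flag instead of hardcoding HTTP/1.0.
final class RequestLineSketch {

    static String buildRequestLine(String path, String absoluteUrl,
                                   boolean useProxy, boolean useHttp11) {
        StringBuilder reqStr = new StringBuilder("GET ");
        // A proxy needs the absolute URL; a direct connection only needs the path.
        reqStr.append(useProxy ? absoluteUrl : path);
        // Assumption: useHttp11 carries the value of the http.useHttp11 setting.
        reqStr.append(useHttp11 ? " HTTP/1.1\r\n" : " HTTP/1.0\r\n");
        return reqStr.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildRequestLine(
                "/index.html", "http://example.com/index.html", false, true));
    }
}
```

Changing the version string alone does not make a client compliant. An HTTP/1.1 request must also send a Host header, and the client has to cope with chunked responses and persistent connections.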

jnioche commented 9 years ago

Not really. As you probably noticed, the protocol code is a port of the Nutch equivalent; I simply forgot to make it honor the setting. I'm not sure the protocol implementation does anything with the features of 1.1 [http://www8.org/w8-papers/5c-protocols/key/key.html] anyway.

Ideally we should replace this low-level code with an external library such as [http://hc.apache.org/] if possible.

jakekdodd commented 9 years ago

Looks like HttpResponse has a mechanism for handling chunked Transfer-Encoding, which is in HTTP/1.1. I didn't read through the whole class, so it's possible there are a few other things from HTTP/1.1 in there.
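
As a rough illustration of what handling chunked Transfer-Encoding involves, the sketch below decodes a chunked body from a stream: read a hexadecimal chunk size, read that many bytes, and stop at the zero-size chunk. The class ChunkedBodySketch and its methods are hypothetical names for this example and are not taken from HttpResponse. Error handling is deliberately minimal, so a malformed size line simply surfaces as a NumberFormatException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of chunked transfer decoding; not the project's code.
final class ChunkedBodySketch {

    // Reads one line up to LF and returns it without CR/LF ("" at end of stream).
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') break;
            if (b != '\r') line.write(b);
        }
        return new String(line.toByteArray(), StandardCharsets.US_ASCII);
    }

    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            // Ignore any chunk extension that follows ';'.
            int semi = sizeLine.indexOf(';');
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int size = Integer.parseInt(sizeLine.trim(), 16);
            if (size == 0) break; // the last chunk has size zero
            byte[] chunk = new byte[size];
            int read = 0;
            while (read < size) {
                int n = in.read(chunk, read, size - read);
                if (n == -1) throw new IOException("Stream ended inside a chunk");
                read += n;
            }
            body.write(chunk, 0, size);
            readLine(in); // consume the CRLF that terminates the chunk data
        }
        // Skip optional trailer headers up to the final empty line.
        while (!readLine(in).isEmpty()) { }
        return body.toByteArray();
    }

    // Convenience wrapper for ASCII test input.
    static String decodeAscii(String raw) {
        try {
            byte[] out = decode(new ByteArrayInputStream(
                    raw.getBytes(StandardCharsets.US_ASCII)));
            return new String(out, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeAscii("4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"));
    }
}
```

Because an HTTP/1.0 request should never receive a chunked response, code like this is only exercised once the request actually advertises HTTP/1.1.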

I'll try this out with 1.1 enabled, and see if everything works as expected. The fact that this is still hardcoded in Nutch's protocol-http plugin, and only protocol-httpclient works with 1.1, makes me wonder if something breaks with 1.1 enabled.

And +1 for using an external library. Nutch's protocol-httpclient looks like a good starting point.

kkrugler commented 9 years ago

There was an incomplete stab at creating an HTTP fetcher in crawler-commons that I started a while ago. And I think that Oleg did some work in the Droids project to create a few custom HttpClient classes that would make it more resilient to broken response headers and such, IIRC.

jnioche commented 9 years ago

@jakekdodd Nutch's protocol-httpclient is not very reliable and uses a deprecated version of the API. Better to start from a clean slate I think.

jakekdodd commented 9 years ago

@jnioche huh, good to know. The google-http-java-client is another one we might want to examine. I haven't used it personally, but it utilizes Apache HttpClient, and Google's stuff tends to work pretty well.

jnioche commented 9 years ago

I've made good progress on an HTTP implementation based on [http://hc.apache.org/]. Will commit after the 0.5 release.