jakekdodd closed this 9 years ago
Not really. As you probably noticed, the code for the protocol is a port of the Nutch equivalent. It's just that I forgot to do that. I'm not sure the protocol implementation does anything with the features of 1.1 [http://www8.org/w8-papers/5c-protocols/key/key.html] anyway.
Ideally we should replace this low level stuff by an external library like [http://hc.apache.org/] if possible.
Looks like HttpResponse has a mechanism for handling chunked Transfer-Encoding, which is in HTTP/1.1. I didn't read through the whole class, so it's possible there are a few other things from HTTP/1.1 in there.
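For readers unfamiliar with chunked Transfer-Encoding, here is a minimal sketch of how a chunked body can be decoded. It is not the code in HttpResponse; the class and method names (ChunkedBodySketch, decode, readLine) are made up for illustration, and it assumes the stream is positioned just after the response headers.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the project: decodes a chunked HTTP/1.1 body.
final class ChunkedBodySketch {

    // Reads chunks until the zero-length chunk, then skips any trailer lines.
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            // Each chunk starts with its size in hex, optionally followed by ";extensions".
            String sizeLine = readLine(in);
            int semi = sizeLine.indexOf(';');
            if (semi >= 0) {
                sizeLine = sizeLine.substring(0, semi);
            }
            int size = Integer.parseInt(sizeLine.trim(), 16);
            if (size == 0) {
                break; // last chunk
            }
            byte[] chunk = new byte[size];
            int off = 0;
            while (off < size) {
                int n = in.read(chunk, off, size - off);
                if (n < 0) {
                    throw new IOException("truncated chunk");
                }
                off += n;
            }
            body.write(chunk, 0, size);
            readLine(in); // consume the CRLF that follows the chunk data
        }
        // Skip optional trailer headers; an empty line (or end of stream) ends them.
        while (!readLine(in).isEmpty()) {
            // ignore trailers in this sketch
        }
        return body.toByteArray();
    }

    // Reads one line terminated by LF, dropping any CR characters.
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') {
                line.write(b);
            }
        }
        return new String(line.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

A real implementation would also need size limits and stricter error handling for malformed chunk headers, which this sketch leaves out.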
I'll try this out with 1.1 enabled, and see if everything works as expected. The fact that this is still hardcoded in Nutch's protocol-http plugin, and only protocol-httpclient works with 1.1, makes me wonder if something breaks with 1.1 enabled.
And +1 for using an external library. Nutch's protocol-httpclient looks like a good starting point.
There was an incomplete stab at creating an HTTP fetcher in crawler-commons that I started a while ago. And I think that Oleg did some work in the Droids project to create a few custom HttpClient classes that would make it more resilient to broken response headers and such, IIRC.
@jakekdodd Nutch's protocol-httpclient is not very reliable and uses a deprecated version of the API. Better to start from a clean slate I think.
@jnioche huh, good to know. The google-http-java-client is another one we might want to examine. I haven't used it personally, but it utilizes Apache HttpClient, and Google's stuff tends to work pretty well.
I've made good progress on an HTTP implementation based on [http://hc.apache.org/]. Will commit after the 0.5 release.
Before I do anything with this, I wanted to check whether there is a reason why HTTP 1.0 is hardcoded. Right now, line 167 of HttpResponse sets the protocol version to 1.0, regardless of whether
http.useHttp11: true
is set in the configuration. Is this intentional?
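To make the question concrete, here is a rough sketch of what honoring the setting could look like. It is not the actual HttpResponse code: the class and method names (RequestLineSketch, httpVersion, requestLine) are hypothetical, and the configuration is assumed to be a plain Map that may hold the http.useHttp11 key as a Boolean or a String.

```java
import java.util.Map;

// Hypothetical sketch: pick the protocol version from configuration
// instead of hardcoding HTTP/1.0.
final class RequestLineSketch {

    // Returns "HTTP/1.1" only when http.useHttp11 is present and true.
    static String httpVersion(Map<String, Object> conf) {
        Object value = conf.get("http.useHttp11");
        boolean useHttp11 = value != null && Boolean.parseBoolean(value.toString());
        return useHttp11 ? "HTTP/1.1" : "HTTP/1.0";
    }

    // Builds the first line of a GET request for the given path.
    static String requestLine(String path, Map<String, Object> conf) {
        return "GET " + path + " " + httpVersion(conf) + "\r\n";
    }
}
```

Note that switching the version string alone is not enough: an HTTP/1.1 request must carry a Host header, and the client has to cope with chunked responses and persistent connections.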