apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

okhttp protocol: trimmed content because of content limit not reliably marked #756

Closed: sebastian-nagel closed this issue 5 years ago

sebastian-nagel commented 5 years ago

(see NUTCH-2729 and commoncrawl/nutch#10 for the same issue in Nutch)

The marking of trimmed content (by content limit) is not reliable and reproducibly fails for compressed or chunked content, or when there is no Content-Length header: in these cases, whether `http.trimmed: true` is set in the metadata depends on whether okhttp's internal buffer happens to hold more data than was requested, and especially for compressed content the buffer tends to hold exactly the number of requested bytes, so the truncation goes undetected. For reliable detection we need to request one byte more than the configured `http.content.limit`.
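
To make the proposed fix concrete, here is a minimal sketch against okio's BufferedSource (the helper name and signature are hypothetical, not the actual HttpProtocol code): asking for `maxContent + 1` bytes forces okhttp to try to fill its buffer past the limit, so the return value of request() reliably tells whether more content was available.

```java
import java.io.IOException;

import okio.BufferedSource;

class TrimmedBodyReader {

    /**
     * Reads at most maxContent bytes from the response body and reports
     * whether it had to be trimmed. Hypothetical helper sketched for
     * this issue, not the actual StormCrawler implementation.
     */
    static byte[] readBody(BufferedSource source, int maxContent,
            boolean[] trimmed) throws IOException {
        // Ask for one byte MORE than the limit: request() returns true
        // only if the buffer can be filled with maxContent + 1 bytes,
        // i.e. the body is provably longer than the limit.
        trimmed[0] = source.request(maxContent + 1L);
        if (trimmed[0]) {
            // Body exceeds the limit: keep exactly maxContent bytes.
            return source.readByteArray(maxContent);
        }
        // Body fits within the limit: read it completely.
        return source.readByteArray();
    }
}
```

With this approach a body of exactly `http.content.limit` bytes is no longer ambiguous: request() fails to produce the extra byte, so the content is not marked as trimmed.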

One example, fetching a 9 MB sitemap with `http.content.limit: 1048576` and `http.store.headers: true`:

```
> java -cp ... com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol ... http://localhost/sitemap.xml
http://localhost/sitemap.xml
date: Thu, 26 Sep 2019 14:03:25 GMT
server: Apache/2.4.29 (Ubuntu)
transfer-encoding: chunked
vary: Accept-Encoding
last-modified: Mon, 19 Mar 2018 07:05:39 GMT
keep-alive: timeout=5, max=100
_request.headers_: GET /sitemap.xml 
...
Accept-Encoding: gzip

...
_response.ip_: 127.0.0.1
_response.headers_: HTTP/1.1 200 OK
...
Vary: Accept-Encoding
Content-Encoding: gzip
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: application/xml

status code: 200
content length: 1048576
fetched in : 157 msec
```

The content has exactly the size of the limit, but no trimming/truncation is marked.
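
For comparison, a sketch of the unreliable variant that produces the behavior above (illustrative only, not the exact code in HttpProtocol): it fills the buffer only up to the limit and then inspects what the buffer happens to hold.

```java
import java.io.IOException;

import okio.BufferedSource;

class UnreliableTrimCheck {

    // Illustrative sketch: fill the buffer only up to the limit, then
    // look at how much data it happens to hold.
    static boolean isTrimmed(BufferedSource source, int maxContent)
            throws IOException {
        source.request(maxContent);
        // For chunked or gzip-compressed bodies okhttp tends to buffer
        // exactly maxContent bytes even when more data is available on
        // the wire, so this check returns false and the body is
        // silently cut at the limit, as in the sitemap fetch above.
        return source.getBuffer().size() > maxContent;
    }
}
```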

jnioche commented 5 years ago

thanks @sebastian-nagel