apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

HttpProtocol doesn't consider http.content.limit in test for filesize #970

Closed: wowasa closed this issue 2 years ago

wowasa commented 2 years ago

Line 327 of the class contains the check `entity.getContentLength() <= Constants.MAX_ARRAY_SIZE`, which should, in my view, take the configured http.content.limit into account. We are getting an exception for very large files even though we have http.content.limit=0, since we only need some information from the HTTP header.
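
For illustration, here is a minimal, self-contained sketch of what an adjusted check could look like. It is not the actual HttpProtocol code: `checkLength`, `maxContent` and the local `MAX_ARRAY_SIZE` constant are hypothetical stand-ins for the real field and constant.

```java
import java.io.IOException;

/**
 * Sketch only, NOT the actual HttpProtocol code: MAX_ARRAY_SIZE and the
 * maxContent parameter (the value of http.content.limit) are stand-ins
 * for the real constant and field.
 */
public class ContentLimitCheckSketch {

    // roughly the largest byte[] a JVM will allocate
    private static final long MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    /**
     * @param reportedLength Content-Length announced by the server
     * @param maxContent     configured http.content.limit; 0 means only the
     *                       headers are needed, a negative value means no limit
     */
    static void checkLength(long reportedLength, long maxContent) throws IOException {
        // a configured limit means the body gets trimmed to it anyway,
        // so compare the trimmed size rather than the raw Content-Length
        long effective =
                (maxContent >= 0) ? Math.min(reportedLength, maxContent) : reportedLength;

        if (effective > MAX_ARRAY_SIZE) {
            throw new IOException(
                    "Content length " + reportedLength + " exceeds maximum array size");
        }
    }

    public static void main(String[] args) throws IOException {
        // a 10 GB file, but only the headers are wanted: no exception any more
        checkLength(10L * 1024 * 1024 * 1024, 0);
        System.out.println("check passed");
    }
}
```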

jnioche commented 2 years ago

Thanks. Good point! Do you want to submit a PR for it?

jnioche commented 2 years ago

@wowasa did you know you can specify http.method.head=true in the config? This results in a HEAD request, so you get only the HTTP headers you are after.
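
For reference, a minimal sketch of the two configuration keys mentioned in this thread, set programmatically on a plain map; in a real StormCrawler deployment they would typically live in the crawler configuration file rather than in code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: shows the two configuration keys discussed in this thread on a
// plain map; in an actual StormCrawler setup they would normally be set in the
// crawler configuration (e.g. a YAML file), not in Java code.
public class HeadRequestConfigSketch {
    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();

        // send a HEAD request instead of GET, so only the HTTP headers are fetched
        conf.put("http.method.head", true);

        // limit on the stored content; 0, as used above, when the body is not needed
        conf.put("http.content.limit", 0);

        conf.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```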

wowasa commented 2 years ago

@jnioche: thank you - that is what we're doing. In our project we are permanently checking URLs for availability with a HEAD request first and a GET request afterwards, only if the HEAD request doesn't return a status 200 or 304, using small modifications of the FetcherBolt class since we need to know the number of redirects. Unfortunately we have some URLs that time out on the HEAD request, which is strange, and then report a Content-Length on the GET request that exceeds the limit. The question now is whether this behaviour of the HttpProtocol class qualifies as a bug in StormCrawler - in which case I'll wait for a fix - or not, in which case I'll fix it for our project.

P.S. Not to be misunderstood: if it does qualify as a bug, I could also fork the project and propose a fix.
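
This is not the actual FetcherBolt/MetricsFetcherBolt modification used in the project, just a rough, self-contained illustration of the HEAD-first / GET-fallback check with manual redirect counting described above, using plain java.net.HttpURLConnection; all names, timeouts and the redirect cap are assumptions.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Rough sketch of the HEAD-first / GET-fallback link check with redirect counting. */
public class LinkCheckSketch {

    static final int MAX_REDIRECTS = 10;

    /** Result of a check: final status code and how many redirects were followed. */
    record Result(int status, int redirects) {}

    static Result check(String url) throws IOException {
        Result head = request(url, "HEAD");
        // 200 or 304 from HEAD is good enough; otherwise fall back to GET,
        // since some servers handle HEAD badly (timeouts, 405, ...)
        if (head.status() == 200 || head.status() == 304) {
            return head;
        }
        return request(url, "GET");
    }

    static Result request(String url, String method) throws IOException {
        int redirects = 0;
        String current = url;
        while (true) {
            HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection();
            conn.setRequestMethod(method);
            conn.setInstanceFollowRedirects(false); // count redirects ourselves
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            int status = conn.getResponseCode();
            if (status >= 300 && status < 400 && redirects < MAX_REDIRECTS) {
                String location = conn.getHeaderField("Location");
                if (location == null) {
                    return new Result(status, redirects);
                }
                current = new URL(new URL(current), location).toString();
                redirects++;
                continue;
            }
            return new Result(status, redirects);
        }
    }

    public static void main(String[] args) throws IOException {
        Result r = check("https://stormcrawler.apache.org/");
        System.out.println("status=" + r.status() + " redirects=" + r.redirects());
    }
}
```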

jnioche commented 2 years ago

thanks @wowasa, yes it is a bug. There is no point in worrying about something being too big when we know we are going to trim it anyway. A PR would be much appreciated.

wowasa commented 2 years ago

fine. Will do so next week

jnioche commented 2 years ago

@wowasa I see that you have created a branch on your fork. Could you turn that into a Pull Request so that I can merge it into our code? Thanks!

wowasa commented 2 years ago

sorry - I had just forgotten it

jnioche commented 2 years ago

merged, thanks a lot! While we are here, I gather your work is related to Clarin ERIC. Would you mind sharing how StormCrawler is used there? Could we mention Clarin on the Powered-By page?

wowasa commented 2 years ago

For the mention I have to ask my colleague first - I'll get back to you ASAP.

We're simply using StormCrawler for link checking in our project. That means we collect XML files with metadata on linguistic resources and permanently check the response status, response time and number of redirects of the included links. This is our link checking project https://github.com/clarin-eric/linkchecker where I only had to change the MetricsFetcherBolt of StormCrawler slightly to send a HEAD request first, a GET request in case of failure, and to count the number of redirects.

wowasa commented 2 years ago

We got the green light. You can mention it on your page, and we're grateful for the work you have invested in StormCrawler.

jnioche commented 2 years ago

thanks, and thanks again @wowasa