Closed: wowasa closed this issue 2 years ago (https://github.com/DigitalPebble/storm-crawler/issues/970)
In line 327 of the HttpProtocol class there is a test "entity.getContentLength() <= Constants.MAX_ARRAY_SIZE" which, in my view, should take the configured http.content.limit into account. We are getting an exception for very large files even though we have http.content.limit=0, since we only need some information from the HTTP headers.

Thanks. Good point! Do you want to submit a PR for it?
@wowasa did you know you can specify http.method.head=true in the config? This results in a HEAD request, so you get only the HTTP headers you are after.
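For reference, a minimal sketch of setting that key programmatically on a topology configuration; in a real deployment it would normally go in the crawler's YAML configuration file instead, and the class name below is illustrative (the key is the one mentioned above):

```java
import org.apache.storm.Config;

public class HeadRequestConf {
    public static Config build() {
        Config conf = new Config();
        // Ask the HTTP protocol implementation to send HEAD instead of
        // GET requests, so only the response headers are fetched.
        conf.put("http.method.head", true);
        return conf;
    }
}
```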
@jnioche: thank you, this is what we're doing. In our project we permanently check URLs for availability with a HEAD request first and a GET request afterwards only if the HEAD request does not return a status 200 or 304, with small modifications to the FetcherBolt class, since we need to know the number of redirects. Unfortunately we have some URLs that time out on the HEAD request, which is strange, and then report a Content-Length on the GET request that exceeds the limit. The question now is whether this behaviour of the HttpProtocol class qualifies as a bug in StormCrawler; if so, I'll wait for a fix, otherwise I'll fix it for our project.
P.S. To avoid any misunderstanding: if it does qualify as a bug, I could also fork the project and propose a fix myself.
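To make the strategy above concrete, here is a minimal sketch of a HEAD-first check with a GET fallback and manual redirect counting, written against the plain JDK; the class, the method names, and the redirect cap are illustrative, not the actual FetcherBolt modifications:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class LinkCheckSketch {

    // Check a URL with HEAD first; fall back to GET if the HEAD
    // request does not come back with a 200 or 304.
    static int check(String url) throws IOException {
        int status = request(url, "HEAD");
        if (status != 200 && status != 304) {
            status = request(url, "GET");
        }
        return status;
    }

    // Follow redirects manually so they can be counted; the cap of 10
    // hops is an arbitrary safeguard against redirect loops.
    static int request(String url, String method) throws IOException {
        int redirects = 0;
        URL current = new URL(url);
        while (true) {
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            conn.setInstanceFollowRedirects(false); // count hops ourselves
            conn.setRequestMethod(method);
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            if (status / 100 == 3 && location != null && redirects < 10) {
                current = new URL(current, location); // resolve relative Location
                redirects++;
                continue;
            }
            // 'redirects' now holds the number of hops for this request
            return status;
        }
    }
}
```

Disabling automatic redirect following is what makes the hop count observable in the first place.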
Thanks @wowasa, yes, it is a bug. There is no point in worrying about something being too big when we know we are going to trim it anyway. A PR would be much appreciated.
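To illustrate the point, a sketch of what the adjusted check could look like; the helper and its maxContent parameter (mirroring http.content.limit, with a negative value taken to mean "no limit") are assumptions for this sketch, not the actual patch:

```java
import java.io.IOException;

public class ContentLimitSketch {

    // Decide how many bytes of the body to buffer instead of rejecting
    // any response whose Content-Length exceeds the maximum array size.
    // maxContent mirrors http.content.limit; a negative value is taken
    // to mean "no limit" (an assumption of this sketch).
    static int bytesToBuffer(long contentLength, int maxContent) throws IOException {
        long toRead = contentLength;
        if (maxContent >= 0) {
            // The body will be trimmed to maxContent anyway, so a huge
            // Content-Length is harmless here; with http.content.limit=0
            // nothing is buffered at all.
            toRead = Math.min(toRead, maxContent);
        }
        if (toRead > Integer.MAX_VALUE - 8) { // stand-in for Constants.MAX_ARRAY_SIZE
            throw new IOException("Content-Length too large to buffer: " + contentLength);
        }
        return (int) toRead;
    }
}
```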
Fine, I will do so next week.
@wowasa I see that you have created a branch on your fork. Could you turn that into a Pull Request so that I can merge it into our code? Thanks!
Sorry, I had just forgotten about it.
Merged, thanks a lot! While we are here, I gather your work is related to Clarin ERIC. Would you mind sharing how StormCrawler is used there? Could we mention Clarin on the Powered-By page?
For the mention I have to ask my colleague first; I'll give you feedback asap.
We're simply using StormCrawler for link checking in our project. That means we collect XML files with metadata on linguistic resources and permanently check the response status, the response time, and the number of redirects of the links they contain. This is our link checking project, https://github.com/clarin-eric/linkchecker, where I only had to change StormCrawler's MetricsFetcherBolt slightly to send a HEAD request first, a GET request next in case of failure, and to count the number of redirects.
Got the green light. You can mention it on your page; in any case, we're grateful for the work you've invested in StormCrawler.
Thanks, and thanks again @wowasa!