buzzbangorg / bsbang-crawler-ng

Repository for the next generation Scrapy based Buzzbang Bioschemas crawler

Keep track of links that are unvisited due to failed response #8

Open innovationchef opened 6 years ago

innovationchef commented 6 years ago

I am still not able to completely understand how the sitemap spider is working. The spider keeps crawling down sitemap.xml until it receives a valid page response. Somewhere between the first request and the final page, Scrapy redirects once from HTTP to HTTPS, but I cannot figure out where. Ideally there should be a point where response.status reports a 301 redirect, but the process_response() in the middleware I wrote never sees it (it seems to be handled somewhere internally, so I can't log it from a middleware) and only receives the final 200 responses.

As a result, I am also unable to log the other 40x responses using process_response(). What if those responses are being handled in the backend as well? (That seems to be the only explanation.) How do I track these response statuses and log the URLs that return them? There seems to be an answer, but I am not sure how to rigorously test it.
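One likely explanation is that Scrapy's built-in RedirectMiddleware (registered at priority 600 in DOWNLOADER_MIDDLEWARES_BASE) intercepts the 301 in its own process_response and returns a fresh HTTPS Request, which short-circuits the rest of the chain, so a custom middleware whose process_response runs after it never sees the redirect. Below is a minimal sketch, assuming an illustrative class name (ResponseStatusLoggerMiddleware) and module path (bsbang.middlewares) that are not part of this repo, of a middleware placed closer to the downloader so its process_response runs first and can log the raw status:

```python
import logging

logger = logging.getLogger(__name__)


class ResponseStatusLoggerMiddleware:
    """Downloader middleware that logs every raw response status."""

    def process_response(self, request, response, spider):
        # Redirects (3xx) and client/server errors (4xx/5xx) are logged here;
        # 200s pass through quietly. The response is returned unchanged so the
        # rest of the middleware chain behaves exactly as before.
        if response.status != 200:
            logger.info("Non-200 response %s for %s", response.status, request.url)
        return response


# settings.py -- a priority above 600 means this process_response runs before
# RedirectMiddleware's, so the 301 is still visible when it is logged.
# (The module path "bsbang.middlewares" is an assumption about this project.)
DOWNLOADER_MIDDLEWARES = {
    "bsbang.middlewares.ResponseStatusLoggerMiddleware": 650,
}
```

Whether that matches what the existing middleware does depends on the priority it was registered with; if it is below 600, moving it above should be enough to make the 301s visible.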

So, how do I test this? I cannot generate a 402 response on my own (or maybe I just don't know how) to exercise the custom handlers for these responses.
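Two ways to exercise the 40x path come to mind: point the spider at an endpoint that deliberately returns the status (e.g. https://httpbin.org/status/402), or fabricate a Response with the desired status in a unit test and feed it to the handler directly. A minimal sketch of the second approach, reusing the hypothetical ResponseStatusLoggerMiddleware from above:

```python
from scrapy import Spider
from scrapy.http import Request, Response

# Hypothetical import path from the sketch above, not the crawler's real layout.
from bsbang.middlewares import ResponseStatusLoggerMiddleware


def test_middleware_sees_402():
    # Fabricate a 402 response; no network access or real 402-returning site
    # is needed. The URL and spider name are placeholders.
    request = Request("https://example.org/needs-payment")
    response = Response(url=request.url, status=402, request=request)
    spider = Spider(name="test-spider")

    middleware = ResponseStatusLoggerMiddleware()
    result = middleware.process_response(request, response, spider)

    # The middleware only logs; it should hand the 402 back unchanged.
    assert result.status == 402
```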

justinccdev commented 6 years ago

I'm okay with simply dropping failed responses and not revisiting them. It would perhaps only become an issue if a whole website failed.
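A possible middle ground, sketched here with assumed names (FailureAwareSpider, on_failure) rather than the crawler's actual spider, would be to attach an errback so that dropped URLs are at least recorded, which keeps a whole-site failure visible without revisiting anything:

```python
import logging

import scrapy

logger = logging.getLogger(__name__)


class FailureAwareSpider(scrapy.Spider):
    """Illustrative spider: failed links are dropped but recorded, not retried."""

    name = "failure-aware"  # placeholder name
    start_urls = ["https://example.org/"]  # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            # The errback fires for download errors and, via HttpError raised by
            # the default HttpErrorMiddleware, for non-2xx statuses.
            yield scrapy.Request(url, callback=self.parse, errback=self.on_failure)

    def on_failure(self, failure):
        # The failed request is attached differently depending on the error, so
        # fall back from failure.request to the response carried by an HttpError.
        request = getattr(failure, "request", None)
        url = request.url if request is not None else failure.value.response.url
        logger.warning("Unvisited due to failed response: %s", url)

    def parse(self, response):
        # Normal parsing would go here.
        pass
```

Requests generated later in the crawl (e.g. the ones the sitemap spider schedules from sitemap.xml) would need the same errback attached wherever they are created for this bookkeeping to cover them.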