Open BrendanIAB opened 7 years ago
Per the spec, these changes make sense. In fact, based on my reading of the spec, only a single 301
, 302
, or 307
redirect should be allowed: a scheme change from http
to https
. Everything else must remain the same, otherwise you risk an invalid redirect. I would argue that allowing for redirecting domain.com
to www.domain.com
risks running afoul of section 5.5. If www.domain.com
is indeed the canonical domain for the site being crawled, then www.domain.com
will be the domain that is driving significant requests for advertising
, so it should be the domain that the crawler is set to perform its request to.
In my data set, we see that "www." is routinely stripped (for the domain presented in the domain field), so this change would, while ensuring greater literal spec compliance, result in a poorer practical outcome. Perhaps a quick revision to the spec to also permit "domain.com" -> "www.domain.com" redirects is the cleanest solution.
I suggest we only allow two styles of redirect (made this same comment in the slack channel):
The number of redirects allowed should probably be limited as well.
We implemented the redirect limitation using requests.session() and it worked great. Need to consider adding timeout value as well and validation of page type (i.e. HTML error page with response code 200) , I encountered several sites that return huge HTML files that cause an exception.
Test case results in 4-5 redirects due to paywall. This misconfiguration should be handle elegantly, with a clear max on redirects, rather than failing to parse paywall text.