niels closed this issue 8 years ago
That's the most concise use case I have seen so far. :-)
It seems to appear only when fetching robots.txt. Will investigate.
After some research, I found that the problem is the server not properly encoding redirect URLs. The best summary I found is here: http://stackoverflow.com/a/7654605/3974380
RFC 2616 specifies that the Location header should contain a URI as defined by RFC 1630, which requires a URI be 7-bit clean ASCII with any special characters URL encoded.
In other words, the server is delivering the URI incorrectly and should be escaping it.
After looking at the "Location:" value in the HTTP headers that come back, I can confirm the redirect URL is not encoded properly. You should contact the site owner about this.
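For illustration, here is a minimal Java sketch of what a properly escaped Location value would look like for the URL discussed in this thread. It uses only the standard `java.net.URI` class; the class and method names (`LocationEncodingSketch`, `toAsciiUrl`) are made up for this example and are not part of the collector.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class LocationEncodingSketch {

    // Builds a URI from its parts and returns it as 7-bit ASCII.
    // URI.toASCIIString() percent-encodes non-ASCII characters as UTF-8 escapes.
    static String toAsciiUrl(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI: " + path, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: the redirect target as the site author intended it.
        System.out.println(toAsciiUrl("http", "www.mascus.com",
                "/agriculture/used-other-tractor-accessories/"
                + "other-гідравліка-спецтехніка/5pen7jcp.html"));
    }
}
```

A server that emitted the escaped form in its Location header would satisfy the 7-bit ASCII requirement quoted above.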
I am not sure how a workaround could be implemented other than forcing the HTTP "Location" header to be read using a specific charset, or trying to auto-detect it. It could be a risky proposition given most sites probably respect the standard. In this specific case, I can read the URL properly if I force it to use ISO-8859-1 (UTF-8 does not work).
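To show why the charset used to read a header matters, here is a small Java sketch. It assumes, purely for illustration, that the raw header bytes are UTF-8 (the thread does not state the actual byte encoding). ISO-8859-1 maps every byte to exactly one character, so a value read that way can be converted back to its original bytes and reinterpreted, whereas US-ASCII decoding discards bytes above 0x7F. The names `HeaderCharsetSketch` and `redecode` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HeaderCharsetSketch {

    // Takes a value that was decoded with `readAs` and reinterprets
    // its underlying bytes using the `actual` charset.
    static String redecode(String value, Charset readAs, Charset actual) {
        return new String(value.getBytes(readAs), actual);
    }

    public static void main(String[] args) {
        // Assumed wire bytes: a path fragment encoded as UTF-8.
        byte[] wire = "/other-гідравліка/".getBytes(StandardCharsets.UTF_8);

        // Lossless: one character per byte, so the bytes can be recovered later.
        String asLatin1 = new String(wire, StandardCharsets.ISO_8859_1);
        // Lossy: bytes >= 0x80 are replaced with U+FFFD.
        String asAscii = new String(wire, StandardCharsets.US_ASCII);

        System.out.println(redecode(asLatin1,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
        System.out.println(asAscii.contains("\uFFFD"));
    }
}
```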
Thanks for the investigation; very interesting.
I agree that non-ASCII characters shouldn't be present in HTTP headers (unless they are properly encoded / escaped). In practice however, the standard unfortunately seems to be violated quite frequently – as is so often the case on the web. The site referenced here is obviously a major case in point, but this Google search suggests that globally this isn't as rare a problem as one might hope. I also unearthed many bug reports for both server- and client-side software components that lamented seeming mishandling of non-ASCII redirects, further confirming that the problem is encountered fairly often.
For compatibility reasons, browsers seem to be more relaxed than the RFCs would demand. At least Firefox and Chrome seem to follow the redirect "correctly" (meaning: as the site author intended). E.g. if I go to http://www.mascus.com/agriculture/used-other-tractor-accessories/other/5pen7jcp.html in either browser, I get redirected to http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гідравліка-спецтехніка/5pen7jcp.html even though the Location header is not properly encoded.
For Firefox, https://bugzilla.mozilla.org/show_bug.cgi?id=1142083 details the fix while https://bugzilla.mozilla.org/show_bug.cgi?id=439616 gives some more information about the use-case.
Generally speaking, I would prefer for a crawler to behave as similarly to real-world browsers as possible. This is because site authors generally target the latter and not the former. If I can access a site with my web browser, I would expect the crawler to be able to access that same page (and parse it in the same manner).
At the same time however, development resources here are of course much more limited than for the major browsers. Thus we cannot come up with an implementation that will work as "expected" in all cases. From a philosophical standpoint as well, I would normally be opposed to programming special / edge cases into general-purpose software such as this crawler.
Nevertheless, choking on – what appears to be – a somewhat common encoding of redirects seems to be a not insignificant flaw. Thus, I would like to propose the following implementation which I think strikes a good balance between compatibility and complexity:
encoding in the Content-Type header?
An interesting alternative to (1.ii.b) would be to fall back to a per-crawler default (if configured) instead. This is a feature that you have suggested in https://github.com/Norconex/collector-http/issues/194#issuecomment-162599045 and which I would find very useful.
This logic could be applied to all HTTP headers, not just Location.
While the logic sounds simple, I can't estimate the implementation effort as I am not yet sufficiently familiar with the codebase. Please feel free to close as WONTFIX if it would be a major hassle.
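As a rough idea of the shape such a fallback could take, here is a minimal Java sketch. It is a guess at the general approach, not the collector's implementation: it looks for a charset in a Content-Type value and otherwise returns a configured default. The names `RedirectCharsetChooser`, `choose` and `isAscii` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class RedirectCharsetChooser {

    // True if the value needs no special charset handling at all.
    static boolean isAscii(String value) {
        return value.chars().allMatch(c -> c < 0x80);
    }

    // Returns the charset named in the Content-Type value if there is a
    // usable one, otherwise the supplied fallback (e.g. a per-crawler default).
    static Charset choose(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length())
                            .trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name
                    }
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(choose("text/html; charset=UTF-8",
                StandardCharsets.ISO_8859_1));
        System.out.println(choose("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

A caller would only need `choose` when `isAscii` returns false for a header value, so standards-compliant redirects would be left untouched.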
Thanks for your research and suggestions! I agree that standards are often not respected. What is important is that we cover the standards first, but let's not limit ourselves to that and let's try to support what's in the real world. What you are proposing makes a lot of sense and I now plan to implement that (or something very similar).
I have added a new configuration option in the latest snapshot. There is now a new redirectURLProvider tag which allows custom implementations. The default implementation is GenericRedirectURLProvider and applies the logic you proposed, slightly modified. Please try the following, which should solve your case:
<redirectURLProvider
class="com.norconex.collector.http.redirect.impl.GenericRedirectURLProvider"
fallbackCharset="ISO-8859-1" />
This is perfect! The latest snapshot follows all redirects "correctly" when an appropriate fallbackCharset has been set.
Thanks a lot for your diligence on this.
Given a redirect from http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html to http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гідравліка-спецтехніка/5pen7jcp.html, the collector somehow chokes on the Cyrillic characters in the (new) target URL:
Redirect:
Test-Case Config
Result
Note that the crawler detects the redirect as http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гÑдÑавлÑка-ÑпеÑÑеÑнÑка/5pen7jcp.html when it should be http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гідравліка-спецтехніка/5pen7jcp.html and then further tries to access the robots.txt at http://www.mascus.comнÑка/5pen7jcp.html/robots.txt which is an invalid hostname resulting in an exception.