I've received a complaint from the site admin of ywam.org that my crawler does not follow its robots.txt, which disallows all crawling.
https://update.ywam.org/robots.txt
It contains this:

<U+FEFF>User-agent: *
Disallow: /
The first character is actually a three-byte sequence representing the dreaded UTF-8 BOM (https://en.wikipedia.org/wiki/Byte_order_mark).
Since the robots.txt parser reads the file as ISO-8859-1, the BOM bytes are decoded as spurious characters stuck to the first directive, so the parser does not recognise the format.
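For illustration only (this is not BUbiNG code), a small self-contained sketch of the failure mode: the BOM character U+FEFF encodes to the bytes EF BB BF in UTF-8, and decoding those bytes as ISO-8859-1 leaves "ï»¿" in front of the first line, so it no longer starts with "User-agent".

import java.nio.charset.StandardCharsets;

public class BomDecodingDemo {
	public static void main(String[] args) {
		// U+FEFF encodes to the three bytes EF BB BF in UTF-8.
		byte[] body = "\uFEFFUser-agent: *\nDisallow: /\n".getBytes(StandardCharsets.UTF_8);

		// Decoding those bytes as ISO-8859-1 keeps them as three spurious characters.
		String asLatin1 = new String(body, StandardCharsets.ISO_8859_1);
		String firstLine = asLatin1.substring(0, asLatin1.indexOf('\n'));

		System.out.println(firstLine);                          // prints: ï»¿User-agent: *
		System.out.println(firstLine.startsWith("User-agent")); // prints: false
	}
}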
This test can be added to check this behaviour:

@Test
public void testDisallowEverytingWithUTFBOM() throws Exception {
	proxy = new SimpleFixedHttpProxy();
	URI robotsURL = URI.create("http://foo.bar/robots.txt");
	proxy.add200(robotsURL, "",
		"\ufeffUser-agent: *\n" +
		"Disallow: /\n"
	);

	final URI disallowedUri1 = URI.create("http://foo.bar/goo/zoo.html"); // Disallowed
	final URI disallowedUri2 = URI.create("http://foo.bar/gaa.html"); // Disallowed
	final URI disallowedUri3 = URI.create("http://foo.bar/"); // Disallowed

	proxy.start();

	HttpClient httpClient = FetchDataTest.getHttpClient(new HttpHost("localhost", proxy.port()), false);
	FetchData fetchData = new FetchData(Helpers.getTestConfiguration(this));
	fetchData.fetch(robotsURL, httpClient, null, null, true);
	char[][] filter = URLRespectsRobots.parseRobotsResponse(fetchData, "any");
	assertFalse(URLRespectsRobots.apply(filter, disallowedUri1));
	assertFalse(URLRespectsRobots.apply(filter, disallowedUri2));
	assertFalse(URLRespectsRobots.apply(filter, disallowedUri3));
}
Google's robots.txt documentation says the file must be UTF-8 encoded and that a leading BOM is ignored: https://developers.google.com/search/reference/robots_txt
Fixing this may not be as easy as changing the reader's encoding; the tokenizer must be modified too.
Thanks. It is fixed here: https://github.com/LAW-Unimi/BUbiNG/commit/84fed26099dd487dccd7b0017da1df9495111eb7
We now read (but ignore) the BOM, if present, and parse robots.txt as UTF-8, as we should.
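For reference, a minimal sketch of that approach, assuming a plain BufferedReader over the response body; the helper and its name are hypothetical, not the code from the commit:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the actual BUbiNG code.
final class RobotsReaderSketch {
	// Decode the robots.txt body as UTF-8 and silently skip a leading BOM, if present.
	static BufferedReader robotsReader(InputStream body) throws IOException {
		BufferedReader reader = new BufferedReader(new InputStreamReader(body, StandardCharsets.UTF_8));
		reader.mark(1);                                // remember the position before the first character
		if (reader.read() != '\uFEFF') reader.reset(); // first char is not a BOM: put it back
		return reader;                                 // a leading BOM is simply consumed and ignored
	}
}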