LAW-Unimi / BUbiNG

The LAW next generation crawler.
http://law.di.unimi.it/software.php#bubing
Apache License 2.0

robots.txt parsed as ISO-8859-1 - breaks when there's a UTF-8 BOM #17

Closed: guillaumepitel closed this issue 6 years ago

guillaumepitel commented 6 years ago

I've received a complaint from the site admin of ywam.org that my crawler does not respect its robots.txt, which disallows all crawling.

https://update.ywam.org/robots.txt

It contains this:

<U+FEFF>User-agent: *
Disallow: /

The first character is actually a three-byte sequence representing the dreaded UTF-8 BOM: https://en.wikipedia.org/wiki/Byte_order_mark

Since the robots.txt parser uses the ISO-8859-1 encoding, it decodes the BOM as three stray characters glued to the front of "User-agent", so it fails to recognise the directive and effectively ignores the file.
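To make the failure mode concrete, here is a small standalone illustration (not BUbiNG code) of how the same three bytes decode under the two charsets:

import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // The first line of the robots.txt, encoded as UTF-8 with a leading BOM (bytes EF BB BF).
        byte[] line = "\ufeffUser-agent: *".getBytes(StandardCharsets.UTF_8);

        // Decoded as ISO-8859-1, the BOM turns into three stray characters glued to the
        // field name, so a tokenizer looking for "User-agent" at the start of the line fails.
        String latin1 = new String(line, StandardCharsets.ISO_8859_1);
        System.out.println(latin1);                          // ï»¿User-agent: *
        System.out.println(latin1.startsWith("User-agent")); // false

        // Decoded as UTF-8, the BOM is a single U+FEFF character that is easy to skip.
        String utf8 = new String(line, StandardCharsets.UTF_8);
        System.out.println(utf8.charAt(0) == '\ufeff');      // true
        System.out.println(utf8.substring(1));               // User-agent: *
    }
}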

This test can be added to check this behaviour:

@Test
public void testDisallowEverythingWithUTFBOM() throws Exception {
    proxy = new SimpleFixedHttpProxy();
    URI robotsURL = URI.create("http://foo.bar/robots.txt");
    // robots.txt body starting with a UTF-8 BOM (U+FEFF) and disallowing everything.
    proxy.add200(robotsURL, "",
            "\ufeffUser-agent: *\n" +
            "Disallow: /\n"
    );
    final URI disallowedUri1 = URI.create("http://foo.bar/goo/zoo.html"); // Disallowed
    final URI disallowedUri2 = URI.create("http://foo.bar/gaa.html"); // Disallowed
    final URI disallowedUri3 = URI.create("http://foo.bar/"); // Disallowed
    proxy.start();

    HttpClient httpClient = FetchDataTest.getHttpClient(new HttpHost("localhost", proxy.port()), false);

    FetchData fetchData = new FetchData(Helpers.getTestConfiguration(this));
    fetchData.fetch(robotsURL, httpClient, null, null, true);
    char[][] filter = URLRespectsRobots.parseRobotsResponse(fetchData, "any");
    assertFalse(URLRespectsRobots.apply(filter, disallowedUri1));
    assertFalse(URLRespectsRobots.apply(filter, disallowedUri2));
    assertFalse(URLRespectsRobots.apply(filter, disallowedUri3));
}

Google says robots.txt must be in UTF-8 and that they ignore BOMs: https://developers.google.com/search/reference/robots_txt

Fixing this may not be as easy as changing the reader's encoding; the tokenizer must be modified too.
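One possible approach (a minimal sketch under my own assumptions, not necessarily how BUbiNG's tokenizer is structured; the helper name is made up) is to decode the body as UTF-8 and consume a leading U+FEFF before the line-based tokenizer ever sees it:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class RobotsBomSkipping {
    /** Wraps the robots.txt body in a UTF-8 reader, consuming a leading BOM if present. */
    static BufferedReader utf8ReaderSkippingBom(final InputStream body) throws IOException {
        final BufferedReader reader = new BufferedReader(
                new InputStreamReader(body, StandardCharsets.UTF_8));
        reader.mark(1);                                 // remember the position before the first character
        if (reader.read() != '\ufeff') reader.reset();  // not a BOM: push the character back
        return reader;                                  // the tokenizer now sees "User-agent: *" first
    }
}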

boldip commented 6 years ago

Thanks. It is fixed here: https://github.com/LAW-Unimi/BUbiNG/commit/84fed26099dd487dccd7b0017da1df9495111eb7. We now read (but ignore) the BOM, if present, and parse robots.txt as UTF-8, as we should.