Folyd / robotstxt

A native Rust port of Google's robots.txt parser and matcher C++ library.
https://crates.io/crates/robotstxt
Apache License 2.0

crashes for deviantart robots.txt #1

Closed — iyzana closed this issue 3 years ago

iyzana commented 3 years ago

When parsing https://www.deviantart.com/robots.txt

User-agent: *
Disallow: /*q=
Disallow: /users/*?
Disallow: /join/*?
Disallow: /morelikethis/
Disallow: /download/
Disallow: /checkout/
Disallow: /global/
Disallow: /api/
Disallow: /critiques/

Sitemap: http://sitemaps.deviantart.net/sitemap-index.xml.gz

the parser fails with

thread 'main' panicked at 'assertion failed: !val.is_empty()', /home/me/.local/share/cargo/registry/src/github.com-1ecc6299db9ec823/robotstxt-0.2.0/src/parser.rs:207:17

Reproduction:

use robotstxt::DefaultMatcher;

fn main() {
    let robots_content = r#"User-agent: *
Disallow: /*q=
Disallow: /users/*?
Disallow: /join/*?
Disallow: /morelikethis/
Disallow: /download/
Disallow: /checkout/
Disallow: /global/
Disallow: /api/
Disallow: /critiques/

Sitemap: http://sitemaps.deviantart.net/sitemap-index.xml.gz"#;
    let mut matcher = DefaultMatcher::default();
    matcher.one_agent_allowed_by_robots(&robots_content, "oldnews", "https://www.deviantart.com/");
}

I'm assuming it is caused by the line between the Disallow rules and the Sitemap line, which contains only a single space.
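Until the crate is patched, one possible workaround is to normalize the input before parsing, trimming trailing whitespace from each line so that the whitespace-only separator line becomes genuinely empty. A minimal sketch using only the standard library (`normalize_robots` is a hypothetical helper, not part of the robotstxt crate):

```rust
/// Trim trailing whitespace from every line so that lines containing only
/// spaces become empty lines. (Hypothetical helper; not part of robotstxt.)
fn normalize_robots(content: &str) -> String {
    content
        .lines()
        .map(str::trim_end)
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    // Note the line containing a single space, as in the deviantart file.
    let raw = "Disallow: /api/\n \nSitemap: http://example.com/sitemap.xml";
    let cleaned = normalize_robots(raw);
    // The whitespace-only line is now truly empty.
    assert_eq!(cleaned, "Disallow: /api/\n\nSitemap: http://example.com/sitemap.xml");
    println!("{}", cleaned);
}
```

The cleaned string can then be passed to `one_agent_allowed_by_robots` in place of the raw content.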

iyzana commented 3 years ago

It seems someone else patched this bug on their fork: https://github.com/scascketta/robotstxt/commit/ffe972d507a0a30a21b0b164329f9a1fff73ce85

Folyd commented 3 years ago

Hi @iyzana. Thanks a lot for your feedback. 👍 This has been fixed via 67475a1.