amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

Spaces in a comma separated list of names in a User-agent: line cause rules to be applicable to all agents #53

Closed by GoogleCodeExporter 8 years ago

GoogleCodeExporter commented 8 years ago
To reproduce the problem, create a robots.txt file that contains the following 
lines:

---
User-agent: One, Two, Three
Disallow: /

User-agent: *
Allow: /

---

If I understand things correctly, the above rules should forbid access only to 
agents 'One', 'Two' and 'Three'. The bug I'm referring to causes the disallow 
to apply to any agent, because of the way spaces are handled.

The problem is in:
Class: SimpleRobotRulesParser
Method: private boolean handleUserAgent(ParseState state, RobotToken token) 
Line: String[] agentNames = token.getData().split("[ \t,]");

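Because the regex matches one delimiter character at a time, the above line 
splits the string "One, Two, Three" into the following tokens, including an 
empty string between each comma and the space after it: 
{ "One", "", "Two", "", "Three" }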

A few lines further down, the following condition is checked: 
if (targetName.startsWith(agentName)) { ... }

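For the second token (the empty string) this becomes equivalent to:
if (targetName.startsWith("")) { ... }
This always evaluates to true, because every string starts with the empty 
string, so the Disallow rule is matched for every agent.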
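
For reference, the behavior described above can be reproduced outside the parser with plain JDK calls. This is a minimal sketch, not code from crawler-commons: the class name SplitDemo, the helper splitAgents and the agent name "anybot" are made up for illustration. Only the regex is taken from the line quoted above.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of crawler-commons.
class SplitDemo {
    // Uses the same regex as the quoted line: each space, tab or comma is its own delimiter.
    static String[] splitAgents(String data) {
        return data.split("[ \t,]");
    }

    public static void main(String[] args) {
        String[] tokens = splitAgents("One, Two, Three");

        // The comma and the space that follows it are two separate delimiters,
        // so an empty token appears between them.
        System.out.println(Arrays.toString(tokens)); // prints [One, , Two, , Three]

        // Every string starts with the empty string, so an arbitrary agent matches.
        System.out.println("anybot".startsWith(tokens[1])); // prints true
    }
}
```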

Readily implementable solutions for fixing the problem:
- Change splitting to split by comma and trim whitespace characters from the 
resulting tokens
- Keep splitting as it is, but check for empty tokens
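
Both options could look roughly like the sketch below. This only illustrates the two suggestions; it is not the change that was actually committed, and the class and method names (AgentNameSplitter, splitOnCommaAndTrim, splitAndSkipEmpty) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of crawler-commons.
class AgentNameSplitter {
    // Option 1: split on commas only and trim surrounding whitespace.
    // Blank pieces (e.g. from "One,,Two") are dropped as well.
    static List<String> splitOnCommaAndTrim(String data) {
        List<String> names = new ArrayList<>();
        for (String piece : data.split(",")) {
            String name = piece.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Option 2: keep the original regex, but skip the empty tokens it produces.
    static List<String> splitAndSkipEmpty(String data) {
        List<String> names = new ArrayList<>();
        for (String token : data.split("[ \t,]")) {
            if (!token.isEmpty()) {
                names.add(token);
            }
        }
        return names;
    }
}
```

For "One, Two, Three" both methods return the three names without empty entries. They differ on names that contain spaces: option 1 keeps "My Bot" as a single name, while option 2 splits it into "My" and "Bot", which matches how the existing regex treats spaces.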

If assistance is needed I could provide a patch.

Best,

Alexandros

Original issue reported on code.google.com by paramyt...@gmail.com on 4 Oct 2014 at 2:17

GoogleCodeExporter commented 8 years ago
Thanks for the detailed report!

Fixed as of rev 138

Original comment by kkrugler...@transpac.com on 4 Oct 2014 at 4:32

GoogleCodeExporter commented 8 years ago
Thanks for taking care of this so fast. Any idea roughly when the next release 
is due? You may also want to give the Nutch devs a heads-up, since they would 
need to update to that release once it is available (I traced the problem back 
to crawler-commons after noticing it during a Nutch crawl).

Original comment by paramyt...@gmail.com on 4 Oct 2014 at 5:42

GoogleCodeExporter commented 8 years ago
Lewis had threatened to push out the 0.5 release recently, but I'm guessing 
that didn't happen.

He's a committer on Nutch as well, so that should be a sufficient heads-up :)

-- Ken

Original comment by kkrugler...@transpac.com on 4 Oct 2014 at 8:51

GoogleCodeExporter commented 8 years ago
Perfect. Thanks again!

Original comment by paramyt...@gmail.com on 5 Oct 2014 at 7:40

GoogleCodeExporter commented 8 years ago
Just FYI, this fix went out with the 0.5 release of crawler-commons, even 
though that wasn't called out explicitly in the CHANGES.txt file.

Original comment by jennakru...@gmail.com on 16 Oct 2014 at 1:41