ankurjain0985 / crawler4j

Automatically exported from code.google.com/p/crawler4j

Crawling over disallowed paths from robots.txt #334

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Configure the crawler to only visit pages with the prefix http://fano.ics.uci.edu/
2. Have robotstxtConfig enabled
3. Crawl from the seed http://fano.ics.uci.edu/ (see the setup sketch below)
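
A minimal reproduction along these lines might look like the following sketch, assuming the crawler4j 3.x API; the storage folder and the FanoCrawler/Issue334Repro class names are made up for illustration:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class Issue334Repro {

    // Crawler restricted to the fano.ics.uci.edu prefix (hypothetical class name).
    public static class FanoCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(WebURL url) {
            return url.getURL().toLowerCase().startsWith("http://fano.ics.uci.edu/");
        }

        @Override
        public void visit(Page page) {
            System.out.println("URL: " + page.getWebURL().getURL());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-334"); // illustrative path

        PageFetcher pageFetcher = new PageFetcher(config);

        // Robots.txt handling enabled (this is the default; shown explicitly here).
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setEnabled(true);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://fano.ics.uci.edu/");
        controller.start(FanoCrawler.class, 1);
    }
}
```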

What is the expected output? What do you see instead?
The robots.txt for fano.ics.uci.edu contains:

# fano.ics.uci.edu

User-Agent: *
Disallow: /ca/rules/

so nothing under /ca/rules/ should be crawled.

Instead, these URLs are being crawled (see the check sketched after the list):
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g1.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g2.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g3.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g4.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g5.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g6.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g7.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g8.html
URL: http://fano.ics.uci.edu/ca/rules/b3s23/g9.html
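
For reference, the disallow rule can be checked directly against crawler4j's RobotstxtServer. This is a rough sketch assuming the 3.x API and a hypothetical RobotsCheck class; with the robots.txt above, allows() would be expected to return false for these URLs:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

        WebURL url = new WebURL();
        url.setURL("http://fano.ics.uci.edu/ca/rules/b3s23/g1.html");

        // Expected: false, because /ca/rules/ is disallowed for User-Agent: *.
        // On the affected 3.5 build this reportedly comes back true.
        System.out.println("allowed = " + robots.allows(url));
    }
}
```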

What version of the product are you using?

3.5

Please provide any additional information below.

Original issue reported on code.google.com by Dave.Hir...@gmail.com on 21 Jan 2015 at 9:06

GoogleCodeExporter commented 9 years ago
That was a bug - good catch!

Fixed in Revision: 4b25e33f2561

Original comment by avrah...@gmail.com on 22 Jan 2015 at 3:02