amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

Support Googlebot-compatible regular expressions in URL specifications #17

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
As an example, see http://www.scottish.parliament.uk/robots.txt

User-agent: *
Disallow: /*.htm$

See http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 for
details on how Google handles wildcards. It's unclear to me whether anything
other than '*' and '$' is treated specially.
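
For reference, a minimal sketch (not crawler-commons code) of how a rule like
/*.htm$ could be translated into a java.util.regex Pattern, treating '*' as any
sequence of characters and '$' as an end-of-URL anchor; the class and method
names here are purely illustrative:

import java.util.regex.Pattern;

// Illustrative only: convert a Googlebot-style robots.txt path pattern
// ('*' = any character sequence, '$' = end-of-URL anchor) into a Java regex.
public class WildcardRule {

    static Pattern toRegex(String rulePath) {
        boolean endAnchor = rulePath.endsWith("$");
        String body = endAnchor ? rulePath.substring(0, rulePath.length() - 1) : rulePath;

        StringBuilder regex = new StringBuilder();
        for (String literal : body.split("\\*", -1)) {
            regex.append(Pattern.quote(literal));
            regex.append(".*");
        }
        // Drop the ".*" appended after the last literal chunk.
        regex.setLength(regex.length() - 2);

        if (!endAnchor) {
            // Without '$' the rule is an ordinary prefix match.
            regex.append(".*");
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern p = toRegex("/*.htm$");
        System.out.println(p.matcher("/business/research/factsheets.htm").matches()); // true
        System.out.println(p.matcher("/visitandlearn/index.html").matches());         // false
    }
}

With the '$' anchor the rule only blocks paths ending in .htm; without it, the
same rule would block any path containing .htm, since robots.txt rules are
prefix matches by default.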

I'll file a separate issue re matching against query parameters, which might 
not be supported currently.

Original issue reported on code.google.com by kkrugler...@transpac.com on 17 Mar 2013 at 6:20

GoogleCodeExporter commented 8 years ago
See also 
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

Original comment by kkrugler...@transpac.com on 17 Mar 2013 at 6:42

GoogleCodeExporter commented 8 years ago
Hi Ken,
Is there anything specific we would like to know about the use of wildcard/regex
URL specifications in robots.txt from a webmaster's POV?
I am in touch with the IT team at the Scottish Parliament, and it would be as
good an opportunity as any to get more info from them should we need it.

Original comment by lewis.mc...@gmail.com on 18 Mar 2013 at 9:43

GoogleCodeExporter commented 8 years ago
Hi Lewis - nothing comes to mind directly, though it might be interesting to 
know why they want to disallow all *.htm pages.

Normally that's what you'd want to crawl, and you'd use a regex to exclude 
other file types.

Original comment by kkrugler...@transpac.com on 18 Mar 2013 at 10:50

GoogleCodeExporter commented 8 years ago
Rolled in patch from alparslanavci (r113 and r114)

Original comment by kkrugler...@transpac.com on 13 Mar 2014 at 11:52
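
A rough usage sketch showing how the wildcard rule from the original report can
be exercised once the patch is in, assuming the SimpleRobotRulesParser /
BaseRobotRules API of that era (parseContent taking the robots.txt URL, content
bytes, content type, and robot name; check the current javadoc before relying
on these signatures):

import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class WildcardCheck {
    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\n" +
                           "Disallow: /*.htm$\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://www.scottish.parliament.uk/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "mycrawler");

        // Expected after the patch: .htm URLs are disallowed, .html URLs are not,
        // because '$' anchors the rule at the end of the URL.
        System.out.println(rules.isAllowed("http://www.scottish.parliament.uk/help.htm"));  // false
        System.out.println(rules.isAllowed("http://www.scottish.parliament.uk/help.html")); // true
    }
}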