bopoda / robots-txt-parser

PHP class for parsing all directives from robots.txt files according to specifications
http://robots.jeka.by
MIT License

Remove asterisk (*) at the end of path. #15

Open LeMoussel opened 7 years ago

LeMoussel commented 7 years ago

Search engines allow an asterisk (*) to match any sequence of characters, and a dollar sign ($) to match the end of the URL. So, to block spiders from downloading any JPEG image files, one might use:

User-agent: *
Disallow: /*.jpg$

Indeed, blocking spidering of certain file types is the most popular use for wildcards. Most people who are using wildcards for anything else are doing so entirely unnecessarily. For example, a lot of sites have the following rule:

Disallow: /Example/*

The use of the non-standard wildcard above is useless, as this rule is equivalent to:

Disallow: /Example/

This is because rules are by default partial paths, and will match any path beginning with that string.
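The prefix-matching behaviour described above can be sketched in a few lines. This is a minimal illustration in Python, not the PHP code of robots-txt-parser; the helper names `rule_to_regex` and `is_blocked` are hypothetical, and the translation only assumes the wildcard semantics stated in this issue (`*` matches any sequence, a trailing `$` anchors the end of the URL).

```python
import re


def rule_to_regex(path: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex (illustrative only)."""
    # A trailing '$' means "the URL must end here"; otherwise the rule is a prefix.
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    # Escape every character, then turn each escaped '*' back into '.*'.
    body = re.escape(path).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(rule_path: str, url_path: str) -> bool:
    """Return True when url_path is matched by the Disallow rule path."""
    return rule_to_regex(rule_path).match(url_path) is not None


if __name__ == "__main__":
    for rule in ("/Example/*", "/Example/"):
        print(rule, is_blocked(rule, "/Example/page.html"))
```

Under these assumptions both `/Example/*` and `/Example/` block `/Example/page.html`, because an unanchored rule already matches everything that follows the prefix.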

The requested feature is to remove an asterisk (*) at the end of a path and log a message to flag the redundant wildcard.
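A rough sketch of what such a normalization step might look like, written in Python for illustration rather than as the library's PHP implementation. The function name `strip_trailing_wildcard` and the logger name are hypothetical, and the guard against producing an empty path is an assumption added here (an empty Disallow value has a different meaning than a wildcard).

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("robots_txt_sketch")


def strip_trailing_wildcard(path: str) -> str:
    """Drop redundant trailing '*' characters from a rule path and log it."""
    stripped = path.rstrip("*")
    # Assumption: never reduce a rule to an empty path, since an empty
    # Disallow value would change the meaning of the rule.
    if stripped == path or stripped == "":
        return path
    logger.warning(
        "Redundant trailing '*' in rule path %r; treating it as %r",
        path,
        stripped,
    )
    return stripped


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(strip_trailing_wildcard("/Example/*"))
```

Paths that end in `$`, such as `/*.jpg$`, are left untouched because the asterisk there is not at the end of the path.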

bopoda commented 7 years ago

At the moment

Disallow: /Example/*

and

Disallow: /Example/

will be transformed into the regexps

/^\/Example\/.*/

and

/^\/Example\//

which work equally in the end, because .* simply matches the rest of the string.
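The equivalence claimed above can be checked quickly. The sketch below uses Python's `re` module instead of PHP's PCRE functions, so the surrounding `/` delimiters are dropped; the sample paths and the helper name `same_verdict` are made up for illustration.

```python
import re

# The two patterns quoted above, without the PCRE '/' delimiters.
WITH_WILDCARD = re.compile(r"^\/Example\/.*")
WITHOUT_WILDCARD = re.compile(r"^\/Example\/")


def same_verdict(path: str) -> bool:
    """True when both patterns agree on whether the path matches."""
    return bool(WITH_WILDCARD.search(path)) == bool(WITHOUT_WILDCARD.search(path))


if __name__ == "__main__":
    samples = ["/Example/", "/Example/a/b.html", "/Example", "/Other/", ""]
    for path in samples:
        print(repr(path), same_verdict(path))
```

The matched text differs (the first pattern consumes the rest of the path), but the yes/no answer a validator cares about is the same for every input, since `.*` can also match the empty string.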

So:

  1. Removing the asterisk (*) at the end of the path is possible, but it has no effect on RobotsTxtValidator.
  2. We can log a message about the redundant asterisk at the end of the path (for user information only).

LeMoussel commented 7 years ago

Disallow: /Example/* and Disallow: /Example/ are equivalent rules. I am not a regexp specialist, but does /^\/Example\/.*/ behave differently from /^\/Example\//?

For clarity, I think it would be preferable to remove the asterisk (*) at the end of the path, so that both rules produce the same regexp.

It's a good idea to log a message about the redundant asterisk at the end of the path (for user information only).

bopoda commented 7 years ago

^/Example
^/Example.*

Both of these regexps behave identically. We can remove the asterisk, but it will have no effect on matching. We can log a message for the user's information. See also https://github.com/bopoda/robots-txt-parser/issues/9.