amgciadev opened this issue 1 year ago:
We are working on improving Google indexing and have noticed that pages of the form /items/*/request-a-copy are being indexed by Google. Would it make sense to further enhance robots.txt with the optional statement:
Disallow: /items/*/request-a-copy
@amgciadev : If you've found that Disallow
is working to remove those pages from Google indexing, then I agree it might be useful to add to the default robots.txt.
Feel free to send us a PR and we can get it reviewed. Thanks!
Wildcards are not part of the original robots.txt specification. Trailing wildcards are usually honored (e.g. xxx/*), but wildcards in the middle of a path are not (e.g. xxx/*/yyy). Given that, and the fact that many harvesters do not respect robots.txt at all, I would apply these rules (or similar ones) in Apache.
Edit /etc/apache2/sites-enabled/default-ssl.conf and add, inside the <VirtualHost> block:
RewriteEngine On
# Return 403 to bot-like user agents requesting search/browse/statistics/request-a-copy pages,
# but never block bitstream downloads
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|robot) [NC]
RewriteCond %{REQUEST_URI} !bitstream
RewriteRule ^(.*)(search-filter|browse|discover|statistics|recent-submissions|request-a-copy)(.*) - [F,L]
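A quick way to sanity-check these rules (the hostname and paths below are placeholders, not taken from this thread): a request with a bot-like User-Agent to a matching page should return 403, while a URL containing "bitstream" is skipped by the second RewriteCond (it may still 404 if the path does not exist, but it will not be blocked).
$ curl -s -o /dev/null -w '%{http_code}\n' -A 'Googlebot' https://repository.example.edu/browse
$ curl -s -o /dev/null -w '%{http_code}\n' -A 'Googlebot' https://repository.example.edu/bitstreams/some-uuid/download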
It can also be done in Tomcat using the UrlRewriteFilter library (http://www.tuckey.org/urlrewrite/).
Add to pom.xml:
<dependency>
  <groupId>org.tuckey</groupId>
  <artifactId>urlrewritefilter</artifactId>
  <version>4.0.3</version>
</dependency>
Add to web.xml:
<filter>
  <filter-name>UrlRewriteFilter</filter-name>
  <filter-class>org.tuckey.web.filters.urlrewrite.UrlRewriteFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>UrlRewriteFilter</filter-name>
  <url-pattern>/*</url-pattern>
  <dispatcher>REQUEST</dispatcher>
  <dispatcher>FORWARD</dispatcher>
</filter-mapping>
Create WEB-INF/urlrewrite.xml with these rules (or similar ones):
<urlrewrite>
  <rule>
    <condition type="request-uri" operator="notequal">.*bitstream.*</condition>
    <condition type="request-uri">.*(search-filter|browse|discover|statistics|recent-submissions|request-a-copy).*</condition>
    <condition name="user-agent">.*(bot|crawl|robot).*</condition>
    <from>.*</from>
    <set type="status">403</set>
    <to>null</to>
  </rule>
</urlrewrite>
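After rebuilding and redeploying the webapp, a similar check against Tomcat itself confirms the user-agent condition is doing its job (host, port, and context path are placeholders; adjust for your deployment). The first request should get a 403, the second should not:
$ curl -s -o /dev/null -w '%{http_code}\n' -A 'Googlebot' http://localhost:8080/discover
$ curl -s -o /dev/null -w '%{http_code}\n' -A 'Mozilla/5.0 (X11; Linux x86_64)' http://localhost:8080/discover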
According to my reading of how Google interprets the robots.txt standard, it might be possible to block these pages with either:
Disallow: /items*request-a-copy
... or:
Disallow: /*request-a-copy
Their documentation says Googlebot has limited support for wildcards.
I just learned about google/robotstxt, which is Google's robots.txt parser. It would be easy to use that to check some of our questions about syntax.
@amgciadev I have tested this now and confirm that Google's robots.txt
parser supports the syntax you propose.
Without the proposed rule in robots.txt:
$ ./robots robots.txt 'Googlebot' 'http://repository.edu/items/91ff4bfa-f086-48bf-a7b1-bbf0f7bceb26/request-a-copy'
user-agent 'Googlebot' with URI 'http://repository.edu/items/91ff4bfa-f086-48bf-a7b1-bbf0f7bceb26/request-a-copy': ALLOWED
... and after adding Disallow: /items/*/request-a-copy:
$ ./robots robots.txt 'Googlebot' 'http://repository.edu/items/91ff4bfa-f086-48bf-a7b1-bbf0f7bceb26/request-a-copy'
user-agent 'Googlebot' with URI 'http://repository.edu/items/91ff4bfa-f086-48bf-a7b1-bbf0f7bceb26/request-a-copy': DISALLOWED
So if Google is indexing these links in your repository, then it seems this will work to dissuade them. I have not noticed them indexing request-a-copy links in our large repository. My other concern is that this does not cover the case of entities, for example:
https://repository.edu/entities/publication/8d82004e-8665-4488-af8a-b10ac8f2a3ef/request-a-copy
We would need a more generic disallow like: Disallow: /*/request-a-copy
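For example, the more generic rule could be verified with the same google/robotstxt parser against both URL forms from this thread; with Disallow: /*/request-a-copy in robots.txt, both of these should come back DISALLOWED:
$ ./robots robots.txt 'Googlebot' 'http://repository.edu/items/91ff4bfa-f086-48bf-a7b1-bbf0f7bceb26/request-a-copy'
$ ./robots robots.txt 'Googlebot' 'https://repository.edu/entities/publication/8d82004e-8665-4488-af8a-b10ac8f2a3ef/request-a-copy'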