DSpace / dspace-angular

DSpace User Interface built on Angular.io
https://wiki.lyrasis.org/display/DSDOC8x/
BSD 3-Clause "New" or "Revised" License

Further enhance robots.txt #2225

Open amgciadev opened 1 year ago

amgciadev commented 1 year ago

We are working on improving Google indexing and have noticed that pages of the form /items/*/request-a-copy are being indexed by Google. Would it make sense to further enhance robots.txt with the optional statement:

Disallow: /items/*/request-a-copy
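
For context, this is roughly how the rule would sit in the default robots.txt. A minimal sketch, assuming the usual "User-agent: *" block; the other Disallow lines are illustrative, not the exact contents of the file DSpace ships:

User-agent: *
# Existing disallows for dynamic pages (illustrative)
Disallow: /search
Disallow: /statistics
# Proposed addition: keep request-a-copy forms out of the index
Disallow: /items/*/request-a-copy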

tdonohue commented 1 year ago

@amgciadev : If you've found that Disallow is working to remove those pages from Google indexing, then I agree it might be useful to add to the default robots.txt.

Feel free to send us a PR and we can get it reviewed. Thanks!

aroman-arvo commented 1 year ago

Wildcards are not part of the original robots.txt specification. Trailing wildcards (e.g. xxx/*) are usually honored, but wildcards in the middle of a path (e.g. xxx/*/yyy) are not. Add to that the fact that many crawlers do not respect robots.txt at all, and I would rather apply these rules (or similar ones) in Apache.

Edit the site configuration (HTTP or HTTPS) to add the rules:

vi /etc/apache2/sites-enabled/default-ssl.conf

You may also need to add "RewriteEngine on".

Add to the file:

RewriteCond %{HTTP_USER_AGENT} (bot|crawl|robot) [NC]
RewriteCond %{REQUEST_URI} !bitstream
RewriteRule ^(.*)(search-filter|browse|discover|statistics|recent-submissions|request-a-copy)(.*) - [F,L]
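
A quick way to sanity-check the rules (a sketch, assuming a hypothetical repository.example.edu served through this Apache vhost; the uuid placeholder is illustrative): spoof a bot user agent with curl and look for the 403.

$ # Bot user agent on a request-a-copy page: expect 403 from the RewriteRule
$ curl -s -o /dev/null -w '%{http_code}\n' -A 'Googlebot' 'https://repository.example.edu/items/<uuid>/request-a-copy'
$ # Regular browser user agent: expect the normal response
$ curl -s -o /dev/null -w '%{http_code}\n' -A 'Mozilla/5.0' 'https://repository.example.edu/items/<uuid>/request-a-copy'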

The same can be done in Tomcat using UrlRewriteFilter (http://www.tuckey.org/urlrewrite/).

Add to pom.xml:

<dependency>
    <groupId>org.tuckey</groupId>
    <artifactId>urlrewritefilter</artifactId>
    <version>4.0.3</version>
</dependency>

Add to web.xml:

<filter>
    <filter-name>UrlRewriteFilter</filter-name>
    <filter-class>org.tuckey.web.filters.urlrewrite.UrlRewriteFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>UrlRewriteFilter</filter-name>
    <url-pattern>/*</url-pattern>
    <dispatcher>REQUEST</dispatcher>
    <dispatcher>FORWARD</dispatcher>
</filter-mapping>
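
While testing, it can help to raise the filter's logging. UrlRewriteFilter supports an optional logLevel init-param; a sketch (the available levels and defaults are described in the UrlRewriteFilter documentation):

<filter>
    <filter-name>UrlRewriteFilter</filter-name>
    <filter-class>org.tuckey.web.filters.urlrewrite.UrlRewriteFilter</filter-class>
    <init-param>
        <param-name>logLevel</param-name>
        <param-value>DEBUG</param-value>
    </init-param>
</filter>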

Create WEB-INF/urlrewrite.xml with these rules (or similar ones):

<urlrewrite>
    <rule>
      <condition type="request-uri" operator="notequal">.*bitstream.*</condition>
      <condition type="request-uri">.*(search-filter|browse|discover|statistics|recent-submissions|request-a-copy).*</condition>
      <condition name="user-agent">.*(bot|crawl|robot).*</condition>
      <from>.*</from>
      <set type="status">403</set>
      <to>null</to>
    </rule>
</urlrewrite>
alanorth commented 10 months ago

According to my reading of how Google interprets the robots.txt standard, it might be possible with either:

Disallow: /items*request-a-copy

... or:

Disallow: /*request-a-copy

They say Googlebot has limited support for wildcards.

alanorth commented 1 day ago

I just learned about google/robotstxt, which is Google's robots.txt parser. It would be easy to use it to check some of our questions about the syntax.
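
For anyone who wants to try it, a rough sketch of building the tester with CMake, following the google/robotstxt README (exact steps may change, so check the repository):

$ git clone https://github.com/google/robotstxt.git
$ cd robotstxt
$ mkdir c-build && cd c-build
$ cmake ..
$ make
$ # this produces the robots binary used in the tests below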

alanorth commented 1 day ago

@amgciadev I have tested this now and confirm that Google's robots.txt parser supports the syntax you propose.

Before:

$ ./robots robots.txt 'Googlebot' 'http://repository.edu/items/91ff4bfa-f086-48bf-a7b1-bbf0f7bceb26/request-a-copy'
user-agent 'Googlebot' with URI 'http://repository.edu/items/91ff4bfa-f086-48bf-a7b1-bbf0f7bceb26/request-a-copy': ALLOWED

After:

$ ./robots robots.txt 'Googlebot' 'http://repository.edu/items/91ff4bfa-f086-48bf-a7b1-bbf0f7bceb26/request-a-copy'
user-agent 'Googlebot' with URI 'http://repository.edu/items/91ff4bfa-f086-48bf-a7b1-bbf0f7bceb26/request-a-copy': DISALLOWED

So if Google is indexing these links in your repository, then it seems this will work to dissuade them. I have not noticed them indexing request-a-copy links in our large repository. My other concern is that this does not cover the case of entities, for example:

https://repository.edu/entities/publication/8d82004e-8665-4488-af8a-b10ac8f2a3ef/request-a-copy

We would need a more generic disallow like: Disallow: /*/request-a-copy
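
The generic form can be checked with the same tester. A sketch, assuming a robots.txt containing only the generic rule; both the /items/ and /entities/ URLs should then be reported as DISALLOWED:

User-agent: *
Disallow: /*/request-a-copy

$ ./robots robots.txt 'Googlebot' 'https://repository.edu/entities/publication/8d82004e-8665-4488-af8a-b10ac8f2a3ef/request-a-copy'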