DSpace / dspace-angular

DSpace User Interface built on Angular.io
https://wiki.lyrasis.org/display/DSDOC8x/
BSD 3-Clause "New" or "Revised" License

Further enhance robots.txt #2225

Open amgciadev opened 1 year ago

amgciadev commented 1 year ago

We are working on improving Google indexing and have noticed that pages in the form /items/*/request-a-copy are indexed in Google. Would it make sense to further enhance robots.txt with the optional statement:

Disallow: /items/*/request-a-copy
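
For context, a minimal sketch of how the proposed line could sit inside the default robots.txt (the User-agent group and the neighbouring Disallow entries are illustrative, not an exact copy of the shipped file):

User-agent: *
# existing entries (illustrative)
Disallow: /search
Disallow: /admin/*
# proposed addition: keep request-a-copy forms out of the index
Disallow: /items/*/request-a-copy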

tdonohue commented 1 year ago

@amgciadev : If you've found that Disallow is working to remove those pages from Google indexing, then I agree it might be useful to add to the default robots.txt.

Feel free to send us a PR and we can get it reviewed. Thanks!

aroman-arvo commented 1 year ago

Wildcards are not part of the original robots.txt specification. Trailing wildcards (e.g. xxx/*) are usually honored, but wildcards in the middle of a path (e.g. xxx/*/yyy) are not. Given that, and the fact that many crawlers do not respect robots.txt at all, I would apply these rules (or similar ones) in Apache.

Edit the site configuration to add the rules (for the http or https virtual host):

vi /etc/apache2/sites-enabled/default-ssl.conf

You may also need to add "RewriteEngine on".

Add to the file:

# Return 403 Forbidden to bot-like user agents on the listed UI paths, but never block bitstream downloads
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|robot) [NC]
RewriteCond %{REQUEST_URI} !bitstream
RewriteRule ^(.*)(search-filter|browse|discover|statistics|recent-submissions|request-a-copy)(.*) - [F,L]
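
A quick way to sanity-check a rule like this once it is active (a sketch; example.org stands in for the real hostname):

# a bot-like User-Agent should be refused with 403 Forbidden
curl -I -A "Googlebot" https://example.org/search-filter
# a regular browser User-Agent should still be served
curl -I -A "Mozilla/5.0" https://example.org/search-filter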

The same can be done in Tomcat with UrlRewriteFilter (http://www.tuckey.org/urlrewrite/):

Add to pom.xml:

<dependency>
    <groupId>org.tuckey</groupId>
    <artifactId>urlrewritefilter</artifactId>
    <version>4.0.3</version>
</dependency>

Add to web.xml:

<filter>
    <filter-name>UrlRewriteFilter</filter-name>
    <filter-class>org.tuckey.web.filters.urlrewrite.UrlRewriteFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>UrlRewriteFilter</filter-name>
    <url-pattern>/*</url-pattern>
    <dispatcher>REQUEST</dispatcher>
    <dispatcher>FORWARD</dispatcher>
</filter-mapping>

Create WEB-INF/urlrewrite.xml with these rules (or similar ones):

<urlrewrite>
    <rule>
      <!-- never block bitstream downloads -->
      <condition type="request-uri" operator="notequal">.*bitstream.*</condition>
      <!-- only match the listed UI paths -->
      <condition type="request-uri">.*(search-filter|browse|discover|statistics|recent-submissions|request-a-copy).*</condition>
      <!-- only apply to bot-like user agents -->
      <condition name="user-agent">.*(bot|crawl|robot).*</condition>
      <from>.*</from>
      <!-- answer 403 Forbidden and do not forward the request anywhere -->
      <set type="status">403</set>
      <to>null</to>
    </rule>
</urlrewrite>
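
Once the filter is registered, UrlRewriteFilter can confirm that the rules were parsed: by default it exposes a status page at <context>/rewrite-status, reachable only from localhost. A sketch of a quick check run on the server itself (the /server context path and port 8080 are assumptions):

# status page is localhost-only by default and lists the loaded rules
curl http://localhost:8080/server/rewrite-status
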
alanorth commented 8 months ago

According to my reading of how Google interprets the robots.txt standard, it might be possible with either:

Disallow: /items*request-a-copy

... or:

Disallow: /*request-a-copy

Google says Googlebot has limited support for wildcards.