bcgov / orgbook-bc-client

Vue.js rewrite of OrgBook BC
Apache License 2.0

Update robots.txt file to stop legitimate services from unnecessarily scanning paths such as the API #195

Closed WadeBarnes closed 1 year ago

WadeBarnes commented 1 year ago

For example, we're seeing the API being scanned by Bytespider.

The robots.txt file is defined, but does not specify any rules.
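
For reference, a robots.txt that names an agent but leaves Disallow empty permits all crawling, since an empty value matches nothing; the current file presumably looks something like this (an assumption, as its exact contents are not quoted in the thread):

User-agent: *
Disallow: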

Are there other settings or files we can use to deter legitimate services from scanning the API unnecessarily?
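
One option worth noting alongside robots.txt is the X-Robots-Tag response header, which Google documents for keeping compliant crawlers from indexing specific paths. A minimal sketch, assuming an nginx-style front end (the actual server for this deployment is not stated in the thread):

location /api {
    # Ask compliant crawlers not to index the API or follow links under it
    add_header X-Robots-Tag "noindex, nofollow";
}

Note that this header controls indexing rather than crawling, so it complements robots.txt rules rather than replacing them.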

WadeBarnes commented 1 year ago

In this case, the organization behind Bytespider is well known for not respecting the robots.txt file, but we need to update it anyway.

WadeBarnes commented 1 year ago

From @cvarjao: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#syntax

cvarjao commented 1 year ago

The Disallow path should start with a /, for example:

User-agent: *
Disallow: /

Some other comments: https://www.feitsui.com/en/article/32

It seems Bytespider and Sogou spiders are not fully compatible with the robots exclusion standard. These crawlers magically disappeared one week after I created a separate block for each user agent in robots.txt.
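
A minimal sketch of that per-agent approach; the user-agent tokens here (Bytespider, Sogou web spider) are the commonly reported ones and should be verified against this site's access logs:

User-agent: Bytespider
Disallow: /

User-agent: Sogou web spider
Disallow: /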

WadeBarnes commented 1 year ago

Other sources indicate they do not respect the robots.txt file at all.

WadeBarnes commented 1 year ago

That said, it's worth adding the explicit entries to the robots.txt file to see whether they have any effect.
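
As a sketch, those explicit entries could combine the per-agent blocks shown above with a blanket rule keeping all other crawlers out of the API; the /api/ prefix is inferred from the issue title, so the deployed path prefix should be confirmed first:

User-agent: Bytespider
Disallow: /

User-agent: Sogou web spider
Disallow: /

User-agent: *
Disallow: /api/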