bcgov / orgbook-bc-client

Vue.js rewrite of OrgBook BC
Apache License 2.0

Update the robots.txt file to stop legitimate services from unnecessarily scanning paths such as the API #195

Closed: WadeBarnes closed this issue 10 months ago

WadeBarnes commented 1 year ago

For example, we're seeing the API being scanned by Bytespider.

The robots.txt file is defined, but does not specify any rules.

Are there other settings/files we can use to deter legitimate services from scanning the API unnecessarily?
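
A minimal sketch of what such rules might look like, assuming the API is served under /api (the actual path prefix used by this deployment may differ):

# Hypothetical example: ask all crawlers to skip the API path.
# The /api/ prefix is an assumption; adjust to match the deployed routes.
User-agent: *
Disallow: /api/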

WadeBarnes commented 10 months ago

In this case, the organization behind Bytespider is well known for not respecting the robots.txt file, but we need to update it anyway.

WadeBarnes commented 10 months ago

From @cvarjao: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#syntax

cvarjao commented 10 months ago

The Disallow path should start with /, for example:

User-agent: *
Disallow: /

Some other comments, from https://www.feitsui.com/en/article/32:

"It seems Bytespider and Sogou spiders are not fully compatible with the robots exclusion standard. These crawlers magically disappeared one week after I created a separate block for each user agent in robots.txt."
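
A sketch of that per-agent approach; the user-agent tokens below (Bytespider, Sogou web spider) are the commonly reported ones and should be verified against actual access logs:

# Hypothetical per-agent blocks; the tokens are assumptions based on
# commonly reported crawler user agents and should be confirmed in server logs.
User-agent: Bytespider
Disallow: /

User-agent: Sogou web spider
Disallow: /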

WadeBarnes commented 10 months ago

Other sources indicate they do not respect the robots.txt file at all.

WadeBarnes commented 10 months ago

That said, it's worth adding the explicit entries to the robots.txt file to try things out.
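
Once deployed, the served file can be spot-checked with a quick fetch; the hostname below is an assumption and should be replaced with the actual deployment URL:

# Fetch the deployed robots.txt to confirm the new rules are being served.
curl -s https://orgbook.gov.bc.ca/robots.txt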