Closed basisbit closed 4 years ago
@cultpony according to their websites and other recent user reports, these bots do comply with robots.txt as long as there aren't any parsing errors in it.
That's not exactly my experience dealing with their bots; usually Baidu is worse, though, so that might affect my opinion. Is there a log monitoring system in place so we can see how much crawlers affect the site?
There is the Docker log, and nginx logs the requests, user agent string and status codes (but not IP addresses or similar). Blocking Baidu would affect users, because the website would then no longer show up in the Baidu search engine. Baidu is one of the biggest search engines, so I'd suggest not blocking it; if anything, rate-limit it.
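The logging policy described above (request, user agent and status code, but no client IP) could be expressed with a custom nginx `log_format` along these lines; the format name and exact fields here are illustrative, not taken from the actual deployment:

```
# Sketch of an IP-free access log format (name "noip" is hypothetical):
# timestamp, request line, status code, and user agent only.
log_format noip '$time_iso8601 "$request" $status "$http_user_agent"';

server {
    # ... other directives ...
    access_log /var/log/nginx/access.log noip;
}
```

This keeps enough information to spot a misbehaving crawler by its user agent string without storing personal data.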
I would at least recommend blocking it on anything that isn't the front page; Baidu doesn't care much about deep-link content, and we don't really target Baidu for SEO since we don't offer Chinese content. I would definitely set up a solution to monitor and analyze where requests hitting the server are coming from; otherwise we'll be blind if a crawler decides to spam requests at the site.
The boorus mostly offer only English content, but that doesn't limit them to visitors from the UK, the USA and Australia either, so I don't understand the "no Chinese content" argument. Various fandoms (including the brony fandom) have people who live in China, and those people will probably use their usual search engines to look for art with various keywords. The tagging system is what allows Philomena to be so successful: people search for a certain piece of art in their search engine, and the image search shows them the desired booru page as a result. Allowing one of the big search engines to scan only the front page (which barely has any content a search engine can process) would mean that hardly anyone finds the website unless they search for its name directly. It would also result in bad page ranking, because the website would not have any index-relevant content.
Monitoring can of course be set up, but that is not the goal of this issue.
Regarding being "blind if a crawler decides to DoS the website": I suggest using Cloudflare or any other system that supports rate-limiting plus monitoring of excessive requests (you can even implement this with iptables/nftables by limiting new TCP connections per second).
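The nftables variant of that idea could look roughly like this; the ports and rate are illustrative numbers, not tuned values for this site:

```
# Sketch: drop new TCP connections to the web ports once they exceed
# ~20 per second. Adjust the rate to real traffic before deploying.
table inet filter {
    chain input {
        type filter hook input priority 0; policy accept;
        tcp dport { 80, 443 } ct state new limit rate over 20/second drop
    }
}
```

This throttles an abusive crawler at the kernel level before its requests ever reach nginx or the application, though a real deployment would likely want per-source limits rather than a single global one.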
The important change of this PR, imho, is that search pages (/images and the various /tags/tagname pages) are no longer indexed by the bots, reducing load on Elasticsearch + Postgres. The logging/monitoring solution which we additionally want to implement will be discussed in #77.
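In robots.txt terms, the change described above amounts to rules of roughly this shape (the exact paths and user-agent sections in the actual PR may differ):

```
# Sketch: keep well-behaved crawlers off the expensive search and tag
# pages that hit Elasticsearch/Postgres on every request.
User-agent: *
Disallow: /images
Disallow: /tags/
```

Note that robots.txt only restrains compliant crawlers; anything ignoring it still needs the rate-limiting discussed above.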
Co-Authored-By: liamwhite liamwhite@users.noreply.github.com
Before you begin
I understand my contributions may be rejected for any reason
I understand my contributions are for the benefit of this imageboard software
I understand my contributions are licensed under the GNU AGPLv3
[x] I understand all of the above
Disallow bad/spammy bots that are not used by users targeted by Philomena.
Also cherry-picked https://github.com/philomena-dev/philomena/commit/7046857cbda737d29576ae1967cb1d2cd43b1f4c