booru / philomena

Next-generation imageboard software. This software development project is independent of any image hosting project.
GNU Affero General Public License v3.0

disallow bad/spammy bots, also search engine spiders should do their own searching #75

Closed basisbit closed 4 years ago

basisbit commented 4 years ago

Co-Authored-By: liamwhite liamwhite@users.noreply.github.com

Disallow bad/spammy bots that are not used by the users Philomena targets.

Also cherry-picked https://github.com/philomena-dev/philomena/commit/7046857cbda737d29576ae1967cb1d2cd43b1f4c
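
For illustration only, a minimal robots.txt sketch of this kind of rule; the user-agent names below are placeholders, not the actual list added by this PR:

```
# Hypothetical example; the real user-agent list is the one in the PR's robots.txt.
User-agent: ExampleSpamBot
Disallow: /

User-agent: ExampleScraperBot
Disallow: /

# Everything else stays allowed.
User-agent: *
Disallow:
```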

basisbit commented 4 years ago

@cultpony according to their websites and other recent user reports, these bots do comply with robots.txt as long as there aren't any parsing errors in it.

cultpony commented 4 years ago

That's not exactly my experience dealing with their bots; Baidu is usually worse, though, so that might affect my opinion. Is there a log monitoring system in place so we can monitor how much crawlers affect the site?

basisbit commented 4 years ago

There is the Docker log, and nginx logs the requests, user agent strings and status codes (but not IP addresses or similar). Blocking Baidu would affect users, because the website would then no longer show up in the Baidu search engine. Baidu is one of the biggest search engines, so I'd suggest not blocking it; if anything, rate-limit it.
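
For reference, nginx logging of that shape might look roughly like the sketch below; the format name and log path are made up here, and the actual deployment config may differ:

```
# Hypothetical snippet from within the http {} context of nginx.conf:
# log the request line, status code and user agent, but no client IP.
log_format no_ip '$request $status "$http_user_agent"';

server {
    listen 80;
    access_log /var/log/nginx/access.log no_ip;
    # ... rest of the site configuration ...
}
```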

cultpony commented 4 years ago

I would at least recommend blocking it on anything that isn't the front page; Baidu doesn't care much about deep-link content, and we don't really target Baidu for SEO since we don't offer Chinese content. I would definitely set up a solution to monitor and analyze where requests hitting the server come from, otherwise we'll be blind if a crawler decides to spam requests at the site.
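
One way to restrict a single crawler to the front page via robots.txt could look roughly like this; support for Allow and the $ end-of-URL anchor varies between spiders, so treat it as a sketch rather than a guaranteed rule set:

```
# Hypothetical rules, not part of this PR: let Baiduspider fetch only the front page.
User-agent: Baiduspider
Allow: /$
Disallow: /
```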

basisbit commented 4 years ago

The boorus mostly offer only English content. That doesn't limit the audience to visitors from the UK, USA and Australia either, so I don't understand the "no Chinese content" argument. Various fandoms (including the brony fandom) have people who live in China, and those people will probably use their usual search engines to look for art with various keywords.

The tagging system is what allows Philomena to be so successful: people search for a certain piece of art in their search engine, and the image search shows them the desired booru page as a result. Allowing one of the big search engines to scan only the front page (which barely has any content a search engine can process) means that hardly anyone will find the website unless they search for its exact name. It would also hurt the page ranking, because the site would then have no index-relevant content.

Monitoring can of course be set up, but that is not the goal of this issue.

Regarding the "blind if a crawler decides to DoS the website": I suggest using cloudflare or any other system that supports rate-limiting + monitoring of too many requests (you can even implement this using iptables/nftables by limiting new tcp connections per second).

basisbit commented 4 years ago

The important change of this PR, imho, is that search pages (/images and the various /tags/tagname pages) are no longer indexed by the bots, which reduces load on Elasticsearch and Postgres. The logging/monitoring solution that we additionally want to implement will be discussed in #77