GetPublii / Publii

The most intuitive Static Site CMS designed for SEO-optimized and privacy-focused websites.
https://getpublii.com
GNU General Public License v3.0
6.35k stars, 419 forks

[Feature Request]: More agents for dark visitors in robots.txt #1314

Open arthurzenika opened 10 months ago

arthurzenika commented 10 months ago

Feature Description

First of all, congratulations on shipping an out-of-the-box setting for blocking ChatGPT and other bots in robots.txt. It's a super cool feature!

There are more bots that could be added to the list: https://darkvisitors.com/

(I looked at the code to try to contribute the list directly. Maybe there could be a catch-all setting instead of a button per bot? Or do some users want to select them one by one?)
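To make the catch-all idea concrete, here is a minimal sketch of how a single switch could expand into one robots.txt group per bot. The `AI_CRAWLER_AGENTS` list and the `build_robots_txt` helper are hypothetical names for illustration only; this is not Publii's code, and the agent list is a small sample rather than the full list from darkvisitors.com.

```python
# Illustrative sample of AI crawler user-agent tokens (an assumption, not
# Publii's actual list and not the complete darkvisitors.com list).
AI_CRAWLER_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "ClaudeBot"]


def build_robots_txt(blocked_agents, block_all=True):
    """Return robots.txt text that disallows the given user agents site-wide.

    block_all is the hypothetical catch-all switch: when False, no bot-specific
    groups are emitted and only the permissive default group remains.
    """
    agents = blocked_agents if block_all else []
    lines = []
    for agent in agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
        lines.append("")  # a blank line separates groups
    # Every other crawler stays allowed (an empty Disallow permits everything).
    lines.append("User-agent: *")
    lines.append("Disallow:")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(AI_CRAWLER_AGENTS))
```

A per-bot selection would fit the same helper: pass only the agents the user ticked instead of the whole list.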

atomGit commented 9 months ago

blocking anything via robots.txt is generally a poor approach - this is like me asking you to cease using the word 'the' in conversation; it's entirely your choice whether you do or don't - i had a chuckle when i saw this option was added to publii because it's pointless

while some language scrapers for so-called "AI" (there is no AI (yet), not that the public can access anyway) may obey robots.txt, others will not, thus attempting to block such requests via robots.txt is nothing but an exercise in enumerating badness
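The voluntary nature of robots.txt is easy to see in code. The sketch below uses Python's standard `urllib.robotparser` to show the check a well-behaved crawler performs before fetching; the `ROBOTS_TXT` content, the `is_allowed` helper, and the example.com URL are made up for illustration. Nothing on the server enforces the result, so a scraper that skips this check simply fetches the page anyway.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots.txt permits user_agent to fetch url.

    This is the check a *compliant* crawler runs on its own side.
    The server never sees or enforces the answer.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/post.html"
    print("GPTBot allowed:", is_allowed("GPTBot", page))
    print("Other agent allowed:", is_allowed("SomeBrowser", page))
```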

if you really want to block this type of request, the tl;dr answer is: don't bother

the longer answer is that you may have to resort to blocking consecutive requests from non-search engines because the UA/IP could be anything ... including a search engine or a genuine web browser, thus this approach is also useless (see previous answer)

i feel ya and i agree that one should be able to block this crap, but there is literally no reliable way to do so that i'm aware of - every site will be indexed for AI at some point, either directly or indirectly

raramuridesign commented 9 months ago

@arthurzenika We have found the most effective way to block bots is to rather use Cloudflare WAF Rules, although this does require knowing which ones to block. We have a list of more than 50 bots we actively block on projects. You can read more here.
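As a rough sketch of what such a rule might look like, the helper below builds a filter expression that matches any request whose User-Agent contains one of the listed tokens. It assumes the `http.user_agent` field and `contains` operator of Cloudflare's rules language; `build_waf_expression` and the sample agent names are hypothetical and are not the list referred to above. The resulting string would be pasted into a custom rule with a block action.

```python
def build_waf_expression(agents):
    """Build a rule expression matching any of the given User-Agent tokens.

    Assumes Cloudflare's rules-language syntax: http.user_agent contains "..."
    """
    clauses = []
    for agent in agents:
        # Refuse characters that would break out of the quoted string.
        if '"' in agent or "\\" in agent:
            raise ValueError(f"unsupported character in agent name: {agent!r}")
        clauses.append(f'(http.user_agent contains "{agent}")')
    return " or ".join(clauses)


if __name__ == "__main__":
    # Sample tokens for illustration only.
    print(build_waf_expression(["GPTBot", "CCBot"]))
```

Like robots.txt, this only matches bots that announce themselves honestly in the User-Agent header.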

internettips commented 9 months ago

Blocking bots may be more trouble than it is worth. Perhaps the best solution is to simply put your valuable content behind a password-protected / Passkey login. Interested parties will register -- in most instances, that could be a hand-raiser signal. With password-protected logins for premium content, casual bots get blocked by default. So do search engines for your valuable content (unless you create SE pass-throughs). Then develop a business sales strategy to focus on your password-protected content.

If you're concerned that search engines won't be able to index content behind the password-protected barrier, consider this: with typically only 8 spots, or fewer, available on page 1 of search results, is SEO worth more than a token effort any more, at least for most websites? There are better ways to find and connect with hand-raisers, then turn some of those into buyers. Search engine listings are a nice side benefit, not a core business driver. Or, if you have funds to spare, maybe you could use web ads to promote content. Though with AI developments about to gain another boost this year, old-style SEO and even web advertising are probably going to change sooner rather than later.

atomGit commented 9 months ago

We have found the most effective way to block bots is to rather use Cloudflare ...

that's mistake no. 2

Stay away from Cloudflare

why you shouldn't use Cloudflare - tiq's tech-blog

Why does cloudflare suck? : CloudFlare

cloudflare can take their gd annoying captcha crap and shove it where the sun don't shine, right along with their stupid custom http error codes

sites using this service are compromising user privacy and that shouldn't be acceptable, not by anyone