Islandora / documentation

Contains islandora's documentation and main issue queue.

[DOCS] Document how to deal with bots on live sites #2286

Open joshdentremont opened 4 months ago

joshdentremont commented 4 months ago

We should write up some docs about how to block bots from crawling a live site that is set up with Docker. This has come up a few times in Slack and it would be good to have something to explain how to deal with it.

One option some of us have been using is to edit drupal.defaults.conf to return a 403 based on user agent. I have done this by adding the following to my Dockerfile, but you could also mount the conf file and edit it manually:

# block bots in nginx
RUN echo -e '\n\
if ($http_user_agent ~ (Bytespider|Sogou|SemrushBot|AcademicBotRTU|PetalBot|GPTBot|DataForSeoBot|test-bot) ) { \n\
    return 403; \n\
}'\
>> /etc/nginx/shared/drupal.defaults.conf
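Before rebuilding the image, it can help to check which user agents the pattern would actually catch. The following is a minimal sketch, not part of Islandora or nginx: `ua_is_blocked` is a hypothetical helper that runs the same alternation through `grep -E`, and the two sample user agent strings are illustrative. Like nginx's `~` operator, the match is case-sensitive; nginx's `~*` would be the case-insensitive variant.

```shell
#!/bin/sh
# Same alternation as the nginx `if` block above; keep the two in sync by hand.
BOT_PATTERN='(Bytespider|Sogou|SemrushBot|AcademicBotRTU|PetalBot|GPTBot|DataForSeoBot|test-bot)'

# ua_is_blocked is a hypothetical helper for local testing only.
# Exits 0 if the user agent matches the pattern, 1 otherwise.
ua_is_blocked() {
    printf '%s\n' "$1" | grep -Eq "$BOT_PATTERN"
}

# Illustrative user agent strings: one that should be blocked, one that should pass.
for ua in \
    'Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)' \
    'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'
do
    if ua_is_blocked "$ua"; then
        echo "403  $ua"
    else
        echo "pass $ua"
    fi
done
```

The GPTBot string is reported as `403` and the bingbot string as `pass`. A lowercase `gptbot` would also pass, because the match is case-sensitive.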

It would also be nice to document how to block by IP address using Docker.
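One possible approach, sketched here under assumptions, is nginx's `deny` directive, appended to the same conf file as the user agent rule. `append_deny_rules` is a hypothetical helper, and the addresses are documentation-only ranges, not real offenders. If a reverse proxy such as Traefik sits in front of nginx, nginx may see the proxy's address rather than the client's, so it is worth checking the addresses in the access log before relying on this.

```shell
#!/bin/sh
# append_deny_rules is a hypothetical helper. It appends one nginx `deny`
# directive per address or CIDR range to the given config file.
append_deny_rules() {
    conf="$1"
    shift
    for addr in "$@"; do
        printf 'deny %s;\n' "$addr" >> "$conf"
    done
}

# Example run against a scratch file. 203.0.113.0/24 and 198.51.100.7 are
# reserved documentation addresses (RFC 5737), used here as placeholders.
scratch="$(mktemp)"
append_deny_rules "$scratch" 203.0.113.0/24 198.51.100.7
cat "$scratch"
rm -f "$scratch"
```

In a Dockerfile, the target would be `/etc/nginx/shared/drupal.defaults.conf` instead of the scratch file, assuming that file is included in a context where `deny` is allowed (http, server, or location).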

Related, but possibly a separate issue, is that bots are getting stuck looping over facets. I'm seeing this on my site with legit bots as well, like bingbot. If there is a way to prevent this we should document that as well.
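For crawlers that honor robots.txt, such as bingbot, one option to try is disallowing facet URLs there. This is a sketch under assumptions: it presumes facet links use the Facets module's default query key `f` (for example `?f[0]=...`, which can also appear percent-encoded as `f%5B0%5D`), and `facet_robots_rules` is a hypothetical helper. The `*` wildcard is part of RFC 9309 but not every crawler supports it, and bots that ignore robots.txt will not be affected at all.

```shell
#!/bin/sh
# facet_robots_rules is a hypothetical helper that prints robots.txt rules.
# Assumption: facets are exposed as query parameters named f[0], f[1], ...
# Both the literal and the percent-encoded bracket are covered.
facet_robots_rules() {
    cat <<'EOF'
Disallow: /*?*f[
Disallow: /*?*f%5B
EOF
}

# Print the rules so they can be pasted into the existing "User-agent: *"
# group of the site's robots.txt.
facet_robots_rules
```

The rules are printed rather than appended automatically because they need to land inside the right `User-agent` group, which depends on how the site's robots.txt is laid out.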

mjordan commented 4 months ago

> bots are getting stuck looping over facets

We've experienced this as well and it's brought our site to its knees.

ajstanley commented 4 months ago

Same. TikTok ignores robots.txt. We have one site that was getting several hits per second before we stuck a user agent filter in.

Natkeeran commented 4 months ago

Drupal specific info here: https://dev.acquia.com/blog/automated-bot-traffic-strategies-handle-and-manage-it

joshdentremont commented 4 months ago

Suggestions from tech call below:

Blocking bots by user agent:

Stopping legit bots from crawling facets:

Remaining questions:

ajstanley commented 4 months ago

Nginx allows for multiple conf files. We could add an include in nginx.conf to point to a file in /var/www/drupal which would eliminate the need for a separate mount.
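A rough sketch of that idea, with assumptions called out: `add_include` is a hypothetical helper, and `/var/www/drupal/nginx/*.conf` is an assumed location, not an existing Islandora convention. Directives such as `if` and `return` are only valid in a server or location context, so the `include` line has to end up inside one; appending it to `drupal.defaults.conf` is one way to do that if that file is itself included in a server block. Using a wildcard mask rather than a single filename should let nginx start even when no matching file exists yet.

```shell
#!/bin/sh
# add_include is a hypothetical helper. It appends an nginx `include`
# directive to a config file unless that exact line is already present,
# so running it twice does not duplicate the line.
add_include() {
    conf="$1"
    mask="$2"
    line="include $mask;"
    # -x matches whole lines, -F treats the mask (with its * and .) literally.
    grep -qxF "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
}

# Example run against a scratch file; the directory name is an assumption.
scratch="$(mktemp)"
add_include "$scratch" '/var/www/drupal/nginx/*.conf'
add_include "$scratch" '/var/www/drupal/nginx/*.conf'   # second call is a no-op
cat "$scratch"
rm -f "$scratch"
```

With this in place, bot rules could live in a file under the Drupal directory and be picked up after an nginx reload, without rebuilding the image.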

kayakr commented 2 months ago

fwiw, I've found the patch for facets at https://www.drupal.org/node/2937191 useful; it converts the facets into actual checkboxes instead of the default that renders them as links (followable by bots) that get converted to checkboxes by js.

joshdentremont commented 2 months ago

@kayakr that would be awesome if we could get that patched into the facets module. I really like the checkboxes for facets but am having the same issue with bots.