freedomofpress / pressfreedomtracker.us

Code for the U.S. Press Freedom Tracker project website
https://pressfreedomtracker.us
GNU Affero General Public License v3.0

Create robots.txt #1761

Closed: eloquence closed this issue 11 months ago

eloquence commented 1 year ago

We're seeing a lot of crawler traffic, which can slow down other site operations. Let's create a carefully crafted robots.txt that defines what's OK to crawl and what's not.
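Something along these lines could be a starting point. This is a minimal sketch assuming the site's existing Django/Wagtail stack; the view name and URL wiring are illustrative, not the project's actual code, and the rules themselves are a placeholder until we agree on what to block:

```python
# Sketch: serve a hand-maintained robots.txt from a Django view so the rules
# live in version control alongside the rest of the site.
from django.http import HttpResponse
from django.urls import path
from django.views.decorators.http import require_GET


@require_GET
def robots_txt(request):
    # Placeholder policy: allow everything until specific rules are agreed on.
    lines = [
        "User-agent: *",
        "Disallow:",
    ]
    return HttpResponse("\n".join(lines), content_type="text/plain")


urlpatterns = [
    path("robots.txt", robots_txt),
]
```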

soleilera commented 1 year ago

Let's add this across all sites when working on this one.

SaptakS commented 1 year ago

Are there specific paths we want to block from the crawlers? Or do we want to disallow all paths from all crawlers?

harrislapiroff commented 1 year ago

I think we should disallow crawling of querystrings on the incident database page. I think that would be /all-incidents/?*
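As a sketch, that rule could slot into the view above like this. Note the `*` wildcard is a Google/Bing extension rather than part of the original robots.txt spec, and because matching is prefix-based the trailing `*` is effectively optional; the exact form here is an assumption:

```python
    # Block querystring variants (filters, pagination, search parameters) of
    # the incident database; the bare /all-incidents/ page stays crawlable.
    lines = [
        "User-agent: *",
        "Disallow: /all-incidents/?*",
    ]
```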

eloquence commented 12 months ago

@harrislapiroff @SaptakS Are historical incidents crawlable without hitting URLs like https://pressfreedomtracker.us/all-incidents/?page=2? If not I think we need to be more careful to ensure our site remains fully externally searchable.

SaptakS commented 12 months ago

I doubt it, unless someone uses our API endpoints, which also raises a good question of whether I should Disallow the API endpoints.

As for making the site searchable, there are two ways to solve this problem.

I think we can even do both. Wagtail does allow us to create sitemaps (https://docs.wagtail.org/en/stable/reference/contrib/sitemaps.html), so if this is something we want to do, I can extend this issue to add a sitemap as well.
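Following the docs linked above, the wiring would look roughly like this. The app and import names are the standard ones from the Wagtail documentation, but exactly where the route lives in this project is an assumption:

```python
# settings.py
INSTALLED_APPS = [
    # ...
    "django.contrib.sitemaps",
    "wagtail.contrib.sitemaps",
]

# urls.py
from django.urls import path
from wagtail.contrib.sitemaps.views import sitemap

urlpatterns = [
    # ...
    path("sitemap.xml", sitemap),
]
```

A `Sitemap: https://pressfreedomtracker.us/sitemap.xml` line could then be added to robots.txt so crawlers discover it.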

harrislapiroff commented 12 months ago

Let's expand this ticket to include implementing a sitemap. We've been talking about it for a while and it seems like an improvement with multiple benefits. Thanks, @SaptakS