freedomofpress / pressfreedomtracker.us

Code for the U.S. Press Freedom Tracker project website
https://pressfreedomtracker.us
GNU Affero General Public License v3.0

Create robots.txt #1761

Closed: eloquence closed this issue 11 months ago

eloquence commented 1 year ago

We're seeing a lot of crawler traffic, which can slow down other site operations. Let's create a carefully crafted robots.txt that defines what's OK to crawl and what's not.
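Something along these lines could be a starting point. This is a minimal sketch assuming the site's existing Django/Wagtail stack; the view name and URL wiring are illustrative, not the project's actual code, and the rules themselves are a placeholder until we agree on what to block:

```python
# Sketch: serve a hand-maintained robots.txt from a Django view so the rules
# live in version control alongside the rest of the site.
from django.http import HttpResponse
from django.urls import path
from django.views.decorators.http import require_GET


@require_GET
def robots_txt(request):
    # Placeholder policy: allow everything until specific rules are agreed on.
    lines = [
        "User-agent: *",
        "Disallow:",
    ]
    return HttpResponse("\n".join(lines), content_type="text/plain")


urlpatterns = [
    path("robots.txt", robots_txt),
]
```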

soleilera commented 1 year ago

Let's add this across all sites when working on this one.

SaptakS commented 1 year ago

Are there specific paths we want to block from the crawlers? Or do we want to disallow all paths from all crawlers?

harrislapiroff commented 1 year ago

I think we should disallow crawling of querystrings on the incident database page. I think that would be /all-incidents/?*
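As a sketch, that rule could slot into the view above like this. Note the `*` wildcard is a Google/Bing extension rather than part of the original robots.txt spec, and because matching is prefix-based the trailing `*` is effectively optional; the exact form here is an assumption:

```python
    # Block querystring variants (filters, pagination, search parameters) of
    # the incident database; the bare /all-incidents/ page stays crawlable.
    lines = [
        "User-agent: *",
        "Disallow: /all-incidents/?*",
    ]
```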

eloquence commented 12 months ago

@harrislapiroff @SaptakS Are historical incidents crawlable without hitting URLs like https://pressfreedomtracker.us/all-incidents/?page=2? If not I think we need to be more careful to ensure our site remains fully externally searchable.

SaptakS commented 12 months ago

I doubt it, unless someone uses our API endpoints, which also raises a good question of whether I should Disallow the API endpoints.

As for making the site searchable, there are two ways to solve this problem.

I think we can even do both. Wagtail does allow us to create sitemaps (https://docs.wagtail.org/en/stable/reference/contrib/sitemaps.html), so if this is something we want to do, I can extend this issue to add a sitemap as well.
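Following the docs linked above, the wiring would look roughly like this. The app and import names are the standard ones from the Wagtail documentation, but exactly where the route lives in this project is an assumption:

```python
# settings.py
INSTALLED_APPS = [
    # ...
    "django.contrib.sitemaps",
    "wagtail.contrib.sitemaps",
]

# urls.py
from django.urls import path
from wagtail.contrib.sitemaps.views import sitemap

urlpatterns = [
    # ...
    path("sitemap.xml", sitemap),
]
```

A `Sitemap: https://pressfreedomtracker.us/sitemap.xml` line could then be added to robots.txt so crawlers discover it.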

harrislapiroff commented 12 months ago

Let's expand this ticket to include implementing a sitemap. We've been talking about it for a while and it seems like an improvement with multiple benefits. Thanks, @SaptakS