eloquence closed this issue 11 months ago
Let's add this across all sites when working on this one.
Are there specific paths we want to block from the crawlers? Or do we want to disallow all paths from all crawlers?
I think we should disallow crawling of querystrings on the incident database page. I think the pattern would be `/all-incidents/?*`
@harrislapiroff @SaptakS Are historical incidents crawlable without hitting URLs like https://pressfreedomtracker.us/all-incidents/?page=2? If not, I think we need to be more careful to ensure our site remains fully searchable from external search engines.
I doubt it, unless someone uses our API endpoints. That raises a good question of whether I should also Disallow the API endpoints.
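One way to sanity-check candidate rules before deploying is the stdlib `urllib.robotparser`. A minimal sketch, where the `/api/` path and the exact patterns are assumptions; note that `urllib.robotparser` matches plain path prefixes, so this uses an `Allow` prefix for pagination rather than a Google-style `*` wildcard:

```python
# Sanity-check candidate robots.txt rules with the stdlib parser.
# The /api/ path and exact patterns below are assumptions, not the
# final rule set for this site.
from urllib import robotparser

CANDIDATE_RULES = """\
User-agent: *
Allow: /all-incidents/?page=
Disallow: /all-incidents/?
Disallow: /api/
"""

rp = robotparser.RobotFileParser()
rp.parse(CANDIDATE_RULES.splitlines())

# Pagination URLs stay crawlable...
print(rp.can_fetch("*", "https://pressfreedomtracker.us/all-incidents/?page=2"))
# ...while other querystring filters and the API do not.
print(rp.can_fetch("*", "https://pressfreedomtracker.us/all-incidents/?categories=5"))
print(rp.can_fetch("*", "https://pressfreedomtracker.us/api/"))
```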
As for making the site searchable, there are two ways to solve this problem: an `Allow` rule for the `/?page=*` pages, or a sitemap. I think we can even do both. Wagtail does allow us to create sitemaps (https://docs.wagtail.org/en/stable/reference/contrib/sitemaps.html), so if this is something we want to do, I can extend this issue to add a sitemap as well.
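For reference, a minimal sketch of the wiring the Wagtail sitemaps docs describe (the URL placement is an assumption; adapt to our existing `urls.py`):

```python
# settings.py — Wagtail's sitemap support builds on django.contrib.sitemaps
INSTALLED_APPS = [
    # ...
    "django.contrib.sitemaps",
]

# urls.py — serve the sitemap at /sitemap.xml
from django.urls import path
from wagtail.contrib.sitemaps.views import sitemap

urlpatterns = [
    # ...
    path("sitemap.xml", sitemap),
]
```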
Let's expand this ticket to include implementing a sitemap. We've been talking about it for a while and it seems like an improvement with multiple benefits. Thanks, @SaptakS
We're seeing a lot of crawler traffic, which can slow down other site operations. Let's create a carefully crafted robots.txt that defines what's OK to crawl and what's not.
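A rough sketch of what that robots.txt might look like, combining the suggestions in this thread (the `/api/` path and exact patterns are assumptions; the longer `Allow` prefix wins over the shorter `Disallow` under standard precedence rules, so pagination stays crawlable):

```
User-agent: *
Allow: /all-incidents/?page=
Disallow: /all-incidents/?
Disallow: /api/
Sitemap: https://pressfreedomtracker.us/sitemap.xml
```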