CivicActions / guidebook

The home of policies and guidelines that make up CivicActions
https://guidebook.civicactions.com/en/latest/
Creative Commons Attribution 4.0 International

The handbook sitemap.xml is referencing branches, causing broken links in google search #686

Closed dmundra closed 3 years ago

dmundra commented 3 years ago

A random issue I stumbled upon. It appears that the sitemap.xml (https://handbook.civicactions.com/sitemap.xml) for the handbook is only listing links to various branches. I am guessing that Google is then indexing the pages from that link and generating links that no longer work.

An example of the issue (worked on Jan 29, 2021):

Ideally the sitemap.xml would not include branches and, if possible, would list all of the actual pages under the latest branch.

grugnog commented 3 years ago

This sitemap is managed by Read the Docs and I don't think we can update/disable it directly. We could disable rtd-bot, which spins these review environments up/down, although they can be helpful at times. https://stackoverflow.com/posts/63581610/revisions has an alternate approach we could also explore.
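The Stack Overflow approach amounts to shipping a custom robots.txt alongside the built docs. A minimal sketch, assuming the handbook is a standard Sphinx project on Read the Docs (the option name below is stock Sphinx, not something confirmed in this repo):

```python
# conf.py (hypothetical sketch)
# Sphinx copies files listed in html_extra_path into the root of the
# built HTML output, so a robots.txt placed next to conf.py ends up
# served at https://handbook.civicactions.com/robots.txt.
html_extra_path = ["robots.txt"]
```

Read the Docs serves a project-supplied robots.txt from the default version's build output, which is what lets this override the generated one.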

dmundra commented 3 years ago

I think the review environments are definitely useful. Nice find on the Stack Overflow post. I looked at https://handbook.civicactions.com/robots.txt and see that it is already disallowing those review environments, so I wonder why Google indexed the older ones. More research needed to figure out the cause, I guess.

grugnog commented 3 years ago

@dmundra I manually hid those environments earlier to see what that would do, so I think that works, but it isn't a long-term solution. I am not sure if there is a way to make rtd-bot hide these by default; that could work?

dmundra commented 3 years ago

@grugnog I couldn't find much documentation on rtd-bot configuration or anything related to rtd-bot. My google-fu is low on that one.

The RTD documentation mentions exactly what you saw with the sitemap.xml, so maybe we should follow the Stack Overflow answer and provide a custom robots.txt file that denies everything and allows only the latest URL. Thoughts?
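A deny-all robots.txt along those lines might look like the following sketch (assuming the docs live under `/en/latest/` as they do on Read the Docs; note that `Allow` is a Google extension rather than part of the original robots.txt convention, though major crawlers honor it):

```
User-agent: *
Disallow: /
Allow: /en/latest/
```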

dmundra commented 3 years ago

Updated robots.txt to let Google know to ignore other branches: https://handbook.civicactions.com/en/latest/README/robots.txt