ansible-community / community-website

Ansible Community website (WIP)
https://ansible-community-website.readthedocs.io
Creative Commons Attribution Share Alike 4.0 International

Strategy for robots.txt and sitemap #385

Closed: oraNod closed 7 months ago

oraNod commented 8 months ago

This issue follows up on #279, #355, and #322.

@wbentley15 has requested that robots.txt be updated to keep bots off the ansible.com site. As part of this request, it was suggested that we add meta name="robots" content="none" to HTML pages. This meta tag would be effective in keeping bots off the site. However, because it applies to all bots, including search engine crawlers, pages would also stop appearing in search results. The content="none" directive is equivalent to noindex and nofollow, as per the documentation.
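To illustrate why the meta tag is too broad, here is a minimal Python sketch that reads the robots meta tag from a page and expands none into the two directives it stands for. The class RobotsMetaParser and the function robots_directives are hypothetical names made up for this example, and the sample page is not taken from the site.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        for token in (attributes.get("content") or "").split(","):
            token = token.strip().lower()
            if token == "none":
                # "none" is shorthand for "noindex, nofollow".
                self.directives.update({"noindex", "nofollow"})
            elif token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives declared in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Sample page (an assumption, not real site markup).
page = '<html><head><meta name="robots" content="none"></head></html>'
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```

Because the tag has no notion of which crawler is asking, every compliant bot sees the same noindex and nofollow, which is why per-bot rules in robots.txt are the better fit here.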

I've contacted an expert on search within Red Hat who has confirmed the above and advised that we take the approach of filtering bots in robots.txt.

To block a specific bot:

User-agent: [user agent name]
Disallow: /

For example, to disallow the user agent scambot from the entire site:

User-agent: scambot
Disallow: /

To disallow the identity_theft bot from accessing a specific directory of the site:

User-agent: identity_theft
Disallow: /userinfo/
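One way to sanity-check rules like these before launch is Python's standard urllib.robotparser module. The sketch below assumes both example rules live in one robots.txt; the URLs under www.ansible.com and the helper name build_parser are illustrative assumptions, and scambot and identity_theft are the placeholder bot names from above.

```python
from urllib.robotparser import RobotFileParser

# Example rules from this issue, written in valid robots.txt syntax.
ROBOTS_TXT = """\
User-agent: scambot
Disallow: /

User-agent: identity_theft
Disallow: /userinfo/
"""


def build_parser(text):
    """Parse robots.txt content from a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Illustrative URLs; the paths are assumptions, not real site paths.
checks = [
    ("scambot", "https://www.ansible.com/"),
    ("identity_theft", "https://www.ansible.com/userinfo/profile"),
    ("identity_theft", "https://www.ansible.com/blog/"),
    ("Googlebot", "https://www.ansible.com/"),
]

for agent, url in checks:
    allowed = parser.can_fetch(agent, url)
    print(f"{agent:15} {url:45} allowed={allowed}")
```

With these rules, scambot is blocked everywhere, identity_theft is blocked only under /userinfo/, and any bot without a matching entry (such as a search engine crawler) is still allowed. Note that robots.txt only affects bots that choose to honor it.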

To resolve this issue, we need to do the following before launch:

You can grab the robots.txt file on the dev site at: https://ansible-community-website.readthedocs.io/robots.txt

oraNod commented 7 months ago

Some additional information. The readthedocs project is in the "Active and Hidden" state: https://docs.readthedocs.io/en/stable/versions.html#hidden

This should prevent our test site from getting indexed. However, we should also prevent the robots.txt file that Nikola generates from being copied to readthedocs. I need to send a PR for that.

oraNod commented 7 months ago

The robots.txt file on readthedocs now uses the RTD settings to disallow the dev site: https://ansible-community-website.readthedocs.io/robots.txt

The robots.txt file that Nikola generates uses default settings and does not disallow any content. The Nikola robots.txt file does not get uploaded to readthedocs.