**Open** · PathogenDavid opened this issue 3 months ago
Having a `robots.txt` makes complete sense, I just never got to dive into how it works properly 🙂

One thing I didn't really think about when writing this is that the main website's `robots.txt` is what actually matters, since the docs repo is nested in a subdirectory. (Similarly for forks, the `robots.txt` in the GitHub Pages website of the user or organization associated with the fork is what actually matters.)

This means we should probably just go the route of adding `<meta name="robots" content="noindex, nofollow">` tags to the `<head>` of every page instead.
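
To make that concrete, here's a rough sketch of a post-build step that could stamp the tag into every generated page. The `site/` output directory and the `IS_CANONICAL_DEPLOY` environment variable are placeholders for however the real build distinguishes the official deployment from a fork, not anything that exists today:

```python
#!/usr/bin/env python3
"""Sketch: inject a noindex/nofollow robots meta tag into generated HTML
pages when the build is not the canonical deployment."""
import os
from pathlib import Path

ROBOTS_META = '<meta name="robots" content="noindex, nofollow">'

def inject_robots_meta(output_dir: str) -> None:
    for page in Path(output_dir).rglob("*.html"):
        html = page.read_text(encoding="utf-8")
        if ROBOTS_META in html:
            continue  # already tagged
        # Assumes the generator emits a plain <head> tag with no attributes.
        html = html.replace("<head>", f"<head>\n    {ROBOTS_META}", 1)
        page.write_text(html, encoding="utf-8")

if __name__ == "__main__":
    # Only tag non-canonical copies (forks, test deployments, etc.).
    if os.environ.get("IS_CANONICAL_DEPLOY") != "true":
        inject_robots_meta("site")
```

In a workflow, the canonical check could be as simple as comparing the repository name against the upstream repo, but that detail depends on how deployment is set up.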
It can be convenient in forks to enable deployment to GitHub Pages for testing. However, this inadvertently creates duplicate copies of the documentation on the wider public internet, which means search engines can find them.

This risks polluting search results with content that is likely outdated. I believe it also risks harming the SEO of the official documentation website. (I'm no SEO expert, but my understanding is that Google in particular harshly penalizes websites that duplicate other websites.)
We should ~~generate a `robots.txt` and/or~~ add the appropriate meta tags to non-canonical copies of the docs website.

As a semi-related aside (since you specify it in the `robots.txt`), we should also enable `sitemap.xml` generation. Looks like it just needs to be turned on.
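
For reference, the way the sitemap normally ties into `robots.txt` is a single `Sitemap:` directive in the root site's `robots.txt`; the URL below is just a placeholder, not the real docs address:

```
User-agent: *
Allow: /

# Points crawlers at the generated sitemap (placeholder URL).
Sitemap: https://example.org/docs/sitemap.xml
```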