@andinus Here's what I know:

docs-dev.raku.org is a deployment site for a development branch, used to test things before moving big changes to production. It will mainly contain incremental UX changes. It's under Raku community control.

new-raku.finanalyst.org is my personal deployment site where I test things before moving them to a docs-dev environment; it has things that docs-dev does not, such as an error-report page and a Collection plugin examples page.

raku.finanalyst.org is an archived instance of the old website built using my Collection toolchain as a proof of concept. It is not updated.

rakudocs.github.io looks to be an archived instance of the old website, and uses a different toolchain to build the documents.

docs-stage.raku.org gives me a certificate warning. I don't know about this one; it could be a precursor to docs-dev.

@dontlaugh do you know about docs-stage.raku.org?
@andinus do you have a sample 'robots.txt' we could tweak?
docs-stage is a subdomain we control and should probably just point to docs-dev with a permanent redirect
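If we handle the redirect in Caddy as well, a minimal site block along these lines should do it (untested sketch; assumes docs-dev.raku.org is the target, as suggested above):

docs-stage.raku.org {
	redir https://docs-dev.raku.org{uri} permanent
}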
I think on all sites apart from docs.raku.org we should ask search engines not to scrape at all; this should do the trick:
User-agent: *
Disallow: /
For this, we need to be able to differentiate between the "production" instance and all other instances. Maybe we can make it generate this robots.txt unless a "PRODUCTION_RAKUDOC" environment variable is present?
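As a rough illustration of that idea, a container entrypoint could run a small script like this before starting the server (purely a hypothetical sketch; the webroot path matches the Caddyfile later in this thread, but the variable handling is an assumption):

#!/usr/bin/env python3
# Hypothetical entrypoint helper: write a deny-all robots.txt unless the
# PRODUCTION_RAKUDOC environment variable is set.
import os
from pathlib import Path

webroot = Path("/usr/share/caddy")  # assumed webroot, matching the Caddyfile below

if not os.environ.get("PRODUCTION_RAKUDOC"):
    (webroot / "robots.txt").write_text("User-agent: *\nDisallow: /\n")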
Here is a more comprehensive example: https://tildes.net/robots.txt
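Files like that typically mix per-bot rules with a catch-all; a generic sketch of the same pattern (bot and path names purely illustrative):

# Block one aggressive crawler entirely (name illustrative)
User-agent: BadBot
Disallow: /

# Everyone else may crawl, except an expensive endpoint (path illustrative)
User-agent: *
Disallow: /search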
I can handle that for the container. I think that sort of configurability might be available in the Caddyfile. But failing that we can embed a script in the container.
We should be able to respond with a literal robots.txt using the heredoc feature of the respond directive: https://caddyserver.com/docs/caddyfile/directives/respond
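Untested, but going by those docs the heredoc form would look something like this (the status code goes after the closing marker, and heredocs need a reasonably recent Caddy):

respond /robots.txt <<ROBOTS
	User-agent: *
	Disallow: /
	ROBOTS 200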
I just need to figure out how to use an env var as a conditional. I'm fairly certain it's possible.
Update: maybe this will work as a start. I will test soon.
diff --git a/Caddyfile b/Caddyfile
index 8e0ce72..78ed22a 100644
--- a/Caddyfile
+++ b/Caddyfile
@@ -6,6 +6,21 @@
 	output stdout
 }
 
+@prod {
+	expression {env.PROD} == "true"
+}
+
+handle @prod {
+	respond /robots.txt 404
+}
+
+handle {
+	respond /robots.txt 200 {
+		body "User-agent: *\nDisallow: /"
+		close
+	}
+}
+
 root * /usr/share/caddy
 encode gzip
I think I got it; opened PR #385
We've got this up on docs-dev and docs-stage:
https://docs-dev.raku.org/robots.txt
https://docs-stage.raku.org/robots.txt
Prod returns 404 (no robots.txt at all, so crawlers treat the site as crawlable): https://docs.raku.org/robots.txt
Great! I'm not sure if search engines will remove existing pages with this approach, but at least they won't add any new ones.
Searching for "raku docs" on DuckDuckGo returns the results shown in the screenshot below in the first page; the rest of the links look okay. Maybe we should add a robots.txt rule to prevent indexing by default and enable it only for docs.raku.org. On Google I see links to http://164.90.207.89:10010/ and https://raku.finanalyst.org/ for the same search term.

Here is the screenshot for DDG:
[DuckDuckGo results screenshot not preserved]