Raku / doc-website

Tooling to build/run the documentation website
Artistic License 2.0

Searching for "raku docs" #384

Closed: andinus closed this issue 3 months ago

andinus commented 3 months ago

Searching for "raku docs" on DuckDuckGo returns:

in the first page, rest of the links look okay. Maybe we should add a robots.txt rule to prevent indexing by default and enable it only for docs.raku.org - on Google I see links to http://164.90.207.89:10010/ & https://raku.finanalyst.org/ with the same search term.

Here is the screenshot for DDG: [screenshot, 2024-05-15: "raku docs" at DuckDuckGo]

finanalyst commented 3 months ago

@andinus Here's what I know:

@dontlaugh do you know about docs-stage.raku.org?

@andinus do you have a sample 'robots.txt' we could tweak?

dontlaugh commented 3 months ago

docs-stage is a subdomain we control; it should probably just point to docs-dev with a permanent redirect.

andinus commented 3 months ago

I think that on all sites apart from docs.raku.org we should ask search engines not to scrape at all; this should do the trick:

User-agent: *
Disallow: /

For this, we need to be able to differentiate between the "production" instance and all other instances. Maybe we can make it generate this robots.txt unless a "PRODUCTION_RAKUDOC" environment variable is present?

Here is a more comprehensive example: https://tildes.net/robots.txt
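
Something like this as a container entrypoint could work (a sketch only; the PRODUCTION_RAKUDOC variable name, the /usr/share/caddy web root from the Caddyfile, and the /etc/caddy/Caddyfile path are assumptions):

```sh
#!/bin/sh
# Sketch: generate a deny-all robots.txt on non-production instances only.
# Assumes the site is served from /usr/share/caddy (the `root` in the Caddyfile)
# and that the Caddyfile lives at /etc/caddy/Caddyfile inside the container.
WEBROOT=/usr/share/caddy

if [ -z "$PRODUCTION_RAKUDOC" ]; then
    # Not production: tell every crawler to stay away.
    printf 'User-agent: *\nDisallow: /\n' > "$WEBROOT/robots.txt"
else
    # Production: make sure no leftover deny-all file is served.
    rm -f "$WEBROOT/robots.txt"
fi

exec caddy run --config /etc/caddy/Caddyfile
```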

dontlaugh commented 3 months ago

I can handle that for the container. I think that sort of configurability might be available in the Caddyfile, but failing that, we can embed a script in the container.

dontlaugh commented 3 months ago

We should be able to respond with a literal robots.txt using the heredoc feature of this directive: https://caddyserver.com/docs/caddyfile/directives/respond

I just need to figure out how to use an env var as a conditional. I'm fairly certain it's possible.

Update: maybe this will work as a start. I will test soon.

diff --git a/Caddyfile b/Caddyfile
index 8e0ce72..78ed22a 100644
--- a/Caddyfile
+++ b/Caddyfile
@@ -6,6 +6,21 @@
         output stdout
     }

+    @prod {
+        expression {env.PROD} == "true"
+    }
+
+    handle @prod {
+        respond /robots.txt 404
+    }
+
+    handle {
+        respond /robots.txt 200 {
+            body "User-agent: *\nDisallow: /"
+            close
+        }
+    }
+
     root * /usr/share/caddy

     encode gzip

dontlaugh commented 3 months ago

I think I got it; opened PR #385

logs:

```
coleman@trajan ~/Code/Raku/doc-website (robots-txt*) 0 % podman run --rm --name docs -d -p 8080:80 quay.io/colemanx/raku-doc-website:robotstxt
ae8089914d3b0dc28367f2e4562a828213d25c3239cfbd820f5461e9b0c09b83
coleman@trajan ~/Code/Raku/doc-website (robots-txt*) 0 % curl -v http://localhost:8080/robots.txt
* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /robots.txt HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< Connection: close
< Content-Type: text/plain; charset=utf-8
< Server: Caddy
< Date: Thu, 16 May 2024 02:26:40 GMT
< Content-Length: 25
<
User-agent: *
* Closing connection
Disallow: /%
coleman@trajan ~/Code/Raku/doc-website (robots-txt*) 0 % podman stop docs
docs
coleman@trajan ~/Code/Raku/doc-website (robots-txt*) 0 % podman run -e PROD=true --rm --name docs -d -p 8080:80 quay.io/colemanx/raku-doc-website:robotstxt
02c5e57a0a4b79bc842e8a2593081ce1968c799040b465dcbfac66f2be033c1c
coleman@trajan ~/Code/Raku/doc-website (robots-txt*) 0 % curl -v http://localhost:8080/robots.txt
* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /robots.txt HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 404 Not Found
< Server: Caddy
< Date: Thu, 16 May 2024 02:27:14 GMT
< Content-Length: 0
<
* Connection #0 to host localhost left intact
```
dontlaugh commented 3 months ago

We've got this up on docs-dev and docs-stage:

https://docs-dev.raku.org/robots.txt
https://docs-stage.raku.org/robots.txt

Prod returns 404: https://docs.raku.org/robots.txt
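
If anyone wants to re-check from a shell, something like this (just curl against the URLs above) shows the status each instance serves for /robots.txt:

```sh
# dev/stage should return the deny-all file (200); production should return 404.
for host in docs-dev.raku.org docs-stage.raku.org docs.raku.org; do
    echo "== $host"
    curl -si "https://$host/robots.txt" | head -n 1
done
```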

andinus commented 3 months ago

Great! I'm not sure whether search engines will remove existing pages with this approach, but at least they won't add any new ones.