carpentries / sandpaper

User Interface for The Carpentries Workbench
https://carpentries.github.io/sandpaper

Implement canonical URLs and redirects (if possible) #440

Open zkamvar opened 1 year ago

zkamvar commented 1 year ago

The idea of canonical URLs was initially brought up in #43, but it never moved beyond discussion.

Basically, if someone wants to visit https://carpentries.github.io/sandpaper-docs/episodes.html, they can do so with two links:

but if they use https://carpentries.github.io/sandpaper-docs/episodes/, or https://carpentries.github.io/sandpaper-docs/episodes/index.html then they get a 404.

The reason is that the first two links point to a file, whereas the last two point to a folder. On top of that, analytics will count every variant as a different page unless we establish a canonical URL.
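To make the goal concrete, here is a minimal sketch of what "canonical" could mean for these four variants, assuming the `.html` form is chosen as canonical. The function name `canonical_url` and the `SITE_ROOT` default are hypothetical and are not part of {sandpaper}.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: the lesson is served from this root (taken from the example above).
SITE_ROOT = "https://carpentries.github.io/sandpaper-docs/"


def canonical_url(url: str, site_root: str = SITE_ROOT) -> str:
    """Map file-style and folder-style variants of a page onto its .html URL."""
    root_path = urlsplit(site_root).path  # e.g. "/sandpaper-docs/"
    path = urlsplit(url).path
    if not path.startswith(root_path):
        raise ValueError(f"{url!r} is not below {site_root!r}")

    # Path of the page relative to the lesson root, without surrounding slashes.
    page = path[len(root_path):].strip("/")

    # Folder-style variant: "episodes/index.html" -> "episodes"
    if page.endswith("/index.html"):
        page = page[: -len("/index.html")]

    # The lesson root itself maps to its index page.
    if page in ("", "index.html"):
        page = "index.html"
    elif not page.endswith(".html"):
        # Extensionless variant: "episodes" -> "episodes.html"
        page += ".html"

    return urljoin(site_root, page)
```

With this sketch, `episodes.html`, `episodes`, `episodes/`, and `episodes/index.html` all map to the same `.html` URL, which is the single address analytics would need to see.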

{pkgdown} has implemented redirects, but I am not sure how they would work here, because we want a redirect that lives inside a folder with the same name as the file.
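As a rough illustration of the "redirect inside a folder with the same name as the file" idea, the sketch below writes an `episodes/index.html` stub that forwards to `../episodes.html` using a meta refresh, a common technique on static hosts. This is an assumption about one possible approach, not how {pkgdown} or {sandpaper} implement it; `write_redirect_stub` and `REDIRECT_TEMPLATE` are hypothetical names. How a given host resolves the extensionless `/episodes` once both the file and the folder exist is not covered here and would need testing.

```python
from html import escape
from pathlib import Path

# Minimal stub page: an immediate client-side redirect plus a canonical hint.
REDIRECT_TEMPLATE = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta http-equiv="refresh" content="0; url={target}">
    <link rel="canonical" href="{target}">
  </head>
</html>
"""


def write_redirect_stub(site_dir: Path, page: str) -> Path:
    """Create <site_dir>/<stem>/index.html that forwards to ../<stem>.html.

    `page` is a file name in the built site root, e.g. "episodes.html".
    """
    stem = Path(page).stem           # "episodes.html" -> "episodes"
    target = f"../{page}"            # relative link from inside the folder
    stub = site_dir / stem / "index.html"
    stub.parent.mkdir(parents=True, exist_ok=True)
    stub.write_text(
        REDIRECT_TEMPLATE.format(target=escape(target, quote=True)),
        encoding="utf-8",
    )
    return stub
```

Calling `write_redirect_stub(Path("site"), "episodes.html")` would create `site/episodes/index.html`, so the two folder-style URLs would forward to the file instead of returning a 404.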

bencomp commented 1 year ago

It is interesting that the URLs with .html and without both resolve, because I don't see two files when I build a lesson. Is that a GitHub thing?

Just noting that canonical URLs for the whole lesson came up in #481 as a building block to link episodes/chapters to the lesson in the metadata.

zkamvar commented 1 year ago

> It is interesting that the URLs with .html and without both resolve, because I don't see two files when I build a lesson. Is that a GitHub thing?

I did not think about this, but yes, this is absolutely a GitHub thing and it runs into the boundaries of my knowledge of networking -_-

Take for example the beta phase preview of the lessons (deployed on AWS):

https://preview.carpentries.org/instructor-training/02-practice-learning.html (works)
https://preview.carpentries.org/instructor-training/02-practice-learning (fails)

$ curl -I https://preview.carpentries.org/instructor-training/02-practice-learning.html
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 62331
Connection: keep-alive
Date: Thu, 29 Jun 2023 13:27:47 GMT
Last-Modified: Tue, 27 Jun 2023 00:16:47 GMT
ETag: "2fab9dad8bdfa9df0a1753d25a4bb2cf"
Server: AmazonS3
Vary: Accept-Encoding
X-Cache: Miss from cloudfront
Via: 1.1 d6cbeccd9a6d25b691d204399bf8b728.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: SFO5-P2
X-Amz-Cf-Id: ftzMBTuQ4JvGacNZZ70dw3ZYTFJHJ0wyhmSI49x5uIiMKkhtgTi1ZQ==
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Referrer-Policy: strict-origin-when-cross-origin
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=31536000
Vary: Origin

$ curl -I https://preview.carpentries.org/instructor-training/02-practice-learning
HTTP/1.1 403 Forbidden
Connection: keep-alive
x-amz-error-code: AccessDenied
x-amz-error-message: Access Denied
Date: Thu, 29 Jun 2023 13:27:49 GMT
Server: AmazonS3
X-Cache: Error from cloudfront
Via: 1.1 94be61e339880d0097634de6934f7710.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: SFO5-P2
X-Amz-Cf-Id: zaSAKzoVYtmJPsIR2xmodwiUhDMtAhDlC5bzSMo8ixBR6iiWLPaDaA==
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Referrer-Policy: strict-origin-when-cross-origin
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=31536000
Vary: Origin

When I look at the pages on GitHub, the two responses are identical; there is not even a redirect:

$ curl -I https://carpentries.github.io/sandpaper-docs/episodes.html
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 90902
Server: GitHub.com
Content-Type: text/html; charset=utf-8
permissions-policy: interest-cohort=()
Last-Modified: Tue, 27 Jun 2023 00:26:06 GMT
Access-Control-Allow-Origin: *
ETag: "649a2c9e-16316"
expires: Thu, 29 Jun 2023 13:36:04 GMT
Cache-Control: max-age=600
x-proxy-cache: MISS
X-GitHub-Request-Id: 6B52:9B94:500FB1:5EF38F:649D866C
Accept-Ranges: bytes
Date: Thu, 29 Jun 2023 13:28:41 GMT
Via: 1.1 varnish
Age: 157
X-Served-By: cache-pdx12332-PDX
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1688045321.324659,VS0,VE1
Vary: Accept-Encoding
X-Fastly-Request-ID: 9a03fb665121bdd4a2d53f703447089dce0becdc

$ curl -I https://carpentries.github.io/sandpaper-docs/episodes
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 90902
Server: GitHub.com
Content-Type: text/html; charset=utf-8
permissions-policy: interest-cohort=()
Last-Modified: Tue, 27 Jun 2023 00:26:06 GMT
Access-Control-Allow-Origin: *
ETag: "649a2c9e-16316"
expires: Thu, 29 Jun 2023 13:35:58 GMT
Cache-Control: max-age=600
x-proxy-cache: MISS
X-GitHub-Request-Id: 5FE4:84EB:50548F:5F37A8:649D8665
Accept-Ranges: bytes
Date: Thu, 29 Jun 2023 13:28:44 GMT
Via: 1.1 varnish
Age: 166
X-Served-By: cache-pdx12331-PDX
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1688045324.075636,VS0,VE1
Vary: Accept-Encoding
X-Fastly-Request-ID: ff6c4cda35fb5b4299f55d7f949a79ecaad846f3
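
The comparison above can also be done programmatically. The sketch below parses `curl -I` output and checks whether two responses describe the same representation (same status, `ETag`, `Content-Length`, and `Last-Modified`). The helper names are hypothetical, and the sample text is trimmed from the GitHub transcripts above.

```python
def parse_headers(raw: str) -> dict:
    """Parse `curl -I` output into a dict with lower-cased header names."""
    lines = raw.strip().splitlines()
    # The first line is the status line, e.g. "HTTP/1.1 200 OK".
    headers = {"status": lines[0].split(" ", 1)[1]}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def same_representation(a: dict, b: dict) -> bool:
    """True if both responses agree on status, ETag, length, and mtime."""
    keys = ("status", "etag", "content-length", "last-modified")
    return all(a.get(k) == b.get(k) for k in keys)


# Trimmed from the two GitHub Pages responses shown above.
WITH_HTML = """HTTP/1.1 200 OK
Content-Length: 90902
Last-Modified: Tue, 27 Jun 2023 00:26:06 GMT
ETag: "649a2c9e-16316"
"""

WITHOUT_HTML = """HTTP/1.1 200 OK
Content-Length: 90902
Last-Modified: Tue, 27 Jun 2023 00:26:06 GMT
ETag: "649a2c9e-16316"
"""

identical = same_representation(parse_headers(WITH_HTML), parse_headers(WITHOUT_HTML))
```

Here `identical` is `True`: GitHub serves the same file under both URLs without redirecting, which is why analytics sees two pages instead of one.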

bencomp commented 1 year ago

Thanks for doing this research. I feel that The Workbench should not rely on this GitHub feature and should instead use the .html URLs as canonical. I noticed the variants while working on https://github.com/carpentries/lesson-development-training/pull/209.

To signal which URL is canonical, you could (or perhaps should) use RFC 6596, which defines the "canonical" link relation.
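
For reference, RFC 6596 lets a page declare its canonical URL either as a `<link>` element in the HTML head or as an HTTP `Link` header. The sketch below builds both forms for a given URL; the function names are hypothetical, and whether a particular static host allows custom response headers is a separate question that is not assumed here.

```python
from html import escape


def canonical_link_tag(url: str) -> str:
    """HTML form: goes inside <head> of every variant of the page."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


def canonical_link_header(url: str) -> str:
    """HTTP form: a Link response header carrying the same relation."""
    return f'Link: <{url}>; rel="canonical"'
```

For the example page, `canonical_link_tag("https://carpentries.github.io/sandpaper-docs/episodes.html")` yields a tag that could be emitted in the lesson template so that every variant of the URL points back to the `.html` address.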