crate / crate-docs-theme

A Sphinx theme for the CrateDB documentation.
https://crate-docs-theme.readthedocs.io/
Apache License 2.0

Missing `index.html` within link generated to designate the canonical URL #483

Closed amotl closed 7 months ago

amotl commented 7 months ago

Problem

On pages rendered from index.rst/index.md files, like this one about built-in functions, the index.html page name is omitted from the href of the rendered <link rel="canonical"> element.

<link rel="canonical" href="https://cratedb.com/docs/crate/reference/en/latest/general/builtins/" />

This flaw causes downstream problems with search engine indexing, as outlined in the details below.

Details

@msbt outlined more details about the problem. Thanks!

The massive number of non-indexed pages is a result of our docs setup. The top 2 non-Google issues ("Alternate page with proper canonical tag" and "Page with redirect") are mostly because of the redirect chains and versioning that we have in place.

If you take this URL as an example:
https://cratedb.com/docs/crate/reference/en/master/general/builtins/subquery-expressions.html

The page also exists in these (and probably some more) versions:
https://cratedb.com/docs/crate/reference/en/5.6/general/builtins/subquery-expressions.html
https://cratedb.com/docs/crate/reference/en/5.5/general/builtins/subquery-expressions.html

Both links above have this URL set as canonical:
https://cratedb.com/docs/crate/reference/en/latest/general/builtins/subquery-expressions.html

This will obviously result in a lot of unindexed pages; not sure if this can be fixed, since it's not really broken.

To show an example of your links, this URL shows as not-indexed because of "Alternate page with proper canonical tag":
https://cratedb.com/docs/guide/install/cloud/aws/index.html

If you inspect that URL, you can see the "User-declared canonical" is:
https://cratedb.com/docs/guide/install/cloud/aws/ (which is indexed)

So the `index.html` gets omitted by RTD, and every docs page ending with `index.html` gets a not-indexed issue attached to it. Can we maybe add that to the canonical URL to avoid that, @amotl?
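To illustrate what the request amounts to, here is a minimal sketch that appends `index.html` to a directory-style canonical URL. The helper name `with_index_html` is hypothetical, and this is not how the theme or RTD actually computes the link; it only shows the intended mapping.

```python
from urllib.parse import urlsplit, urlunsplit


def with_index_html(canonical_url: str) -> str:
    """Append ``index.html`` when the URL path ends in a slash.

    Hypothetical helper for illustration; URLs that already name
    a page (e.g. ``foo.html``) are returned unchanged.
    """
    parts = urlsplit(canonical_url)
    path = parts.path
    if path.endswith("/"):
        path += "index.html"
    return urlunsplit(parts._replace(path=path))


# Directory-style canonical URL, as reported in the search console.
print(with_index_html("https://cratedb.com/docs/guide/install/cloud/aws/"))
```

With such a mapping, the canonical URL of an index page would match the URL the page is actually served under, so both variants would no longer compete with each other.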

References

/cc @matkuliak, @michaelkremmel

amotl commented 7 months ago

Observations

We did a few orientation flights on this topic together with @msbt, and came to the conclusion that RTD might have deprecated the `canonical_url` setting already, as it might only have been required for Sphinx versions before 1.8 and the RTD of that time.

Today, it is advised to use html_baseurl:

For sphinx >=1.8 we can use html_baseurl to set the canonical URL.

-- https://github.com/readthedocs/readthedocs.org/pull/7540/files

... but not to define it yourself:

If you are using Sphinx, Read the Docs will automatically add a default value of the html_baseurl setting matching your canonical domain.

If you are using a custom html_baseurl in your conf.py, you have to ensure that the value is correct. This can be complex, supporting pull request builds (which are published on a separate domain), special branches or if you are using subprojects or translations. We recommend not including a html_baseurl in your conf.py, and letting Read the Docs define it.

-- https://docs.readthedocs.io/en/stable/guides/canonical-urls.html
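A minimal sketch of how a `conf.py` could follow that recommendation, assuming the `READTHEDOCS` environment variable that Read the Docs sets to `True` during builds. The helper name `resolve_html_baseurl` and the local URL are hypothetical, chosen only for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_html_baseurl(environ: Mapping[str, str], local_baseurl: str) -> Optional[str]:
    """Return None on Read the Docs, so RTD can define html_baseurl itself.

    Outside of RTD (e.g. local builds), fall back to the given URL.
    """
    if environ.get("READTHEDOCS") == "True":
        return None
    return local_baseurl


# In conf.py: only assign html_baseurl when a value was resolved.
# "http://localhost:8000/" is an assumed address of a local preview server.
_baseurl = resolve_html_baseurl(os.environ, "http://localhost:8000/")
if _baseurl is not None:
    html_baseurl = _baseurl
```

Leaving `html_baseurl` unset on RTD avoids having to get the value right for pull request builds, special branches, subprojects, and translations, which is the complexity the guide warns about.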

Thoughts

In this case, the section in readthedocs-insert.html.tmpl might actually be a backward-compatibility thing?

References I

We are not sure if each one of them is relevant. However, all are about fixing or improving the situation wrt. canonical links, in one way or another. In this spirit, I am enumerating them here, because there is a chance we missed something on the upgrade path since Sphinx 1.8 (~10 years ago?).

References II

Also discovered those, from 2023.

amotl commented 7 months ago

Through some cleanups and refactorings, we removed some configuration overhead, and fixed the issue described above, still using a few Crate-specific workarounds.

The improvements have been released with version 0.31.2. Thanks for your excellent support, @msbt! 💯