Refactor TOC sanitation

waylan commented 6 months ago

All postprocessors are run on heading content (not just RawHtmlPostprocessor).
Footnote references are stripped from heading content. Fixes #660.
A more robust striptags is provided to convert headings to plain text. Unlike, markupsafe's implementation, HTML entities are not unescaped.
Both the plain text name and rich html and unescaped raw data-toc-label are saved to toc_tokens, which means users can now access the full rich text content of the headings directly from the toc_tokens.
data-toc-label is sanitized separate from heading content.
~A html.unescape call added to slugify and slugify_unicode, which ensures slugify operates on Unicode characters, rather than HTML entities. By including in the functions, users can override with their own slugify functions if they desire.~ A html.unescape call is made just prior to calling slugify so that slugify only operates on Unicode characters. Note that html.unescape is not run on the name or html.

Note that this first commit includes minimal changes to the tests to show very little change in behavior (mostly the new html attribute of the toc_tokens was added). A refactoring of the tests will be in a separate commit.

waylan commented 6 months ago

In my initial commit, I deleted a couple public functions which are no longer needed. I probably should have left them in and marked them as deprecated instead. Do we need to do that? Are others calling them?

Also, it occurs to me that these changes could result in various slugs being different, which would break existing links out in the wild. True, none of the slugs in our own tests are changed, but by fully rendering the HTML before generating a slug could result in additional/different content from any third-party extensions that make use of postprocessors.

vedranmiletic commented 6 months ago

FWIW, I installed this using

pip install git+https://github.com/Python-Markdown/markdown.git@refs/pull/1441/head

and tested it on my group's website page that had issues, angle brackets are removed correctly and the e-mails show up nicely. Not sure I can help with actual code review, but from a user's perspective, this fixes the issue.

waylan commented 5 months ago

I finished updating the tests and documentation. I think this is ready to go.

waylan commented 5 months ago

@oprypin @pawamoy I wanted to check with you before I did a release of this. As this is a change in behavior that affects your projects (in that a bug was fixed which allowed markup to be passed through to toc_token['name'] in certain circumstance), I wanted to make sure you where ready for the release before I publish it.

Presumably at a minimum, MkDocs will need to do a release which includes mkdocs/mkdocs#3578 before I do my release. And I am assuming mkdostrings will want to do a release which takes advantage of mkdocs/mkdocs#3578 as well.

For the record, I intend to release this as version 3.6, so any earlier releases of your projects will likely want to be constrained to markdown<3.6 for any users who want markup in their toc token names.

pawamoy commented 5 months ago

Thanks for the heads up. I'm not ready for this. I'll start by adding an upper bound in mkdocstrings and publish a new release. I should be able to do that today, but if not, before the end of the week for sure :slightly_smiling_face: I'll report back once it's done!

oprypin commented 5 months ago

I think actually there's no direct interaction with MkDocs, the release can be made here first. Extracting the primary heading has a separate implementation in MkDocs and extracting secondary headings will just change for the better with this release but still nothing to be done in MkDocs

waylan commented 5 months ago

Well, there definitely is interaction with mkdocstrings. According to my own testing, the show_symbol_type_toc feature breaks with this change. Presumably, mkdocstings will need to retrieve the value from token['data-toc-label'] rather than token['name'] as it does now. Therefore, at a minimum, I'll wait for an indication that mkdocstrings is ready before doing a release.

pawamoy commented 5 months ago

I've published a new version of mkdocstrings-python with an upper bound on Python-Markdown 3.6, you can release it :)

pawamoy commented 5 months ago

Hello again! Today someone reported that we should support version 3.6, so I made time to work on this. And it turns out I can do this very simple thing:

class _TocLabelsTreeProcessor(Treeprocessor):
    def run(self, root: Element) -> None:  # noqa: ARG002
        self._override_toc_labels(self.md.toc_tokens)  # type: ignore[attr-defined]

    def _override_toc_labels(self, tokens: list) -> None:
        for token in tokens:
            if token["name"] != token["data-toc-label"]:
                token["name"] = token["data-toc-label"]
                self._override_toc_labels(token["children"])

...and register this tree processor in our outer MkdocstringsExtension:

        md.treeprocessors.register(
            _TocLabelsTreeProcessor(md),
            "mkdocstrings_post_toc_labels",
            priority=4.5,  # Right after 'toc'.
        )

I suppose it's a bit brutal and instead of blindly overwriting names with toc-labels, I should add a way to detect when it comes from mkdocstrings and overwrite only those? For example, if token["name"] != token["data-toc-label"] and token["data-toc-label"].startswith("<code")?

Otherwise it means I'm just removing the additional safety that was brought with this PR, right? Is this really bad or is there no practical impact :thinking:?

waylan commented 5 months ago

The primary concern for MkDocs is that the page title should not ever contain any markup (to avoid markup in <title> etc). As the page title comes from token['name'], then by extension, token['name'] should never contain any markup... but that only really matters for the top <h1> in a page. That said, in Markdown's own documentation, mkdocstrings does set a value to data-toc-label for the top <h1> on each page it generates.

So, yes, to avoid breaking themes (by causing them to generate invalid HTML), token['name'] should never contain any markup. The correct fix is to pull the value from token['data-toc-label'] (and why I added it). However, that would require an edit to theme templates when building the TOC. And that is something theme devs need to do, unless you can provide a way to override that block of a theme. IIRC, there is not a straightforward way for an extension to do that (it interferes with a user's overrides).

So, in the end, we have a chicken and egg problem. Theme's need to support token['data-toc-label'] directly when building a page's toc, but they are not likely to do that unless users and/or extensions are assigning values to token['data-toc-label'] which contain markup.

waylan commented 5 months ago

Two other factors come into play,

For a theme to access token['data-toc-label'], then MkDocs needs to pass that on. I don't think it does (needs confirmation).
In mkdocs/mkdocs#3578, the same sanitation code is used by MkDocs to sanitize page titles. Therefore, assigning markup to `token['name'] may not be a problem after all as MkDocs will sanitize it before setting the value to the page title.

In other words, at this time, your suggested solution may actually be the only workable one.

pawamoy commented 5 months ago

Thank you!

I'm not sure to understand why there is a data-toc-label attribute if it is only use to override the item name which is otherwise derived from the heading's text/id :thinking: I thought data-toc-label had exactly this use of providing an alternate label for the table of contents (label as, displayable text that doesn't affect anything else). But from your answer I gather that themes actually never used this attribute and always relied on the name itself, and it only seemed to work as I expected it to work because the toc-label was not sanitized properly. Now that it is, it unveiled the true inner working of themes, which is to use the name directly. Makes sense!

Anyway, I'll go with the solution above then, thanks! :slightly_smiling_face:

waylan commented 5 months ago

Prior to this present change, the assumption was that both name and data-toc-label never contained markup. It was a weird anomaly that this ever worked for you. Therefore, with this change only sanitized content of data-toc-label gets copied to name (no markup is preserved as name still never has markup). The thinking is that if you want the raw content of data-toc-label then you need to access it directly from data-toc-label. In other words, with this change, we are officially supporting using markup in data-toc-label but to access that markup you can't get if from name; you have to get it from data-toc-label directly.

Python-Markdown / markdown

Refactor TOC sanitation #1441