Python-Markdown / markdown

A Python implementation of John Gruber’s Markdown with Extension support.
https://python-markdown.github.io/
BSD 3-Clause "New" or "Revised" License
3.74k stars 858 forks source link

Table of content slugify generates invalid IDs #1320

Closed nicbou closed 1 year ago

nicbou commented 1 year ago

The heading ### 1. Create an account has the ID #1-create-an-account, which is not valid because it starts with a digit. This breaks document.querySelector and document.getElementById.

Thankfully, slugify is configurable, but it's something that should likely be corrected on the default function too.

facelessuser commented 1 year ago

It's not invalid per se, it just requires CSS escaping in a selector to target it. If you do not want your ID to start with a number, which it will if you let it get auto-slugified, then you should specify a specific ID that you are okay with (make sure you've enabled attr_list extension.

### 1. Create an account {#create-an-account}
waylan commented 1 year ago

I think the appropriate question here is this: Is there a spec which explicitly disallows a digit as the first character of an id? If so, we should somehow address that. If not, then @facelessuser's suggested workarounds are sufficient.

nicbou commented 1 year ago

I did a bit of research, and HTML 5 does allow an ID like "1-test-title". HTML 4 had more restrictions. Like @facelessuser said, it's only a problem with CSS selectors.

document.getElementById('1-test-title') works fine, but document.querySelector('#1-test-title') causes a DOMException.

In hindsight, this is not an issue that Python-Markdown should deal with.

If you have the same problem as I did, you can fix it like this:

from markdown.extensions.toc import TocExtension, slugify

def patched_slugify(value, separator, keep_unicode=False):
    return slugify(value.lstrip(' 0123456789'), separator, keep_unicode)

class MarkdownProcessor(EntryContextProcessor):
    def __init__(self):
        super().__init__()

        self.markdown = markdown.Markdown(extensions=[
            # ...
            TocExtension(slugify=patched_slugify),
            # ...
facelessuser commented 1 year ago

If we are talking about CSS selectors, the spec states that identifiers do not start with numbers. An ID is #<identifier>, classes are .<identifier>. This does not make the ID or classes invalid as classes and identifiers, but you have to use CSS escapes to specify a class or ID that starts with numbers, then it will work.

If I want to target a CSS ID that starts with zero:

document.querySelector('#\\31 -test-title')

Now, does this work directly in a CSS file? Maybe not. But there is no HTML restriction.

facelessuser commented 1 year ago

Interestingly, there are a lot of Markdown parsers that just pass leading numbers through during slugification. There are some that ignore starting numbers, and (GFM) that append user-content on all IDs to avoid this issue.

facelessuser commented 1 year ago

On another note, using numbers at the beginning of an ID does not affect jumping to an ID in a browser, so this is only a CSS restriction. I'm personally not against stripping leading numbers or altering them in some way, but it is technically not invalid HTML, just not easily targetable in a CSS file, but it is targettable with JS `querySelector.

I think that provides enough info for an informed decision.

facelessuser commented 1 year ago

An alternative to stripping the number could be to simply prepend - or _ in front of IDs that start with numbers.

waylan commented 1 year ago

Thanks for the information. As there is no such restriction in the HTML spec and most other implementations have the same/similar behavior, I an inclined to leave the default behavior as-is. And as we already provide a way for a user to provide their own slugify function, I see no reason to take any action.

nicbou commented 1 year ago

I fully agree. Thanks for taking the time. Python Markdown is a great project, and I'm grateful for your work.