Closed ghost closed 5 years ago
No need for an extra <a>
tag, you could add the id
attribute directly to the heading tag.
Looking into this, a few questions come to mind:
How do we want to sanitize the title so the identifier is both user-friendly and still valid according to the html spec?
My first idea was to just replace any non-alphanumerical character with a dash and turn everything to lowercase.
What’s next?
would be transformed to what-s-next-
Do we want to keep sanitized characters at the start/end of the identifier?
¿que pasa?
=> -que-pasa-
or que-pasa
?
Do we want to keep repeating dashed in the identifier?
Helix - test
=> helix---test
or helix-test
?
Identifiers have to be unique, so how would we want to handle 2 same labeled H2s in different H1s? I would propose to prefix with the parent H1 identifier. Something like:
h1(id="foo") Foo
h2(id="foo--baz") Baz
h1(id="bar") Bar
h2(id="bar--baz") Baz
Are we targeting HTML 4 or HTML4.1+?
The former was quite strict about identifiers (only [a-zA-Z][a-zA-Z0-9_-.:]+
if I remember), while the latter supports anything that is not a space (\s+
).
I'd be tempted to go with an HTML4 compatible id to be backward compatible, but would also be ok to relax the rule
Good questions ;) I think we should just do it exactly like Github does it. I assume this will answer all the questions, assuming they support each cases. By default, I see they do not prefix the h2 with h1 but could not quickly find a duplicate case.
Also, github.com adds a <a>
tag with the id mainly to be able to have it visual and be able to copy/paste the link easily. But default, this link could be hidden.
@kptdobe, @rofe: so GitHub only uses the <a>
because they actually make a visible link on the header so you can easily copy-paste. While this is usually a great pattern for documentation-related websites and wikis, I've rarely seen this used anywhere else (i.e. product pages, news articles, blogs, etc.).
This also is an opinionated way on how to handle anchors that I'm not sure I would like to impose on our customers. And I think this is also in line with @rofe's earlier comment.
I can look into generating the same identifiers that GitHub would do, but I would just leave these as id
s on the headings that project teams can adjust according to their needs either in the pre.js or post-load via JS (which is what I've seen most of the time). What do you think?
# Foo
=> <h1 id="foo">Foo</h1>
<- the default
=> <h1><a id="foo" href="#foo">link</a>Foo</h1>
<- project version
=> <a id="foo" href="#foo">link</a><h1>Foo</h1>
<- project version
etc.
We would propose the first by default, and the project team can decide what to do with it.
Good idea to do it like GitHub. They actually use some powerful advanced regex in python that properly supports all unicode blocks for foreign languages[0], but we don't have an easy JS equivalent at the moment. I see 2 possible approaches here:
I'm preparing a gist with as many examples as I can think of as test cases in [3].
why do you want to keep the unicode letters in the anchor name?
also see https://mathiasbynens.be/notes/es6-unicode-regex
i think that replace(/\W+/ug, “”)
would be enough
why do you want to keep the unicode letters in the anchor name?
The issue was about making it work like the github anchors, so if that is the actual requirement, I don't see how to achieve that result without a regex that fully supports all accented letters, but more importantly all foreign scripts as well. You can check the gist linked above for some examples.
replace(/\W+/ug, “”)
I tried that, as well as various combinations of \p{Script=…}
and it didn't work for me. Just simple accented characters are not even properly recognized by \W
and just stripped.
ES2018 is supposed to get proper unicode support in regex according to https://github.com/tc39/proposal-regexp-unicode-property-escapes, but until then it seems we are quite limited.
i think keeping the non ascii characters is overrated :-) but ok. if this is what we want :/)
I had an ascii-only version earlier on (1st commit actually)… I'm ok with any approach, really.
Now if we stick to the HTML5 spec, we should probably support unicode identifiers at some point, especially when we start supporting proper localization. Can't really have Chinese markdowns with Chinese headings and expect to transliterate to ascii identifiers I guess.
i agree. you still need to ensure that the anchors are unique.
The algorithm should ensure they are, even have unit tests for it.
If there is a collision, I append -1
, -2
, etc. like github does it, and if you have a collision between the 2nd "Foo" (foo-1
) and another header "Foo-1", then the latter becomes foo-1-1
as github does it
Sorry to interrupt, but why not just use lodash's kebabCase? https://lodash.com/docs/4.17.11#kebabCase
EDIT: It does have proper unicode support…
@koraa That is an excellent suggestion, didn't think about it. I tried playing with it on the gist I referenced and it actually performs really well for our needs.
Here is a comparative table of the various approaches. (I highlighted the differences from GitHub with ⚠️)
Heading | GitHub | XRegExp | KebabCase | Slugger |
---|---|---|---|---|
Latin script | latin-script | latin-script | latin-script | latin-script |
simpleAscii | simpleascii | simpleascii | ⚠️simple-ascii | ⚠️simple-ascii |
fo0 | fo0 | fo0 | ⚠️fo-0 | fo0 |
bar | bar | bar | bar | bar |
baz | baz | baz | baz | baz |
qux | qux | qux | qux | qux |
baz | baz-1 | baz-1 | baz-1 | baz-1 |
0corge | 0corge | 0corge | ⚠️0-corge | 0corge |
baz-1 | baz-1-1 | baz-1-1 | baz-1-1 | baz-1-1 |
Accented characters | accented-characters | accented-characters | accented-characters | accented-characters |
Fòô | fòô | fòô | ⚠️foo | fòô |
Bår | bår | bår | ⚠️bar-1 | bår |
Bãz | bãz | bãz | ⚠️baz-2 | bãz |
Qúx | qúx | qúx | ⚠️qux-1 | qúx |
Çorge | çorge | çorge | ⚠️corge | çorge |
Special#characters | specialcharacters | specialcharacters | ⚠️special-characters | ⚠️special-characters |
Punctuation - signs | punctuation---signs | punctuation---signs | ⚠️punctuation-signs | punctuation---signs |
Foo!! | foo | foo | ⚠️foo-1 | foo |
¿¿Bar?? | bar-1 | bar-1 | ⚠️bar-2 | ⚠️¿¿bar |
{Baz} | baz-2 | baz-2 | ⚠️baz-3 | baz-2 |
(Qux) | qux-1 | qux-1 | ⚠️qux-2 | qux-1 |
\ |
corge | corge | ⚠️corge-1 | corge |
-–—Grault—–- | -grault- | -grault- | ⚠️grault | -grault- |
---GARPLY--- | ---garply--- | ---garply--- | ⚠️garply | ---garply--- |
:;<=>?@ | ||||
_ForeignScripts | _foreignscripts | _foreignscripts | ⚠️foreign-scripts | _foreignscripts |
汉字 漢字 | 汉字-漢字 | 汉字-漢字 | 汉字-漢字 | 汉字-漢字 |
**العربية | العربية | العربية | ** العربية | iالعربية |
देवनागरी | देवनागरी | ⚠️दवनगर | देवनागरी | देवनागरी |
বাংলা-অসমীয়া | বাংলা-অসমীয়া | ⚠️বল-অসময | বাংলা-অসমীয়া | বাংলা-অসমীয়া |
Кириллица | кириллица | кириллица | кириллица | кириллица |
かな カナ | かな-カナ | かな-カナ | かな-カナ | かな-カナ |
한글 조선글 | 한글-조선글 | 한글-조선글 | 한글-조선글 | 한글-조선글 |
తెలుగు | తెలుగు | ⚠️తలగ | తెలుగు | తెలుగు |
தமிழ் | தமிழ் | ⚠️தமழ | தமிழ் | தமிழ் |
ગુજરાતી | ગુજરાતી | ⚠️ગજરત | ગુજરાતી | ગુજરાતી |
ಕನ್ನಡ | ಕನ್ನಡ | ⚠️ಕನನಡ | ಕನ್ನಡ | ಕನ್ನಡ |
မြန်မာ | မြန်မာ | ⚠️မနမ | မြန်မာ | မြန်မာ |
മലയാളം | മലയാളം | ⚠️മലയള | മലയാളം | മലയാളം |
ไทย | ไทย | ไทย | ไทย | ไทย |
ਗੁਰਮੁਖੀ | ਗੁਰਮੁਖੀ | ⚠️ਗਰਮਖ | ਗੁਰਮੁਖੀ | ਗੁਰਮੁਖੀ |
ລາວ | ລາວ | ລາວ | ລາວ | ລາວ |
ଉତ୍କଳ | ଉତ୍କଳ | ⚠️ଉତକଳ | ଉତ୍କଳ | ଉତ୍କଳ |
Ελληνικό | ελληνικό | ελληνικό | ελληνικό | ελληνικό |
Ελληνικό | ελληνικό-1 | ελληνικό-1 | ελληνικό-1 | ελληνικό-1 |
Maths | maths | maths | maths | maths |
E ≠ mc² | emc | ⚠️e--mc² | ⚠️e-≠-mc | ⚠️e-≠-mc² |
∇² = ±φ | --φ | ⚠️²--φ | ⚠️∇-φ | ⚠️∇²--±φ |
While lodash's kebabCase
does not exactly behave like GitHub, it does offer an easy way to cover the needs of an identifier. It even adds the capability to transform camel case headings into a -
-separated identifier. Also offers the benefit of not being something we have to maintain ourselves. So unless we actually have a real need to stay as close as possible to the GitHub behavior, I'd be tempted to go with @koraa's suggestion
I also notice that the :;<=>?@
heading generates no identifier in any of the solutions.
Should I go ahead and introduce a fallback to heading-1
or similar?
I would be interested in meeting the person that uses such a heading, so I say we leave it in case anyone wants to report that bug ;) But great edge case search! I am impressed claps
@koraa True… I initially thought I could provide a few examples from a mathematician's or physicist's point of view that would write some mathematical formula as heading title (E=mc², ∇² = φ, etc.) but even these actually work with kebabcase (all greek and maths symbols are supported)… so the only edge case I can think of that remains is just a gibberish sequence of special characters which you probably don't need to target anyway with an anchor.
:tada: This issue has been resolved in version 1.5.0 :tada:
The release is available on:
Your semantic-release bot :package::rocket:
:tada: This issue has been resolved in version 1.5.3 :tada:
The release is available on:
Your semantic-release bot :package::rocket:
:tada: This issue has been resolved in version 2.2.0 :tada:
The release is available on:
Your semantic-release bot :package::rocket:
Anchors are nicely managed on github.com: when rendering the markdown, github adds an anchor before each heading and shows an icon that gives the link to have a direct access to the corresponding heading. While working on our Helix documentation, I realised this would be a really useful feature, especially to handle the navigation: you can have a long md file but create a nav that give direct access to a "section" of content below in the page. Without the anchors, I had to split the long md file into various small md files to give direct access.
The pipeline could automatically generate an anchor (
<a>
tag with an id) for each "section" and "heading". Showing or not an icon would be delegated to the project and could be done with CSS.This issue was moved by trieloff from adobe/project-helix#154.