Add support for anchors on headings

ghost commented 6 years ago

Anchors are nicely managed on github.com: when rendering the markdown, github adds an anchor before each heading and shows an icon that gives the link to have a direct access to the corresponding heading. While working on our Helix documentation, I realised this would be a really useful feature, especially to handle the navigation: you can have a long md file but create a nav that give direct access to a "section" of content below in the page. Without the anchors, I had to split the long md file into various small md files to give direct access.

The pipeline could automatically generate an anchor (<a> tag with an id) for each "section" and "heading". Showing or not an icon would be delegated to the project and could be done with CSS.

This issue was moved by trieloff from adobe/project-helix#154.

rofe commented 5 years ago

No need for an extra <a> tag, you could add the id attribute directly to the heading tag.

ramboz commented 5 years ago

Looking into this, a few questions come to mind:

How do we want to sanitize the title so the identifier is both user-friendly and still valid according to the html spec? My first idea was to just replace any non-alphanumerical character with a dash and turn everything to lowercase. What’s next? would be transformed to what-s-next-

Do we want to keep sanitized characters at the start/end of the identifier? ¿que pasa? => -que-pasa- or que-pasa?

Do we want to keep repeating dashed in the identifier? Helix - test => helix---test or helix-test?

Identifiers have to be unique, so how would we want to handle 2 same labeled H2s in different H1s? I would propose to prefix with the parent H1 identifier. Something like:

h1(id="foo") Foo
  h2(id="foo--baz") Baz
h1(id="bar") Bar
  h2(id="bar--baz") Baz

Are we targeting HTML 4 or HTML4.1+? The former was quite strict about identifiers (only [a-zA-Z][a-zA-Z0-9_-.:]+ if I remember), while the latter supports anything that is not a space (\s+). I'd be tempted to go with an HTML4 compatible id to be backward compatible, but would also be ok to relax the rule

kptdobe commented 5 years ago

Good questions ;) I think we should just do it exactly like Github does it. I assume this will answer all the questions, assuming they support each cases. By default, I see they do not prefix the h2 with h1 but could not quickly find a duplicate case.

Also, github.com adds a <a> tag with the id mainly to be able to have it visual and be able to copy/paste the link easily. But default, this link could be hidden.

ramboz commented 5 years ago

@kptdobe, @rofe: so GitHub only uses the <a> because they actually make a visible link on the header so you can easily copy-paste. While this is usually a great pattern for documentation-related websites and wikis, I've rarely seen this used anywhere else (i.e. product pages, news articles, blogs, etc.).

This also is an opinionated way on how to handle anchors that I'm not sure I would like to impose on our customers. And I think this is also in line with @rofe's earlier comment.

I can look into generating the same identifiers that GitHub would do, but I would just leave these as ids on the headings that project teams can adjust according to their needs either in the pre.js or post-load via JS (which is what I've seen most of the time). What do you think?

# Foo

=> <h1 id="foo">Foo</h1> <- the default => <h1><a id="foo" href="#foo">link</a>Foo</h1> <- project version => <a id="foo" href="#foo">link</a><h1>Foo</h1> <- project version etc.

We would propose the first by default, and the project team can decide what to do with it.

ramboz commented 5 years ago

Good idea to do it like GitHub. They actually use some powerful advanced regex in python that properly supports all unicode blocks for foreign languages[0], but we don't have an easy JS equivalent at the moment. I see 2 possible approaches here:

use XRegExp[1] and its support for unicode categories to simplifiy the regexp, but we introduce a depdendency
have an exhaustive regexp in vanilla JS using the unicode blocks that represent valid characters in each script (language). Just to give an idea, the regex for just the Chinese characters is given in [2]

I'm preparing a gist with as many examples as I can think of as test cases in [3].

tripodsan commented 5 years ago

why do you want to keep the unicode letters in the anchor name?

also see https://mathiasbynens.be/notes/es6-unicode-regex

i think that replace(/\W+/ug, “”) would be enough

ramboz commented 5 years ago

why do you want to keep the unicode letters in the anchor name?

The issue was about making it work like the github anchors, so if that is the actual requirement, I don't see how to achieve that result without a regex that fully supports all accented letters, but more importantly all foreign scripts as well. You can check the gist linked above for some examples.

replace(/\W+/ug, “”)

I tried that, as well as various combinations of \p{Script=…} and it didn't work for me. Just simple accented characters are not even properly recognized by \W and just stripped.

ES2018 is supposed to get proper unicode support in regex according to https://github.com/tc39/proposal-regexp-unicode-property-escapes, but until then it seems we are quite limited.

tripodsan commented 5 years ago

i think keeping the non ascii characters is overrated :-) but ok. if this is what we want :/)

ramboz commented 5 years ago

I had an ascii-only version earlier on (1st commit actually)… I'm ok with any approach, really.

Now if we stick to the HTML5 spec, we should probably support unicode identifiers at some point, especially when we start supporting proper localization. Can't really have Chinese markdowns with Chinese headings and expect to transliterate to ascii identifiers I guess.

tripodsan commented 5 years ago

i agree. you still need to ensure that the anchors are unique.

ramboz commented 5 years ago

The algorithm should ensure they are, even have unit tests for it. If there is a collision, I append -1, -2, etc. like github does it, and if you have a collision between the 2nd "Foo" (foo-1) and another header "Foo-1", then the latter becomes foo-1-1 as github does it

koraa commented 5 years ago

Sorry to interrupt, but why not just use lodash's kebabCase? https://lodash.com/docs/4.17.11#kebabCase

EDIT: It does have proper unicode support…

ramboz commented 5 years ago

@koraa That is an excellent suggestion, didn't think about it. I tried playing with it on the gist I referenced and it actually performs really well for our needs.

Here is a comparative table of the various approaches. (I highlighted the differences from GitHub with ⚠️)

Heading	GitHub	XRegExp	KebabCase	Slugger
Latin script	latin-script	latin-script	latin-script	latin-script
simpleAscii	simpleascii	simpleascii	⚠️simple-ascii	⚠️simple-ascii
fo0	fo0	fo0	⚠️fo-0	fo0
bar	bar	bar	bar	bar
baz	baz	baz	baz	baz
qux	qux	qux	qux	qux
baz	baz-1	baz-1	baz-1	baz-1
0corge	0corge	0corge	⚠️0-corge	0corge
baz-1	baz-1-1	baz-1-1	baz-1-1	baz-1-1
Accented characters	accented-characters	accented-characters	accented-characters	accented-characters
Fòô	fòô	fòô	⚠️foo	fòô
Bår	bår	bår	⚠️bar-1	bår
Bãz	bãz	bãz	⚠️baz-2	bãz
Qúx	qúx	qúx	⚠️qux-1	qúx
Çorge	çorge	çorge	⚠️corge	çorge
Special#characters	specialcharacters	specialcharacters	⚠️special-characters	⚠️special-characters
Punctuation - signs	punctuation---signs	punctuation---signs	⚠️punctuation-signs	punctuation---signs
Foo!!	foo	foo	⚠️foo-1	foo
¿¿Bar??	bar-1	bar-1	⚠️bar-2	⚠️¿¿bar
{Baz}	baz-2	baz-2	⚠️baz-3	baz-2
(Qux)	qux-1	qux-1	⚠️qux-2	qux-1
\	corge	corge	⚠️corge-1	corge
-–—Grault—–-	-grault-	-grault-	⚠️grault	-grault-
---GARPLY---	---garply---	---garply---	⚠️garply	---garply---
:;<=>?@
_ForeignScripts	_foreignscripts	_foreignscripts	⚠️foreign-scripts	_foreignscripts
汉字漢字	汉字-漢字	汉字-漢字	汉字-漢字	汉字-漢字
**العربية	العربية	العربية	** العربية	iالعربية
देवनागरी	देवनागरी	⚠️दवनगर	देवनागरी	देवनागरी
বাংলা-অসমীয়া	বাংলা-অসমীয়া	⚠️বল-অসময	বাংলা-অসমীয়া	বাংলা-অসমীয়া
Кириллица	кириллица	кириллица	кириллица	кириллица
かなカナ	かな-カナ	かな-カナ	かな-カナ	かな-カナ
한글 조선글	한글-조선글	한글-조선글	한글-조선글	한글-조선글
తెలుగు	తెలుగు	⚠️తలగ	తెలుగు	తెలుగు
தமிழ்	தமிழ்	⚠️தமழ	தமிழ்	தமிழ்
ગુજરાતી	ગુજરાતી	⚠️ગજરત	ગુજરાતી	ગુજરાતી
ಕನ್ನಡ	ಕನ್ನಡ	⚠️ಕನನಡ	ಕನ್ನಡ	ಕನ್ನಡ
မြန်မာ	မြန်မာ	⚠️မနမ	မြန်မာ	မြန်မာ
മലയാളം	മലയാളം	⚠️മലയള	മലയാളം	മലയാളം
ไทย	ไทย	ไทย	ไทย	ไทย
ਗੁਰਮੁਖੀ	ਗੁਰਮੁਖੀ	⚠️ਗਰਮਖ	ਗੁਰਮੁਖੀ	ਗੁਰਮੁਖੀ
ລາວ	ລາວ	ລາວ	ລາວ	ລາວ
ଉତ୍କଳ	ଉତ୍କଳ	⚠️ଉତକଳ	ଉତ୍କଳ	ଉତ୍କଳ
Ελληνικό	ελληνικό	ελληνικό	ελληνικό	ελληνικό
Ελληνικό	ελληνικό-1	ελληνικό-1	ελληνικό-1	ελληνικό-1
Maths	maths	maths	maths	maths
E ≠ mc²	emc	⚠️e--mc²	⚠️e-≠-mc	⚠️e-≠-mc²
∇² = ±φ	--φ	⚠️²--φ	⚠️∇-φ	⚠️∇²--±φ

While lodash's kebabCase does not exactly behave like GitHub, it does offer an easy way to cover the needs of an identifier. It even adds the capability to transform camel case headings into a --separated identifier. Also offers the benefit of not being something we have to maintain ourselves. So unless we actually have a real need to stay as close as possible to the GitHub behavior, I'd be tempted to go with @koraa's suggestion

ramboz commented 5 years ago

I also notice that the :;<=>?@ heading generates no identifier in any of the solutions. Should I go ahead and introduce a fallback to heading-1 or similar?

koraa commented 5 years ago

I would be interested in meeting the person that uses such a heading, so I say we leave it in case anyone wants to report that bug ;) But great edge case search! I am impressed claps

ramboz commented 5 years ago

@koraa True… I initially thought I could provide a few examples from a mathematician's or physicist's point of view that would write some mathematical formula as heading title (E=mc², ∇² = φ, etc.) but even these actually work with kebabcase (all greek and maths symbols are supported)… so the only edge case I can think of that remains is just a gibberish sequence of special characters which you probably don't need to target anyway with an anchor.

adobe-bot commented 5 years ago

:tada: This issue has been resolved in version 1.5.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

adobe-bot commented 5 years ago

:tada: This issue has been resolved in version 1.5.3 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

adobe-bot commented 5 years ago

:tada: This issue has been resolved in version 2.2.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

adobe / helix-pipeline

Add support for anchors on headings #26