Make it easier to create book indexes in markdown

arthurattwell commented 7 years ago

From @arthurattwell on January 6, 2017 12:12

From @arthurattwell on October 7, 2016 11:36

From @arthurattwell on March 30, 2016 13:53

We want an easier way to create proper book indexes in markdown. Right now, we have to use a complex, onerous, clumsy system of links.

Book indexes and HTML

Traditionally, book indexes are entirely page-based. They are created by professional indexers, who work through a book once its pagination is final. They create a long list of index entries, each with page numbers. A great index is a work of literature in itself.

Because traditional indexes rely entirely on page numbers, they can only be created at the end of the book-production process. After they're created, pagination cannot change without breaking the index (and wasting lots of time and money). And indexes cannot be reused in reflowable ebooks (e.g. Amazon Kindle) where page numbering no longer applies. An ereader's search function is a very poor substitute for a well-crafted index, which might list themes and structure ideas in ways that search can't replicate.

A book index may include:

- section headings (e.g. A, B, C to make long lists easily navigable)
- entries
- subentries
- cross references (e.g. see, see also)

In HTML, a book index should be structured as an unordered list, so that it can be easily formatted with CSS.

Each entry and subentry is followed by one or more locators. While most locators refer to a specific point in text, some refer to a range, 'from point A to point B' (e.g. 'geek 5, 12–16').

In PDF output, locators should be page numbers (in our case, PrinceXML generates page numbers from links during HTML-to-PDF conversion). On screen (ebooks and web), the locators should be a sequence of numbers starting from 1, each linking to a point in the book's text. For example, where a page-based print index might have 'geek 5, 12–16', an ebook might have 'geek 1, 2–3', where the numbers refer only to the order in which geeks appear in the text.

So in our markdown, we will need to do two things:

1. Place anchors in running text, using as few characters as possible. We do not want the anchor syntax to make it hard to read and edit the markdown.
2. Manually structure the index, while benefitting from automation to some degree.

Existing standards

The IDPF has published recommendations on indexes in ebooks.

Using existing syntax

If we build a plugin for this, ideally we should use syntax that has no effect on kramdown processors without the plugin. This might not be possible.

For instance, we can already create an empty anchor tag with ID foo:

[ ](){:#foo}

or we can put an HTML-island anchor tag in the text:

<a id="foo"></a>

But these require a lot of characters, as you can see in our workflow's existing hack for indexes.

We want something much less intrusive to writers, editors and readers of the markdown; something more like markdown's footnote syntax.

Using a variation on footnote syntax

One option is to develop a plugin that offers a variation of markdown's footnote syntax. For instance, where a footnote uses [^1] to place a footnote reference, we might use and reuse [@1] to place an index locator. Index entries would then be defined in a separate list, which constitutes the text of the index. We'd tell the plugin to convert that list by tagging it with something like {:index}:

*   Churchill, Winston @1
*   Lincoln, Abraham @2
*   Mandela, Nelson @3
{:index}

The converter would assign a sequence of IDs to each instance of, say, [@1], and insert links to those after each entry in the index. Each locator link is sequentially numbered. (We can replace those numbers with page numbers on HTML-to-PDF output using Prince and CSS.)
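To make the idea concrete, here is a minimal Python sketch of what such a converter might do. It is not a kramdown plugin: the `idx-` ID scheme and the function names `place_anchors` and `link_entry` are assumptions made up for illustration, and it handles only a flat entry of the form `Label @key`.

```python
import re
from collections import defaultdict

# Matches the proposed locator syntax, e.g. [@1] or [@chur].
LOCATOR = re.compile(r"\[@([\w-]+)\]")

def place_anchors(text):
    """Replace each [@key] with an empty anchor and record the IDs per key."""
    ids = defaultdict(list)

    def repl(match):
        key = match.group(1)
        # Hypothetical ID scheme: idx-<key>-<nth occurrence of that key>.
        anchor_id = f"idx-{key}-{len(ids[key]) + 1}"
        ids[key].append(anchor_id)
        return f'<a id="{anchor_id}"></a>'

    return LOCATOR.sub(repl, text), dict(ids)

def link_entry(entry, ids):
    """Turn 'Churchill, Winston @1' into the label plus numbered locator links."""
    label, _, key = entry.rpartition(" @")
    links = [
        f'<a href="#{anchor_id}">{n}</a>'
        for n, anchor_id in enumerate(ids.get(key, []), start=1)
    ]
    return f"{label} {', '.join(links)}".rstrip()

text = "[@1]Churchill spoke. [@3]Mandela replied. Later, [@1]Churchill wrote."
html, ids = place_anchors(text)
print(html)
print(link_entry("Churchill, Winston @1", ids))
```

Each reuse of `[@1]` gets its own anchor, so the entry for key `1` ends up with two sequentially numbered links.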

In the markdown, this would be very efficient space-wise. But it would require indexers to maintain a long list of entries with numerical references. Once an index gets to several hundred references, this might be difficult.

Footnote references don't have to be numerals, though: [^foo] is also valid. Our locators could likewise use short names, making a long list of entries easier to maintain. E.g.:

*   Churchill, Winston @chur
*   Lincoln, Abraham @linc
*   Mandela, Nelson @mand

A graceful fallback

Using something like [@1], of course, would leave these strings as literals in kramdown instances without our plugin – like GitHub Pages. We might instead use existing footnote syntax, but hijack it with a dedicated character or sequence that is also a valid footnote reference, e.g. [^x-1].

The index then might look like this in markdown:

One famous leader is [^x-chur]Winston Churchill. Others are [^x-mand]Nelson Mandela and [^x-linc]Abraham Lincoln. I like [^x-mand]Mandela most.

[^x-chur]: Churchill, Winston
[^x-linc]: Lincoln, Abraham
[^x-mand]: Mandela, Nelson

Where we're not using the index plugin, these could be hidden with CSS using, for instance:

a.footnote[href*="x-"] {
    display: none;
}
.footnotes li[id*="x-"] {
    display: none;
}

If you run the markdown example above (not the CSS) through a kramdown converter, you'll see that the output is already close to what we need.

That's our graceful fallback. It also means we're close to autogenerating a basic index as a list of terms.

For more complex, manually structured indexes, we would give our converter a separate markdown list to process, where our footnote IDs are turned into HTML links, e.g.:

*   Churchill, Winston #chur
*   Lincoln, Abraham #linc
*   Mandela, Nelson #mand
{:.index}

(Again, you'd hide this list, targeting .index with CSS, where our plugin isn't available and you need a graceful fallback.)
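As a rough illustration of that step, the Python sketch below counts the `[^x-…]` references in the running text and renders the flat index list as an HTML unordered list. Everything here is an assumption for illustration: the function names, the `x-<key>-<n>` link targets (which a real converter would have to match to the anchors it generates), and the restriction to a flat, unnested list.

```python
import re

# Index locators in running text, e.g. [^x-chur]. Pass running text only,
# not the footnote definitions, which use the same bracket syntax.
REF = re.compile(r"\[\^x-([\w-]+)\]")
# A flat list item such as "*   Churchill, Winston #chur".
ITEM = re.compile(r"^\*\s+(?P<label>.+?)\s+#(?P<key>[\w-]+)\s*$")

def count_refs(text):
    """Count how often each index key is referenced in the text."""
    counts = {}
    for key in REF.findall(text):
        counts[key] = counts.get(key, 0) + 1
    return counts

def render_index(list_lines, counts):
    """Render the index list as HTML, one numbered link per reference."""
    items = []
    for line in list_lines:
        match = ITEM.match(line)
        if match is None:
            continue  # skips the {:.index} line and anything unrecognised
        key = match["key"]
        links = ", ".join(
            f'<a href="#x-{key}-{n}">{n}</a>'  # hypothetical target IDs
            for n in range(1, counts.get(key, 0) + 1)
        )
        suffix = f" {links}" if links else ""
        items.append(f"  <li>{match['label']}{suffix}</li>")
    return "\n".join(['<ul class="index">', *items, "</ul>"])

text = (
    "One famous leader is [^x-chur]Winston Churchill. Others are "
    "[^x-mand]Nelson Mandela and [^x-linc]Abraham Lincoln. "
    "I like [^x-mand]Mandela most."
)
index_list = [
    "*   Churchill, Winston #chur",
    "*   Lincoln, Abraham #linc",
    "*   Mandela, Nelson #mand",
    "{:.index}",
]
counts = count_refs(text)
html = render_index(index_list, counts)
print(html)
```

Mandela is referenced twice in the text, so that entry gets two locator links while the others get one each.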

Ranges of locators

The challenge then is how to create ranges of locators (e.g. page ranges). For instance, if the index lists a theme that is covered over a number of pages, how do we specify that the theme is referenced not just at the first point in that passage, but throughout? Traditionally, an index might list 'politicians 12–13'.

For this we need to specify and distinguish between start and end tags. For instance, we might add a hyphen to the end of tag IDs that start a range, and use the tag without a hyphen to end the range:

[^x-poli-]One famous leader is [^x-chur]Winston Churchill. Others are [^x-mand]Nelson Mandela and [^x-linc]Abraham Lincoln.[^x-poli]

*   politicians #poli
    *   Churchill, Winston #chur
    *   Lincoln, Abraham #linc
    *   Mandela, Nelson #mand

The converter would then, for #poli, insert references to both the start and end locators, joined with an en dash: 'politicians 12–13'.
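The pairing logic could look something like the Python sketch below. It assumes the converter already has the tags in reading order together with a locator value for each; the page numbers in `tags` are invented, and `build_locators` is a hypothetical name. A bare tag is treated as the end of a range only if a matching start tag (with the trailing hyphen) is still open, otherwise it is an ordinary point locator.

```python
def build_locators(tags):
    """Map each index key to its locator strings.

    tags: list of (tag_id, page) pairs in reading order, where a tag_id
    ending in "-" opens a range and the same ID without it closes the range.
    """
    open_ranges = {}
    locators = {}
    for tag_id, page in tags:
        if tag_id.endswith("-"):
            # Range start: remember where it opened.
            open_ranges[tag_id[:-1]] = page
        elif tag_id in open_ranges:
            # Range end: join start and end with an en dash.
            start = open_ranges.pop(tag_id)
            locators.setdefault(tag_id, []).append(f"{start}–{page}")
        else:
            # Ordinary point locator.
            locators.setdefault(tag_id, []).append(str(page))
    return locators

# Invented page numbers for the example passage above.
tags = [("poli-", 12), ("chur", 12), ("mand", 12), ("linc", 13), ("poli", 13)]
locators = build_locators(tags)
print(locators["poli"])
```

One consequence of this convention is that an ID can't be used both as a point locator and as a range end while its range is open, which the indexer would need to keep in mind.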

A separate difficulty will be to avoid cases in PDF output where a range of locators falls on one page, resulting in an index entry like 'politicians 12–12'. For that, we may need a separate solution, possibly using JavaScript for Prince that identifies duplicate references in page ranges and replaces them with single page references.
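The clean-up rule itself is simple. The Python sketch below shows only the logic, on plain strings; an actual fix would have to run inside Prince after pagination (in JavaScript, as suggested above), and `collapse_ranges` is a hypothetical name.

```python
import re

# A page range: two numbers joined by an en dash, e.g. 12–16.
RANGE = re.compile(r"\b(\d+)–(\d+)\b")

def collapse_ranges(entry):
    """Replace ranges whose start and end are the same page with one number."""
    def repl(match):
        start, end = match.group(1), match.group(2)
        return start if start == end else match.group(0)

    return RANGE.sub(repl, entry)

print(collapse_ranges("politicians 12–12, 15–19"))
```

Genuine ranges such as '15–19' are left untouched; only ranges that start and end on the same page are reduced.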

Unworkable alternatives

Some syntax looks like it might work, but actually wouldn't. To save us rehashing these approaches I'll note them here.

Anchor-tag-like syntax

One option is to make kramdown turn this into an anchor tag:

{:#foo-1}

This is very close to the existing markdown for adding an inline attribute list (IAL). So if it works and is unlikely to break existing markdown (except in rare cases where someone actually used that pattern as a literal string), we could even submit it as a pull request to kramdown.

While this syntax could be fairly intrusive on running text, it would make it easy for indexers to remember and reuse IDs as they work. (For instance, a hundred pages after last using '{:#nelson-mandela}' they'd still remember it when adding another anchor to the text.)

The indexer would then maintain a separate index using these IDs as values:

One famous leader is {:#churchill-1} Winston Churchill. Others are {:#mandela-1} Nelson Mandela and {:#lincoln-1} Abraham Lincoln.

*   Churchill, Winston: #churchill-1
*   Lincoln, Abraham: #lincoln-1
*   Mandela, Nelson: #mandela-1
{:index}

Note that the locator tag is before the relevant text, not after it. This is because if the locator tag becomes an HTML anchor, a hyperlink pointing to it must take the user to the start of the relevant text, not the end.

The challenge then is how to create ranges of locators (e.g. page ranges). For instance, if the index lists a theme that is covered over a number of pages, how do we specify that the theme is referenced not just at the first point in that passage, but throughout? Traditionally, an index would say 'politicians 12–19'.

For this we need to specify and distinguish between start and end tags. For instance, we might add a hyphen to the end of tag IDs that start a range:

{:#politicians-1-} One famous leader is {:#churchill} Winston Churchill. Others are {:#mandela} Nelson Mandela and {:#lincoln} Abraham Lincoln.{:#politicians-1}

*   politicians: #politicians
*   Churchill, Winston: #churchill
*   Lincoln, Abraham: #lincoln
*   Mandela, Nelson: #mandela
{:index}

Our converter would then add two locator links in the resulting HTML: one pointing to the #politicians-1- anchor, and the second to #politicians-1.

However, this syntax gets clumsy, and requires indexers to track the unique IDs. So this does not seem workable.

A new extension

Kramdown ships with three extension tags: comment, nomarkdown, and options.

We could add an extension for an index location, with similar behaviour to a comment. The main advantage of this approach is that extensions can have a start and an end tag. So tags with no end tag specify a point in the text, whereas those with an end tag create a range.

For instance:

One famous leader is {::index "Churchill, Winston" /}Winston Churchill. Another is {::index "Mandela, Nelson"}Nelson Mandela{:/}.

The main disadvantages of this are that (a) it isn't as brief and non-intrusive as we'd like, and (b) correctly nesting indexed ranges will be tricky. For instance, an entire page might be wrapped in an entry like 'politicians', with single names on that page wrapped in their own entries.

Reference-style links

Kramdown lets you create links in inline style and reference style. It's tempting to use and adapt reference-style links to place index locators. For instance, where a normal reference-style link looks like this:

One famous leader is [Winston Churchill]. Another is [Nelson Mandela].

[Winston Churchill]: https://en.wikipedia.org/wiki/Winston_Churchill
[Nelson Mandela]: https://en.wikipedia.org/wiki/Nelson_Mandela

We might use

One famous leader is [Winston Churchill]. Another is [Nelson Mandela].

{:index}
[Winston Churchill]
[Nelson Mandela]

However, this would need to take into account the fact that some index references are actual links as well. E.g. if you really did want those names to be hyperlinks and to be in the index, you might need:

One famous leader is [Winston Churchill]. Another is [Nelson Mandela].

[Winston Churchill]: https://en.wikipedia.org/wiki/Winston_Churchill
[Nelson Mandela]: https://en.wikipedia.org/wiki/Nelson_Mandela

{:index}
[Winston Churchill]
[Nelson Mandela]

Also, they might overlap, and you can't have links within links:

[One famous leader is [Winston Churchill]](https://en.wikipedia.org/wiki/Winston_Churchill). Another is [Nelson Mandela].

{:index}
[Winston Churchill]
[Nelson Mandela]

The potential clashes with actual links make this unworkable.

Copied from original issue: electricbookworks/electric-book-workflow#9

Copied from original issue: electricbookworks/electric-book#17

Copied from original issue: electricbookworks/electric-book-workflow#38

arthurattwell commented 7 years ago

I'm looking into a way to use kramdown abbr syntax to create links to probably-unique phrases, which can be indexed.

arthurattwell commented 7 years ago

When we clean up an index looking for duplicate page numbers and ranges of page numbers, this thread will be very important. This script may also be valuable.

arthurattwell commented 7 years ago

This approach uses text fragmentions as identifiers.

arthurattwell commented 3 years ago

I've made good progress on this, for discussion in #547.
