getzola / zola

A fast static site generator in a single binary with everything built-in. https://www.getzola.org
https://www.getzola.org
MIT License
13.4k stars 938 forks source link

`linenos` syntax highlighting is a semantic mess #2586

Open clarfonthey opened 1 month ago

clarfonthey commented 1 month ago

Bug Report

Environment

Zola version: 0.19.1

Expected Behavior

Adding line numbers should not substantially affect the semantic markup of a code block. In particular, it shouldn't substantially affect the accessibility tree and should "just work" when the user highlights the code.

Current Behavior

Right now, all code blocks output the code within a <pre> tag, then a <code> tag. If linenos is not added, then it simply outputs the code as expected with <span>s wrapping around syntax highlighting elements-- these <spans> lack semantics and are effectively ignored from the accessibility tree, and without any extra CSS, they do nothing, meaning that the code blocks will always work as expected, and highlighting works correctly.

However, one curious thing is worth mentioning: when you use highlight_theme = "css" instead of any predefined highlighting theme, every single line is guaranteed to be wrapped in an extra span, effectively ensuring that each line can be individually marked with CSS. If you wanted to, for example, highlight the lines with alternating stripes, you could do this with CSS. It's understandable that this isn't part of the default theme, however, since generally, you want this to be part of the theme instead of a separate style.

The output changes completely when line numbers are enabled via linenos. Instead of outputting the code directly, it outputs it in a table with the line numbers on the left column and the code on the right column. This changes the semantic meaning of the code block in the accessibility tree, meaning it is read differently from screen readers.

In addition to the table affecting the accessibility tree, it also affects selecting and copying the text. The default recommendation is to add user-select: none, which prevents the lines from being included in any copy-pasting, but it still results in a lot of excess newlines in the copied output due to the fact that each line is terminated both by a newline in the <pre> block and by the end of a table row.

Solution Sketch

This is just one solution I came up with that doesn't involve a whole lot of changes, but I don't want to be too prescriptive.

Essentially, the highlighter should always wrap lines in a separate tag regardless of output mode, so that CSS can be certain that pre > code > * always refers to a line. If the line is marked, this will be a <mark> tag, but if not, it'll just be a <span>.

Then, instead of outputting line numbers in a table cell, they're added as data attributes to the tag for the line. So, for example, a highlighted line 2 would be indicated with <mark data-lineno="2">. This has the downside that the line numbers require additional CSS to be printed, with the upside that this results in them being excluded from user selection by default.

I decided to make some mockups of how this would look on a page, using a few options depending on browser support:

CSS subgrids (~89% coverage)

<style>
pre[data-linenos] {
    white-space: normal;
}
pre[data-linenos] > code {
    display: grid;
    grid-template-columns: min-content 1fr;
}
pre > code > * {
    white-space: pre;
    display: grid;
    grid-column: span 2;
    grid-template-columns: subgrid;
    grid-template-rows: subgrid;
}
pre > code > *::before {
    content: attr(data-lineno);
    padding-right: 1ch;
    user-select: none;
}
</style>
<pre data-linenos><code><mark data-lineno=1>let code = true;</mark>
<span data-lineno=20>let other_code = false;</span></pre>

This is the ideal way of displaying the line numbers IMHO, with the caveat that it's not possible on a few older browsers due to lack of subgrid support. In particular, Chrome only got support in late 2023 and Safari only got support in 2022, and the combined usage from both old versions of Chrome + old iOS users accounts for almost all of the ~10% of usage.

Excess padding

<style>
pre > code > * {
    display: block;
}
pre > code > *::before {
    content: attr(data-lineno);
    display: inline-block;
    min-width: 6ch;
    padding-right: 1ch;
    user-select: none;
    text-align: right;
}
</style>
<pre data-linenos><code><mark data-lineno=1>let code = true;</mark>
<span data-lineno=20>let other_code = false;</span></pre>

This is the compromise position: just make the line number a large enough width that it's definitely big enough. Most people will be totally fine with this compromise.

Summary

Note that there are a few common things in my solutions worth pointing out:

  1. In order to ensure that extra lines aren't added due to the presence of table rows or block elements, the pre element must have white-space: normal and the individual lines have the white-space: pre instead. This ensures that the newlines aren't double-counted.
  2. The ch unit is technically not supported by IE and super-old Webkit, which does account for a nontrivial amount of people. However, in my case it's only used for sizing in a way that scales with font size and using some other unit is fine.
  3. content: attr() is pretty much universally supported by browsers, so, no worries there.

It's also worth noting that this formulation also lets us include line numbers on a variable basis: a user could make their own dynamic elements which toggle on and off the line numbers without affecting the surrounding content. We could even decide to output data-lineno by default and then just let the user only show them if data-linenos is added to the pre tag.

Keats commented 1 month ago

The advantage of the current approach is that it kinda works without any CSS, which was one of the goal. I don't know if that's a huge deal or not in practice.

clarfonthey commented 1 month ago

I mean, that's fair; my opinion is that the more important failure state is being able to copy and paste the code correctly by default. I'll poke around and see what the precedence is for line number marking on different systems and see whether my solution actually has any precedent to it.

Keats commented 1 month ago

It does seem to copy ok except for a blank line between each line, is that the issee you're seeing?

clarfonthey commented 1 month ago

That is one of them: the solution to this is to style the individual td as white-space: pre and the outer pre as white-space: normal; since the table rows implicitly add a newline, they're duplicated by the pre and have to be removed.

Although honestly, the more important issue is the way it's exposed to the accessibility tree, and read out to screen readers. I asked a few people about this and the general impression was that in all cases, a table is a bad way to interact with code, and although line numbers generally get in the way for screen readers, the <span> representation I suggest is an okay alternative. It also appears to be the way Firefox does its source view, so, that's at least some precedent for it.

Maybe if you want to preserve the styling by default, you could output a <span class="lineno"> at the beginning of each line instead of relying on CSS to emit it? You could even make them line up properly by emitting the right amount of whitespace after them, so that folks don't need to rely on CSS to get the table layout right. (Although extra CSS will probably be desired anyway.) This seems like the best compromise since it also allows toggling the line numbers with CSS, which I'm sure some people might like to add.

Keats commented 1 month ago

Let's imagine we can change the HTML and require some CSS changes. What would be the best representation after your research? The one or something similar?

clarfonthey commented 1 month ago

So far, my preference leans toward just emitting the line numbers inside a <span> at the beginning of the line, with them properly aligned using extra whitespace. You want to right-align the numbers so they are all the same width, but ensure an extra space is between the line number and the code.

If the user is explicitly outputting inline CSS for their theme, we should just add in a user-select: none style on the span directly for the line numbers, otherwise we should emit it as part of CSS themes. It should probably always have a class, though.

This feels like the best way to make it work with minimal CSS, while still allowing it to be as flexible as possible for styling. Also, you can remove the whitespace alignment and realign it yourself by marking the line numbers as white-space: normal, since that would remove the whitespace before them.

I have been mostly sitting on this but I'm okay writing up the code/docs changes for this. Would probably want to be a bit careful and add an option to keep the old generation for a bit, since folks would have to migrate their stylesheets.

Since zola is technically pre-1.0, this could just be a regular breaking change, but it feels like it'd be nicest to offer that option.

Keats commented 1 month ago

I think it would be fine to do as a breaking change as long as we have a big breaking change notice in the changelog if they have to make a change anyway to their site

clarfonthey commented 1 month ago

In that case, I'll try and set up a change this weekend that just converts over the representation, along with the relevant CSS changes to the docs & recommendations.

clarfonthey commented 1 month ago

Realising another big issue here is the way syntect handles line formatting: technically, splitting lines apart is unsupported by the current syntaxes, since syntax-formatting spans can stay opened across lines. May add in some custom code to just, close tags and then reopen them at the next line to make it work, but going to be slightly more complicated than I thought.

Probably another reason to want to move away from syntect tbqh, but for now I'll just try and make it work.

Keats commented 1 month ago

If it's too much work, we can always delay it until we swap.

clarfonthey commented 1 month ago

I'm not actually sure if this problem would be necessarily fixed in tree-sitter either; it might just have better support for tracking these differences across lines since it's also designed to work in editors.

Shouldn't be too hard to track spans being opened and closed anyway, just needs some extra code.