Proposal - redesign how Elixir adds links to source code formatted by Pygments

I'm wondering if it wouldn't be a good idea to redesign how Elixir turns identifiers into links in formatted source code. Current approach is kind of hacky and some bugs found in it seem rather hard to fix.

How filters work right now

tokenize-file is a (bash + perl regex) tokenizer that splits a source file into lines. Every other line is (supposed to be) a potential identifier. The rest are supposed to be comments, operators, generally not interesting.
The result of the tokenizer is consumed by query.py. Every other line is looked up in the definitions database. If a match is found, unprintable markings are added at the beginning and end of the line.
Lines are then concatenated back into a string
Filters are used to find all interesting pieces of code, like identifiers and includes, and replace them with links. Filters consist of two parts. The first part runs on unformatted code, the second part runs in the HTML produced by Pygments. Here, the first part is ran and it most cases it replaces matching identifiers with special identifiers generated by filters.
The string is formatted into HTML by Pygments
The second part of filters is ran on formatted HTML, replacing identifiers replaced in original source with HTML.

Some more notes:

Current filters design allows them to only replace sections of source that can be matched by a regex (this applies to both parts of a filter)
ident is used to find and replace identifiers marked by file cmd in query.py in most files. In Kconfig marked identifiers are replaced by kconfigidents, in DTS by dtscompc
tokenize-file is also used in update.py.

Problems with this approach

Filters have to make sure to not break formatting.
Filters have to trust Pygments to not split their special identifiers into unregexable parts.
A single regex (and some bash) is used to tokenize source code for most languages. Although there is a special tokenizer regex, just for DTS. If we want to fix #291, we probably would have to add another lexer there, making the script aware of another language. #306 also needs to be fixed there.
Elixir has to be aware of the programming language in more parts than necessary. Not only in the script and the filters, there is also another hard-coded language-dependent exception in the Python part of the tokenizer.

While of course a some of this could be fixed while leaving the filters design mostly the same, I started thinking if there maybe is a better way to design this whole part.

Alternative design with Pygments

First, something that I want to be sure you noticed - the responsibility of filters seems to be to work around a missing functionality of Pygments. Pygments currently does not allow adding links to selected tokens (except if you generate a ctags file, but that's another hack, please, let's not do that).

My idea is simple: Pygments is split into lexers and formatters - the lexer turns source into a sequence of tokens. The formatter then takes the tokens and turns them into formatted code. Pygments also supports filters - functions that take a sequence of tokens and transform it in some way. This could be leveraged, unfortunately Pygments currently does not allow filters to add link information to tokens.

What I would like to do:

Try to implement and upstream a functionality in Pygments that would allow filters to add certain, formatter dependent, information to tokens. People have been asking for this already (https://github.com/pygments/pygments/issues/1930), the maintainer agrees that this would be useful, and I think I have a design that could lead to something that gets merged.
If my changes get merged, reimplement current double search-and-replace filters as Pygments filters. Tokenizing would then be done by Pygments, adding link information to tokens (including the database lookup) would be contained entirely in filters.

This should lead to less and simpler code - for example, filters won't have to track what they replaced in the unformatted source. Filters will also receive more information which in the future could allow them to, for example, track context to supply links with more information. And, less regex (in Elixir at least).

Possible problems

A problem I see so far is that Pygments treats each macro line as a single token. Macros are easily recognizable because they use a separate token type. So I think this could be handled by a filter that would lex all macros into smaller tokens and supply them with identifier information. I realize that this hits into some of my points, but I still think this approach is better - most code would be handled by the cleaner codepath. And also, support for identifiers in macros is also already very hacky at best - maybe this would allow us to improve something, since the job of recognizing macros will now be done for us.

First of all, that's all great stuff. Intuitively, this looks like the right direction forward.

Pygments currently does not allow adding links to selected tokens (except if you generate a ctags file, but that's another hack, please, let's not do that).

Can you expand on that? If pygments exposes an interface to add links to tokens (using a specific format, ctags or whatever), then, naively, it sounds like what we need?

From the threads (pygments/pygments#1930 and pygments/pygments#1091), people mostly ask about generating anchors that match the definition/reference name. That is a bit different from our usecase that is:

web.py: to have our own code generate custom HTML (only <a> tags?) for each symbol.
update.py: to generate a list of symbols.

Is the second usecase already supported? This seems pretty unrelated to link injection. Will we be parsing pygments output tree to find our symbols list?

Could we contribute a change to pygments for it to try to handle the body of macros?

Pygments currently does not allow adding links to selected tokens (except if you generate a ctags file, but that's another hack, please, let's not do that).

Can you expand on that? If pygments exposes an interface to add links to tokens (using a specific format, ctags or whatever), then, naively, it sounds like what we need?

See tagsfile here.

When tagsfile is set to the path of a ctags index file, it is used to generate hyperlinks from names to their definition. (...). The python-ctags module from PyPI must be installed to use this feature; otherwise a RuntimeError will be raised.

We would have to generate a ctags index file with information about references in the processed file (and then save it somewhere in the filesystem, but that probably could be worked around, maybe it would be smarter if we would do that from update.py). I think there is a reason why ctags is not used for finding references, and that reason probably is references in macros. If that's the case, then we would have to generate that file, or use ctags and extend what it generates with new information. Or extend ctags to handle all references that are interesting to us. It's generating ctags files that I consider hacky, unless we use ctags of course.

From the threads (https://github.com/pygments/pygments/issues/1930 and https://github.com/pygments/pygments/issues/1091), people mostly ask about generating anchors that match the definition/reference name

I think both issues mention both anchors and links? But regardless, if we have a usecase, then I don't think existing issues matter that much, unless there are any against this feature. And also, adding a functionality that allows users to add links/anchor information to any tokens would cover a more specific usecase of "generating anchors that match the definition/reference name".

update.py: to generate a list of symbols.

Is the second usecase already supported? This seems pretty unrelated to link injection. Will we be parsing pygments output tree to find our symbols list?

Yes, we would use token list generated by Pygments to find references in the code. We could alternatively stay with existing tokenizers just for update.py, but I just don't see the point.

Could we contribute a change to pygments for it to try to handle the body of macros?

Maybe, splitting macros into more tokens shouldn't affect formatting, it would just generate more tokens.

See tagsfile here.

Ah yes. And it doesn't give a way to configure what the links should look like, it is only describing generating anchors (ie putting title attributes). It is not a solution for our need.

I think both issues mention both anchors and links? But regardless, if we have a usecase, then I don't think existing issues matter that much, unless there are any against this feature. And also, adding a functionality that allows users to add links/anchor information to any tokens would cover a more specific usecase of "generating anchors that match the definition/reference name".

Yes, we'll see what the maintainers think of it. Indeed our need is more generic and so should fill the void that the issues want.

update.py: to generate a list of symbols. Is the second usecase already supported? This seems pretty unrelated to link injection. Will we be parsing pygments output tree to find our symbols list?

Yes, we would use token list generated by Pygments to find references in the code. We could alternatively stay with existing tokenizers just for update.py, but I just don't see the point.

The nice thing is that working on this does not require any pygments patch.

Could we contribute a change to pygments for it to try to handle the body of macros?

Maybe, splitting macros into more tokens shouldn't affect formatting, it would just generate more tokens.

It would (1) give us our desired symbols inside macros and (2) maybe improve syntax highlighting for the (probable) majority of macros that can be parsed. That requires a really lean lexer however, I don't know what pygments internals look like.

See tagsfile here.

Ah yes. And it doesn't give a way to configure what the links should look like, it is only describing generating anchors (ie putting title attributes). It is not a solution for our need.

You actually can configure link format, see tagurlformat. So maybe we could consider ctags indexes. But it seems that ctags itself (still) does not fully support indexing references. Or maybe I did something wrong, I didn't try hard enough to be sure, see ctags docs. ctags -o - --extras=+g+r --fields=+r drivers/accessibility/braille/braille_console.c only adds includes to output without +r. But still, if ctags support for references is not satisfying, Elixir would have to generate ctags indexes itself, and I don't like that approach.

Yes, we would use token list generated by Pygments to find references in the code. We could alternatively stay with existing tokenizers just for update.py, but I just don't see the point.

The nice thing is that working on this does not require any pygments patch.

Maybe it would actually be a good idea to start there, to see if updates that use Pygments lexers to find references give satisfying results.

It would (1) give us our desired symbols inside macros and (2) maybe improve syntax highlighting for the (probable) majority of macros that can be parsed. That requires a really lean lexer however, I don't know what pygments internals look like.

I think we agree, I skipped a few steps there - what I meant is that I think that if a lexer would generate more tokens for a macro, as long as nothing was skipped and token types were correct, highlighting wouldn't change (and yes, someone could use these new tokens to improve macro highlighting).

bootlin / elixir