Variables are not considered tokens?

jameshfisher commented 4 years ago

Information

Language: seemingly all languages?
Plugins: none

Does the problem still occur in the latest version of Prism? Yes

Description

Variables are not considered tokens?

To reproduce, highlight the following code as language-c:

int x = 5;

This results in:

<code class="  language-c">
<span class="token keyword">int</span>
 x 
<span class="token operator">=</span>
<span class="token number">5</span>
<span class="token punctuation">;</span>
</code>

Strangely, the variable x is not wrapped in a token, e.g. <span class="token variable">x</span>. This makes Prism useless for my requirements, because I just want to highlight variables.

The same behavior happens for many other languages.

jameshfisher commented 4 years ago

Compare, for example, the Rouge highlighter, which parses the above example as

<code id="parse_code">
<span class="kt">int</span> 
<span class="n">x</span> 
<span class="o">=</span> 
<span class="mi">5</span>
<span class="p">;</span>
</code>

That is, variables are given the token class n.

jameshfisher commented 4 years ago

Note the same issue is present in highlight.js: https://github.com/highlightjs/highlight.js/issues/2839

RunDevelopment commented 4 years ago

Yes, we usually don't tokenize variables because it's not meaningful. In your example, x is clearly declared as a variable, but that isn't clear where x is used. Depending on where in the code x is used, it might be a class, struct, typedef, variable, macro, field, or something else I forgot. Since we have to be able to highlight small code snippets without much context, we can't just assume that x will always be one thing. (Tokenizing it as a variable only when we are certain isn't a good strategy either, because it would lead to inconsistent highlighting.)

(This is actually a really complex problem that, in some languages, can't be solved without access to the full source code. E.g. in C++, you can't know what the names in foo::baz::baz() refer to. They might be namespaces, classes, enums, fields, functions, etc. It's impossible to know without the full source code and, because of macros and templates, without processing that source code: we'd basically have to implement a "C++ compiler lite" in JS.)

It also doesn't make sense for us to add a (language-specific) identifier token. You can generally assume that everything that isn't tokenized by Prism is a variable, struct/class field, or similar. So if you needed to highlight all of those, you can set the default color/style of your theme to the color/style you want for variables and overwrite it for all other tokens. Maybe this will work for you?
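To make the "everything that isn't tokenized" idea concrete, here is a minimal sketch. The stream below is hand-written to mirror the highlighted output of int x = 5; shown earlier; its exact shape and the helper name untokenizedText are assumptions for illustration, not output captured from Prism.

```javascript
// Hand-written stand-in for a token stream (an assumption for illustration):
// plain strings are untokenized text, objects are tokens.
const stream = [
  { type: 'keyword', content: 'int' },
  ' x ',
  { type: 'operator', content: '=' },
  ' ',
  { type: 'number', content: '5' },
  { type: 'punctuation', content: ';' },
];

// Hypothetical helper: keep only the untokenized pieces that contain
// something other than whitespace.
function untokenizedText(tokens) {
  return tokens.filter((t) => typeof t === 'string' && t.trim() !== '');
}

console.log(untokenizedText(stream)); // [ ' x ' ]
```

Note that the leftover piece is ' x ', not 'x': the default style also applies to the surrounding whitespace, which is invisible for a color but matters for any style that marks whitespace too.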

If your requirement was to highlight variables accurately (e.g. give static and local variables a different color), then I'm sorry to say that Prism isn't powerful enough to do that. In this case, you will need a full parser.

That being said, what exactly is your requirement? Maybe it's still possible to do with Prism.

jameshfisher commented 4 years ago

Hey, thanks for your thoughts - and you're right about much of the complexity in identifying exactly what something is. However, I would say that in foo::baz::baz(), there are three "identifier" tokens. What exact domain object they "identify" (e.g. classes, functions, etc) is not for Prism to solve. At the level of syntax highlighting, they're just (language-generic) identifier tokens.
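As a rough illustration of that purely lexical view, the sketch below pulls the identifier tokens out of foo::baz::baz() with a regular expression. The pattern and the identifiers helper are assumptions for illustration (a simple C-style identifier shape), not anything Prism provides.

```javascript
// Hypothetical C-style identifier pattern (an assumption, not part of Prism):
// a letter or underscore followed by word characters.
const IDENTIFIER = /\b[a-z_]\w*\b/gi;

// Return every identifier-shaped substring, in order of appearance.
function identifiers(source) {
  return source.match(IDENTIFIER) || [];
}

console.log(identifiers('foo::baz::baz()')); // [ 'foo', 'baz', 'baz' ]
```

The sketch says nothing about what each name refers to. It would also report keywords such as int, so a real highlighter would have to match keywords first.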

I'm trying to convert my site from Jekyll to Eleventy. In Jekyll, I'm using the Rouge highlighter, which does identify identifiers/names/variables. I am using this for some minimal "highlighting", copying the K&R printed style in which identifiers are underlined. See an example page: https://jameshfisher.com/2016/12/08/c-array-decaying/

RunDevelopment commented 4 years ago

Yeah, the above workaround won't work for underlining.

Tokenizing all identifiers is an interesting idea but I don't think that Prism will do this any time soon. We'd have to change >200 languages and there's also the question of whether we should give already tokenized identifiers (e.g. functions) an alias. Honestly, that's quite a bit of work for not much gain, IMO.

I'm sorry to say that you won't be able to use Prism for your use case.

At least not with vanilla Prism. You could add your own tokens to Prism's C language definition:

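// Load Prism and its C language definition before running this.
// The new pattern is appended last, so it only applies to text that no earlier token has claimed.
Prism.languages.c.identifier = /\b[a-z_]\w*\b/i;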

Modifying language definitions is supported. It can occasionally cause problems with other language definitions, but usually it won't. Even so, you will most likely be better off with a syntax highlighter that supports all of this out of the box.
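To show why appending the pattern last matters, here is a simplified, standalone sketch of an ordered grammar. The tokenize function and the tiny grammar are hypothetical stand-ins written for this example; they are not Prism's implementation or its real C definition.

```javascript
// Simplified stand-in for an ordered grammar; NOT Prism's real implementation.
// Patterns are tried in insertion order, and each one only sees text that
// earlier patterns left untokenized.
function tokenize(text, grammar) {
  let parts = [text];
  for (const [type, pattern] of Object.entries(grammar)) {
    const next = [];
    for (const part of parts) {
      if (typeof part !== 'string') {
        next.push(part); // already a token, leave it alone
        continue;
      }
      const re = new RegExp(pattern.source, pattern.flags + 'g');
      let last = 0;
      for (const m of part.matchAll(re)) {
        if (m.index > last) next.push(part.slice(last, m.index));
        next.push({ type, content: m[0] });
        last = m.index + m[0].length;
      }
      if (last < part.length) next.push(part.slice(last));
    }
    parts = next;
  }
  return parts;
}

// Tiny hypothetical grammar, just enough for the example from this issue.
const grammar = {
  keyword: /\bint\b/,
  number: /\b\d+\b/,
  operator: /=/,
  punctuation: /;/,
};
// Appended last, as in the snippet above.
grammar.identifier = /\b[a-z_]\w*\b/i;

const tokens = tokenize('int x = 5;', grammar);
console.log(tokens.filter((t) => typeof t !== 'string').map((t) => t.type));
// [ 'keyword', 'identifier', 'operator', 'number', 'punctuation' ]
```

Because identifier comes last, int stays a keyword even though it also matches the identifier pattern. Placing the pattern first would instead turn int into an identifier.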

jameshfisher commented 4 years ago

@RunDevelopment Thanks for your help 👍 I agree Prism won't help me for this. In the short term I'm going to abandon syntax highlighting as it wasn't doing much for me anyway.

Some day I plan to implement my own highlighter that semantically highlights variables, in the style of https://evanbrooks.info/syntax-highlight/v2/. But that's a story for another day!

RunDevelopment commented 4 years ago

Interesting idea for highlighting. While totally out-of-scope for Prism, I can see this being useful if done well.

I'll close this issue now.

PrismJS / prism

Variables are not considered tokens? #2625