PrismJS / prism

Lightweight, robust, elegant syntax highlighting.
https://prismjs.com
MIT License

Document all token types #2849

Open RunDevelopment opened 3 years ago

RunDevelopment commented 3 years ago

Motivation

Themes depend on Prism producing tokens with specific types (or aliases). Right now, we do not guarantee or document any of those types.

Description

Document all standard token types (e.g. keyword, comment, ...). It should explain the general concept behind each token type and give at least one example.

The documentation should also include how languages are embedded. Example.

We should also guarantee that languages always use these token types for these concepts. (E.g. we guarantee that keywords always have a keyword class.) It might sound like we already do this, but this is not the case right now. (E.g. we have many languages with operator keywords (e.g. NOT in SQL) that do not have a keyword class.)

joshgoebel commented 3 years ago

(E.g. we have many languages with operator keywords (e.g. not in Python) that do not have a keyword class.)

https://github.com/PrismJS/prism/blob/master/components/prism-python.js#L54

What am I missing or not understanding about that statement? Looks like not is a keyword to me?

And this is a great idea, BTW.

RunDevelopment commented 3 years ago

Looks like not is a keyword to me?

Oops. Wrong language. I meant SQL. Thanks for pointing that out!

joshgoebel commented 3 years ago

SQL has a keyword class also... but not is an operator there. Are you talking about resolving this inconsistency between grammars?

https://github.com/PrismJS/prism/blob/master/components/prism-sql.js#L19

RunDevelopment commented 3 years ago

Are you talking about resolving this inconsistency between grammars?

Yes, that too. My main intention was to enable themes to decide whether operator keywords are to be highlighted as keywords or operators. Right now, Prism languages make that decision (and that pretty inconsistently as you pointed out) by assigning them one type (either keyword or operator). In the case of operator keywords, the best solution would probably be to use an alias for keywords, so we get a CSS ~class name~ selector of .keyword.operator.

joshgoebel commented 3 years ago

.keyword.operator

How would you encode that to HTML then? Nested tags? I dislike that because it makes operator ambiguous on its own; you then have to use something like code > operator to scope it to the top level, but for us that doesn't even work because operators don't always have to be top-scope... we might legitimately have an operator inside of some other scope (other than keyword)... so then the order of the CSS becomes super important. And even then you could see breakage because, say, .operator defines a background while .keyword.operator does not.

Is that a bug? A feature?

Basically it seems nested tags makes everything a lot more error prone. At least that was my thinking.

We're going thru this same thing right now and I think I've decided against nesting and am instead going to flatten scopes to either .keyword--operator or .keyword\.operator or something (literal .). But very curious to hear your thoughts.

Ref: https://github.com/highlightjs/highlight.js/issues/2521 https://github.com/highlightjs/highlight.js/issues/2500

joshgoebel commented 3 years ago

In my ideal world the top scope is general and the lower scopes are specific. So I'd expect operator.keyword to be styled as operator if there is no specific styling for operator.keyword but that's not true for keyword.operator which is a WHOLLY different beast than operator... of course all of this is in the naming and how you set up these structures. :)

RunDevelopment commented 3 years ago

How would you encode that to HTML then?

<span class="token operator keyword">NOT</span>

With .operator.keyword, I meant the CSS selector for the generated elements. My bad.

Nested tags?

No. I don't want

<span class="token operator"><span class="token keyword">NOT</span></span>

It's a lot harder to make specific styles like this, and both the generated HTML and the Prism language definitions generating the HTML will become unnecessarily bloated.
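
To make the flat encoding concrete, here is a minimal TypeScript sketch of rendering one token that carries several labels into a single span. The `LabeledToken` shape and the `renderLabeledToken` and `escapeHtml` helpers are hypothetical names made up for illustration; this is not Prism's actual API.

```typescript
// Hypothetical token shape for illustration; not Prism's internal API.
interface LabeledToken {
  type: string;      // main token type, e.g. "operator"
  aliases: string[]; // additional labels, e.g. ["keyword"]
  content: string;   // source text of the token
}

// Escape the characters that are unsafe inside HTML text content.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Render one flat <span> whose class list carries every label.
function renderLabeledToken(token: LabeledToken): string {
  const classes = ["token", token.type].concat(token.aliases);
  return '<span class="' + classes.join(" ") + '">' + escapeHtml(token.content) + "</span>";
}

console.log(renderLabeledToken({ type: "operator", aliases: ["keyword"], content: "NOT" }));
// prints: <span class="token operator keyword">NOT</span>
```

Because all labels end up on one element, a theme can target `.token.operator`, `.token.keyword`, or `.token.operator.keyword` without any nesting.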

We're going thru this same thing right now

I read through it and I think I'm aiming for the same thing.

I like the "labels" approach (option 2). I.e. NOT in SQL is both an operator and a keyword. This labeling approach has a few advantages:

However, labels force themes to "resolve conflicts". If more than one rule applies for a token, the theme has to decide how these rules interact. This is good because it gives themes all the freedom but also all the responsibility. This means that theme authors have to be careful because every rule might interact with every other rule. This won't be a problem if the theme only sets colors but it will get tricky for themes that have rules with background colors, font styles, text decorations, and so on.

I think that this is the better approach: It's easy to implement, extendible, and clearly separates the concerns of tokenization (= understanding what code means) and styling (translating code meaning into colors).

joshgoebel commented 3 years ago

With .operator.keyword, I meant the CSS selector for the generated elements.

Yeah I think I understood, I was just asking for clarification to make 100% sure.

the theme has to decide how these rules interact

Sadly I always have to keep in mind 100 existing themes that I'd really like to hand edit as little as possible. :-)


<span class="operator keyword">NOT</span>

We're still talking different things then. What I was proposing was:

<span class="operator operator.keyword">NOT</span>

Let's switch from SQL to a more common pattern that I think illustrates the difference better.

class Render {}

We've traditionally rendered this as:

<span class="class"><span class="keyword">class</span> <span class="title">Render</span></span> {}

Which has been targeted with CSS like .class .title... (a title inside a class). But we are trying to get away from that (the deep nesting of scopes). So we're switching to something more like:

<span class="keyword">class</span> <span class="title.class">Render</span> {}

And now I'm seeing my own point disappear before my eyes... I guess Render is technically both a title and a class, though for us class has always had a different meaning, so just doing it your way would require rethinking a bunch of existing CSS... Let me look at my other examples... ah, here we go:

This is more how I think about scopes (TextMate style)...

The sub scope refines the main scope - it is not a full scope in its own right. While class can make sense on its own in many cases the sub scopes do not... such as block and line - or at least their meaning would NOT at all be semantically clear if you had a block or line that wasn't a comment... you'd have a WHOLLY different thing and would likely create some weird unintended interactions with your CSS.

Your approach would seem to make conflicts of sub scopes and top-level scopes more likely, no?

Thoughts?

RunDevelopment commented 3 years ago

Let me start by saying that there is a problem with "sub-scopes" in textmate scopes: you have to impose a hierarchy.

Sub scopes necessitate a hierarchy because scopes are an ordered list of names. This is a problem as there may not be a hierarchy between sub scopes.

E.g. Let's say we have a Rust doc comment (e.g. /// doc). Top level is comment with the sub scopes line, triple-slash, and documentation. But how do we order the sub scopes? triple-slash should be a sub scope of line but there is no natural hierarchy/order between documentation and line. Should it be comment.line.triple-slash.documentation or comment.documentation.line.triple-slash? Further, let's say in a previous version we only had comment.line.triple-slash or only comment.documentation, then how will we add the new sub scope (documentation or line.triple-slash respectively) in a way that is backward compatible? We are forced to append new sub scopes even if it doesn't make sense, right?

The problem is that scopes are a list but the logical hierarchy is at least a tree. The tree for the Rust example above is:

comment
|-- documentation
`-- line
    `-- triple-slash

But a tree of names is 1) a lot more complex and 2) translates even worse to CSS class names than scopes.
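
The list-versus-tree mismatch can be made concrete with a small TypeScript sketch. It enumerates every flat scope string that respects the parent-before-child order of the tree above. `ScopeNode` and `linearizations` are hypothetical names used only for this illustration.

```typescript
// Hypothetical tree node for illustration.
interface ScopeNode {
  name: string;
  children: ScopeNode[];
}

// Enumerate every dotted scope string in which each node appears after its parent.
function linearizations(root: ScopeNode): string[] {
  const results: string[] = [];
  function walk(prefix: string[], available: ScopeNode[]): void {
    if (available.length === 0) {
      results.push(prefix.join("."));
      return;
    }
    for (let i = 0; i < available.length; i++) {
      const node = available[i];
      // Remove the chosen node and make its children available next.
      const rest = available.slice(0, i).concat(available.slice(i + 1), node.children);
      walk(prefix.concat([node.name]), rest);
    }
  }
  walk([], [root]);
  return results;
}

const rustDocComment: ScopeNode = {
  name: "comment",
  children: [
    { name: "documentation", children: [] },
    { name: "line", children: [{ name: "triple-slash", children: [] }] },
  ],
};

console.log(linearizations(rustDocComment));
// prints three equally valid orderings:
//   comment.documentation.line.triple-slash
//   comment.line.documentation.triple-slash
//   comment.line.triple-slash.documentation
```

None of the three orderings is more correct than the others, which is exactly the ambiguity a list-based scope has to paper over.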


Back to the label approach:

their meaning would NOT at all be semantically clear if you had a block or line that wasn't a comment...

I see what you mean. A style rule for just block tokens just doesn't make much sense without more context (a block of what?).

However, is that really a problem?

It doesn't make sense, so why would anyone do it? I mean you can have the same nonsense in textmate too, right? Isn't just block a valid scope selector as well?

Your approach would seem to make conflicts of sub scopes and top-level scopes more likely, no?

Yes, there will be a bunch of conflicts but we will use CSS to take care of them.

Let's look at a .operator.keyword element again. What do the rules for these labels actually look like?

.token.keyword { color: blue; }
.token.operator { color: red; }

Q: How does CSS resolve this conflict? A: CSS uses rule order. If two rules in the same stylesheet match the same element and have the same specificity, the rule defined last (top to bottom order) will be used. In this example, the element will get the color red.

Adding more specific rules is easy as well.

.token.keyword { color: blue; }
.token.operator { color: red; }
.token.operator.keyword { color: yellow; } /* works as expected */

To make theme authors aware that rule order is important, I suggest sorting rules by ascending specificity (as seen in the example above). Within a specificity level, rule order matters so authors have to decide which rules override which by bringing them into the right order.
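
As a rough model of the resolution described above, the following TypeScript sketch picks the winning rule for class-only selectors: more classes means higher specificity, and ties go to the rule defined last. It deliberately ignores everything else in the CSS cascade (IDs, element selectors, `!important`, origins). `ClassRule` and `resolveColor` are hypothetical names.

```typescript
// A class-only rule such as `.token.operator { color: red; }`.
interface ClassRule {
  classes: string[];
  color: string;
}

// Return the color of the winning rule, or null if no rule matches.
function resolveColor(elementClasses: string[], rules: ClassRule[]): string | null {
  let winner: ClassRule | null = null;
  for (let i = 0; i < rules.length; i++) {
    const rule = rules[i];
    let matches = true;
    for (let j = 0; j < rule.classes.length; j++) {
      if (elementClasses.indexOf(rule.classes[j]) === -1) {
        matches = false;
      }
    }
    if (!matches) {
      continue;
    }
    // ">=" lets a later rule of equal specificity override an earlier one.
    if (winner === null || rule.classes.length >= winner.classes.length) {
      winner = rule;
    }
  }
  return winner === null ? null : winner.color;
}

const element = ["token", "operator", "keyword"];
const themeRules: ClassRule[] = [
  { classes: ["token", "keyword"], color: "blue" },
  { classes: ["token", "operator"], color: "red" },
];

console.log(resolveColor(element, themeRules));
// prints: red (same specificity, later rule wins)

console.log(resolveColor(element, themeRules.concat([
  { classes: ["token", "operator", "keyword"], color: "yellow" },
])));
// prints: yellow (higher specificity wins)
```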

Thoughts?

joshgoebel commented 3 years ago

You're persuading me a bit, but I'll answer your points and see if you find anything compelling. :)

Should it be comment.line.triple-slash.documentation or comment.documentation.line.triple-slash?

For Highlight.js I'm not much interested in more than 2 scopes... so this isn't a huge problem for us I don't think (juggling 3-4 scopes and ordering). We've gotten by with a single scope (plus nesting that's typically 1 level deep) for 10+ years so it seems like 2 scopes should be more than enough to get the job done. So that simplifies matters considerably.

And for backwards compatibility we're probably adding to our existing scopes... so in many cases we already know which scope comes first... it's only a matter of adding some specificity... is it just a number? or a number.ip_address, etc...

But to actually answer your question there are 10 years of prior art with TextMate grammars in the real world... someone else has likely already solved this ordering problem, no? I suppose you might hate their answer, but I'm sure you could find it and adopt it - I doubt there is a need to reinvent the wheel here. So I'm not sure I would agree that that is a large or unsolvable problem - since TM clearly solved it 10 years ago. :-)

Isn't just block a valid scope selector as well?

I dunno, lol... I mean anything CAN be a scope in TextMate... it's just up to styles to colorize them... and I'd say looking at the most popular styles would give you some idea of what the "canonical" scopes are. It's not listed on the page that I linked you to. (other than comment.block)


Your CSS examples do look pretty though. :) Let me skim thru the TM document again.

joshgoebel commented 3 years ago

Would working together to come up with a set of common scopes across the two different engines make any sense at all or does that defeat the purpose of having different engines? :-)

RunDevelopment commented 3 years ago

Don't worry, I don't hate TM scopes. Especially their scope selectors elegantly solve a lot of problems. They aren't perfect (see example above) but they are very good.

The main problem I have with TM is that scope selectors just don't translate well into CSS selectors.

VSCode

That being said, I came across this blog post from the VSCode dev team. It describes how they transitioned to their current TM-based highlighter and what they had before. The interesting part is that they used to use the label approach.

E.g. the following TM scopes

meta.function.js.definition.punctuation.block

were turned into the following HTML

<span class="token meta function js definition punctuation block">{</span>

They summarized this approach for TM themes as follows:

What we were doing was plain wrong and "approximate" is a very nice word for it :).

We would then leave it up to CSS to match the "approximated" scopes with the "approximated" rules. But the CSS matching rules are different from the TextMate selector matching rules, especially when it comes to ranking. CSS ranking is based on the number of class names matched, while TextMate selector ranking has clear rules regarding scope specificity.

That's why TextMate themes in VS Code would look OK, but never quite like their authors intended. Sometimes, the differences would be small, but sometimes these differences would completely change the feel of a theme.

Basically, the label approach is mostly incompatible with TM themes. This is something to keep in mind and is relevant for #2848.

Approaches

What follows is a list of discussed approaches and how they relate to TM scopes (assuming no parent scopes). I also added a hybrid approach into the mix. All approaches assume that we don't know anything about the CSS theme at runtime.

(I will use _s to separate different names within a CSS class. I don't want to use .s because this makes writing the CSS examples harder.)

TM scope approach

A single TM scope can be expressed as CSS classes like this:

a.b.c.d
<span class="a a_b a_b_c a_b_c_d">{</span>
/* Rule order matters because they all have the same specificity */
.a { color: red; }
.a_b { color: blue; }
.a_b_c { color: green; }
.a_b_c_d { color: yellow; }
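
A minimal TypeScript sketch of this construction, assuming `_` as the separator as in the example above. `tmScopeToClasses` is a hypothetical helper name.

```typescript
// Turn "a.b.c.d" into one class per scope prefix: "a a_b a_b_c a_b_c_d".
function tmScopeToClasses(scope: string): string {
  const names = scope.split(".");
  const classes: string[] = [];
  for (let i = 0; i < names.length; i++) {
    classes.push(names.slice(0, i + 1).join("_"));
  }
  return classes.join(" ");
}

console.log(tmScopeToClasses("a.b.c.d"));
// prints: a a_b a_b_c a_b_c_d
```

Every name except the last is repeated in each longer prefix, so the class attribute grows quadratically with scope depth.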

Advantages:

Disadvantages:

This is what you outlined in highlightjs/highlight.js#2521, right @joshgoebel?

Label approach

(I only list this here for completeness)

The label approach sees scopes as a set of strings (TS definition: type LabelScope = Set<string>). CSS classes will be constructed like this:

{a,b,c,d}
<span class="a b c d">{</span>
/* Rule order doesn't matter in this example */
.a { color: red; }
.a.b { color: blue; }
.a.b.c { color: green; }
.a.b.c.d { color: yellow; }
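
A minimal TypeScript sketch of the label construction. It uses a plain array instead of `Set<string>` and sorts the labels so that the same set always yields the same output; both choices are assumptions of this sketch, and the helper names are hypothetical.

```typescript
// Remove duplicates and sort, so label order never matters.
function normalizeLabels(labels: string[]): string[] {
  const unique: string[] = [];
  for (let i = 0; i < labels.length; i++) {
    if (unique.indexOf(labels[i]) === -1) {
      unique.push(labels[i]);
    }
  }
  return unique.sort();
}

// Value for the HTML class attribute, e.g. "a b c d".
function labelsToClassAttribute(labels: string[]): string {
  return normalizeLabels(labels).join(" ");
}

// CSS selector matching elements that carry all given labels, e.g. ".a.b".
function labelsToSelector(labels: string[]): string {
  return "." + normalizeLabels(labels).join(".");
}

console.log(labelsToClassAttribute(["d", "b", "a", "c", "a"]));
// prints: a b c d
console.log(labelsToSelector(["operator", "keyword"]));
// prints: .keyword.operator
```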

Advantages:

Disadvantages:

Hybrid approach

Thinking about how Prism does things again, maybe we could use a hybrid approach?

Instead of saying that a scope is a list of names, how about we say: A scope is a two-element tuple where the first element is the main scope and the second element is a set of sub scopes. (TS definition: type HybridScope = [string, Set<string>])

CSS classes will be constructed like this:

a.{b,c,d}
<span class="a _b _c _d">{</span>
/* Rule order doesn't matter in this example */
.a { color: red; }
.a._b { color: blue; }
.a._b._c { color: green; }
.a._b._c._d { color: yellow; }

(_ is a prefix used to identify sub scopes.)
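
A minimal TypeScript sketch of the hybrid construction. It models the sub scopes as an array rather than a `Set<string>` to stay dependency-free; `HybridScopeSketch` and `hybridToClasses` are hypothetical names.

```typescript
// [main scope, sub scopes]; an array stands in for the set here.
type HybridScopeSketch = [string, string[]];

// Main scope stays unprefixed; every sub scope gets the "_" prefix.
function hybridToClasses(scope: HybridScopeSketch): string {
  const main = scope[0];
  const subs = scope[1];
  const classes: string[] = [main];
  for (let i = 0; i < subs.length; i++) {
    classes.push("_" + subs[i]);
  }
  return classes.join(" ");
}

console.log(hybridToClasses(["a", ["b", "c", "d"]]));
// prints: a _b _c _d
```

The prefix keeps a sub scope like `_block` from ever matching a rule written for a main scope named `block`.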

Advantages:

Disadvantages:

Conflicts

Conflicts occur if the selector of two CSS rules matches the same element. All approaches will cause more or less rule conflicts.

It is important to note that all conflicts can be resolved using CSS specificity and rule order. However, this will likely require a lot of care by the theme author.


Is there anything I missed? Are there any other approaches that might be interesting?

I'd be interested to hear your thoughts @joshgoebel.

RunDevelopment commented 3 years ago

I thought of yet another approach. It's equivalent to the TM scope approach but eliminates some of its problems.

TM scope prefix approach

a.b.c.d
<span class="a _b __c ___d">{</span>
/* Rule order doesn't matter in this example */
.a { color: red; }
.a._b { color: blue; }
.a._b.__c { color: green; }
.a._b.__c.___d { color: yellow; }

The trick is to encode the position of each name using a prefix (e.g. no prefix = index 0, one prefix char = index 1, two prefix chars = index 2, and so on). This basically takes care of the differences between TM scope selectors and CSS selectors as far as I can see.

Repeating the same character might seem a bit wasteful but it should be fine in practice. Scope names aren't single letter names after all.
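
A minimal TypeScript sketch of the position-encoding trick, assuming `_` as the prefix character as in the example. `tmScopeToPrefixedClasses` and `prefixedSelector` are hypothetical helper names.

```typescript
// Encode each name's index as a number of leading underscores.
function tmScopeToPrefixedClasses(scope: string): string {
  const names = scope.split(".");
  const classes: string[] = [];
  let prefix = "";
  for (let i = 0; i < names.length; i++) {
    classes.push(prefix + names[i]);
    prefix += "_";
  }
  return classes.join(" ");
}

// Build the CSS selector that matches a scope and all of its refinements.
function prefixedSelector(scope: string): string {
  return "." + tmScopeToPrefixedClasses(scope).split(" ").join(".");
}

console.log(tmScopeToPrefixedClasses("a.b.c.d"));
// prints: a _b __c ___d
console.log(prefixedSelector("a.b"));
// prints: .a._b
```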

Advantages:

Disadvantages:

Apart from the limitations of TM scopes themselves, I think this accurately translates TM scopes into CSS classes.

Keep in mind that this doesn't cover parent scopes. They are still a problem, I think.

joshgoebel commented 3 years ago

Oh wow you've put so much thought into this. Let me come along at a bit of a higher level and see if I can add anything useful.

TM scope approach

I just realized I have experience with this. This is exactly what Pastie did ~15 years ago using TextMate grammars on the server-side. There was a nice TM plugin that converted themes to CSS and it did so using this strategy. A simple case:

/* entity.name.section */
pre.textmate-source .entity_name_section {

This works well for "multi-scope" matchers as well:

Something can be scoped as BOTH a meta.tag AND an entity. (Recall that in TM things can have nested scopes so any piece of content can have numerous scopes)

pre.textmate-source .meta_tag .entity {

And TextMate even had a plugin that pasted to us raw HTML... and in those cases it would just upload an HTML payload (gzipped)... so Pastie could even render content for grammars that we knew nothing about - because it just rendered a cached copy of the parsing that TM had already done on the client-side... (yet we always used our CSS)

The theme fidelity of this approach was always 100% in my recollection.

Duplication of almost all name

Is this truly a problem though? This content is computer generated. Now 15 years ago because Pastie streamed HTML "over the wire" I did add a single filter for punctuation... TM loves to scope its punctuation. For C-like code it just gets ridiculous with 100s of kb of HTML just for commas, squiggly brackets, etc... so when we parsed raw text we'd drop any scope with punctuation:

print "<span class='#{name_to_class(name)}'>" unless name=~/punctuation/

But the TM generated content still included those things and I don't recall it ever being THAT much of a problem...

TM scope prefix approach

This would seem to be CLOSE to the former, but slightly worse to me. And (see above) I'm not sure the "space saving" is a benefit that matters. It adds a degree of fuzz or ambiguity with regard to sub scopes. ___d can be used/paired in different ways. The actual TM scopes are string.unquoted and string.quoted but now you open up possibilities like link.quoted or line.unquoted. This is either quite confusing or encourages creativity. (see my notes on label approach)

Label approach

I've implemented this for Highlight.js (changed a few lines) and it's not terrible (though I have very little experience with it so far and am only using it for a few things). I had to reorder some of our CSS (for legacy issues) but I was trying to rework our docs when I realized how this can get muddled. We have an example:

  var GOOGLE = https://www.google.com/

I was trying to explain that while link for us is technically a "markup" tag that if a language had links as first-order types like this that it could make sense to use link to scope the URL there. But then I added a caveat that theme authors probably don't think this way. Often the more "specialized" scopes (markup, diff, HTML) almost end up creating their own sub-themes within some themes. IE, someone is making sure "code" looks good and making sure "markup" looks good, but not imagining markup MIXED with code. So link might look out of place, so then I suggested string (since it will definitely be considered in the aesthetic of code)...

But then I thought why not string.link... or perhaps it's a link.string... are the theme authors thinking of any of this? Are the theme authors talking to the grammar authors? Is it easier to say:

OR

Hybrid approach

I don't see a huge technical distinction here between labels... I realized that while I was implementing labels I was almost thinking of them in my head more like this anyways, but without the naming distinction (b vs _b.) (which is why link.string vs string.link threw me for a loop - because those are BOTH parent scopes in HLJS as we have things now)

So essentially you have parent scopes, and then any sub-scopes are tags... they don't exist in a hierarchy. IE: string.(utf8.unquoted) The only reason to give them a prefix (_) is to prevent the sub scopes from conflicting with the parent scope (right?), which I guess is something...

I honestly think (on its own merits) there is a lot to like about this approach - from a simplicity perspective... It does have some of the "endless creativity" issues I mentioned above, but now restricted to sub-scopes, which I think would be far more manageable.

But it's taking quite a step away from TextMate scoping...


If it sounds like I strongly prefer TM scopes that's inaccurate. I think at this point it seems TM vs hybrid is very much about one's goals. Hybrid seems clearly better (organizationally) than Labels because it enforces SOME hierarchy and that's good for theme consistency and development and understanding the domain. You may also want to spend some time thinking about theme compatibility and see if that changes your thinking on any of this. For example... you say you want character to be its own thing... so one day you make a herculean effort to update ALL the grammars to add this distinction.

Super duper. Now characters are broken in every old theme that is only aware of strings.

I've been considering automated processing for our CSS files (unless they are tagged as "leave me alone") to "fill in" gaps like this... say if we added character (which we don't have) then I'd add a processing rule to COPY the style of string to the style of character... and if a theme author comes along and wants to change this then they have to patch their theme to add the missing selector... I was imagining an empty rule for this: .hljs-character {} to say "I know about character, but I choose not to style it".
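
The gap-filling idea could look roughly like the TypeScript sketch below. It models a theme as a flat list of selector/declaration pairs, which is a simplification and not how Highlight.js stores or processes themes. `ThemeRule` and `fillThemeGap` are hypothetical names.

```typescript
// Simplified theme rule: one selector and its raw declarations.
interface ThemeRule {
  selector: string;
  declarations: string; // "" models an empty rule such as `.hljs-character {}`
}

// If the theme does not mention newSelector at all, copy the fallback's style.
// An existing rule (even an empty one) counts as a deliberate choice and is kept.
function fillThemeGap(theme: ThemeRule[], newSelector: string, fallbackSelector: string): ThemeRule[] {
  let fallback: ThemeRule | null = null;
  for (let i = 0; i < theme.length; i++) {
    if (theme[i].selector === newSelector) {
      return theme.slice(); // theme already knows about the new scope
    }
    if (theme[i].selector === fallbackSelector) {
      fallback = theme[i];
    }
  }
  if (fallback === null) {
    return theme.slice(); // nothing to copy from
  }
  return theme.concat([{ selector: newSelector, declarations: fallback.declarations }]);
}

const oldTheme: ThemeRule[] = [{ selector: ".hljs-string", declarations: "color: green;" }];
console.log(fillThemeGap(oldTheme, ".hljs-character", ".hljs-string"));
// the result now also contains { selector: ".hljs-character", declarations: "color: green;" }
```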

Thoughts? :-)

joshgoebel commented 3 years ago

I'm not sure you really desire scopes 4 levels deep for Prism, do you? Is some of this discussion in the abstract - or are you really wanting to move towards a TM level of nuance in your highlighting? I think I always intended to limit it to parent.child for HLJS 1st party grammars honestly, yet all this talk of labels vs hybrid is making me imagine a slightly wider field... Even if you go with labels or hybrid I still feel like you'll have to have a "common/blessed" list of conventions that MOST grammars/themes follow... just to allow cross-usability of all themes with all grammars.

So I think that in practice hybrid would never really be:

a (b,c,d,e,f,g)

But most likely ever only:

So unless you're truly trying to copy TM theme fidelity you'd never have such deep scopes in practice. To me that also makes the "but we're repeating the names multiple times" disadvantage less of a big deal.

joshgoebel commented 3 years ago

Perhaps useful:

joshgoebel commented 3 years ago

I think for the v11 development cycle I'm going to try "Scope Prefix" but moving the _ to the end, ie:

A title.class (name of a class):

<span class="hljs-title class_">State</span>
.hljs-title.class_ {
  color: purple;
}
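
A minimal TypeScript sketch of that mapping, covering only the two-level `parent.child` case shown above. How deeper scopes would be encoded is not stated here, so the sketch ignores anything past the second name. `scopeToSuffixedClasses` is a hypothetical helper name, not a Highlight.js function.

```typescript
// Map "title.class" to "hljs-title class_" (parent gets the hljs- prefix,
// the child gets a trailing underscore). Only two levels are handled.
function scopeToSuffixedClasses(scope: string): string {
  const parts = scope.split(".");
  const classes: string[] = ["hljs-" + parts[0]];
  if (parts.length > 1) {
    classes.push(parts[1] + "_");
  }
  return classes.join(" ");
}

console.log(scopeToSuffixedClasses("title.class"));
// prints: hljs-title class_
console.log(scopeToSuffixedClasses("keyword"));
// prints: hljs-keyword
```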

joshgoebel commented 3 years ago

It also offers us some protection against collision

Do you all never run into collision issues with conflicting class names outside your hierarchy in Prism.js?

joshgoebel commented 3 years ago

A related item (at least for us.. I'm not sure how granular Prism can be or wants to be)... given the following:

var a = "holiday";

The string has 3 scopes:

But is that:

Our engine allows for both pretty easily so we need to decide which is recommended/canonical. I'd say the latter is a bit more CSS like... and we've traditionally always highlighted the whole item as a "string"... so updating a grammar to highlight only the begin/end pairing is 1 or 2 lines of code vs switching to an entirely new syntax if one wants to do this same thing with multi-match by giving all 3 items individual scopes.

Although it's also possible we soon add a middleScope (to scope just the middle of a match - what is between begin and end) in which case updating would again be trivial either way we went.

But if there are any reasons to prefer the "flatter" approach...

hoonweiting commented 3 years ago

Hi! I'd like to help document token types, especially since it'll help a lot with theme creation. I'll admit I'm not as familiar with this repo as I am with prism-themes, but I think I can find my way around enough to write these docs.

Just a few questions at this point:

  1. Where should this documentation go (e.g. the website itself, or a markdown doc of its own)?
  2. Is this issue similar to #2083 (which happens to mention that token type guidelines should be published on the website, hence answering the previous question)?
  3. I'm not super sure what is meant by "The documentation should also include how languages are embedded." I will also admit that I don't fully catch the above discussion!
  4. Should fulfilling the token types guarantee for every language be contained in the same PR? I foresee it taking much more time if that's the case.

RunDevelopment commented 3 years ago

Thank you very much @hoonweiting!

  1. I think making it a new page on our website would be best.
  2. No. #2083 is about how language definitions should tokenize text and assign token names. However, it is related in that the token documentation will be the basis for #2083.
  3. Prism supports embedded languages where one language contains code from a different language. E.g. CSS in HTML, JS in HTML, Bash in Shell-session, and CSS in JS. To support language-specific styles, all embedded languages are wrapped in their own token which includes a CSS class language-<embedded language>. This fact should be mentioned.
  4. No, that would be too much at once. We can implement this gradually after laying out the plan.
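
The embedded-language wrapper from point 3 can be sketched in TypeScript as follows. This only illustrates the `language-<embedded language>` class convention described above; the exact markup Prism emits may contain additional wrapper tokens, and `wrapEmbedded` is a hypothetical helper name.

```typescript
// Wrap already-highlighted HTML of an embedded language in its own token.
function wrapEmbedded(language: string, innerHtml: string): string {
  return '<span class="token language-' + language + '">' + innerHtml + "</span>";
}

const cssInsideHtml = wrapEmbedded("css", '<span class="token property">color</span>');
console.log(cssInsideHtml);
// prints: <span class="token language-css"><span class="token property">color</span></span>
```

With that wrapper in place, a theme can style CSS properties inside HTML differently from properties elsewhere by using a selector such as `.language-css .token.property`.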

If you have any other questions or need help, feel free to ask any time!

hoonweiting commented 3 years ago

I've started work on it, but I would like some input/help at this point!

Is the file name tokens.html suitable? Currently it contains the sections "Standard tokens" and "Embedded languages" (though maybe the latter should be a subsection of the former). It makes some sense, to me at least, for the token type guidelines to have its section on this page too, since it's an element of tokens, but I'm not sure whether that would make the page too lengthy!

(On a slightly related note, is Prism open to larger website design inputs/PRs? (Not so much of an overhaul, more of a makeover.) It's something I can and would like to help with, though it'll have a longer runway compared to this doc, for instance.)

Which tokens should be considered standard tokens? Initially I thought I'd write about all the tokens that I included in prism-theme-template.css, but it occurred to me that they might not all be standard tokens! For example, bold and italic probably aren't found in a majority of languages, but they're still part of the 'core' tokens (if we consider all the tokens in the official themes to form the 'core' tokens). For ease of reference, I'm copying all of them here, and bolding those that I think should be considered a standard token:

comment, boolean, number, char, string, url, regex, punctuation, constant, variable, property, operator, keyword, builtin, class-name, function, inserted, deleted, bold, italic, important, prolog, doctype, cdata, namespace, tag, selector, attr-name, attr-value, atrule, entity.

Also, do you think there are more tokens that should be considered standard tokens that aren't listed at all? Random sampling on the FAQ page probably isn't the best way, haha.


Finally, this is very unrelated, but is 'font-matter' in line 35 supposed to be 'front-matter'? https://github.com/PrismJS/prism/blob/8daebb4ab936c60a17f8e35f558e294ee6869974/components/prism-markdown.js#L29-L41

I don't think it matters too much but yeah, just something I saw while poking around.

RunDevelopment commented 3 years ago

Is the file name tokens.html suitable?

Yes, sounds good.

I'm not sure whether that would make the page too lengthy!

The section of embedded languages can stay. It's important.

Which tokens should be considered standard tokens?

I think all token names you listed should be standard tokens. Details:

While all of these tokens are specific to certain language types, I do think that themes should support all of them and language definition authors should be aware of them.

do you think there are more tokens that should be considered standard tokens that aren't listed at all?

No, the list is long enough as is for now. We can also add more later.


On a slightly related note, is Prism open to larger website design inputs/PRs? (Not so much of an overhaul, more of a makeover.)

First of all, yes. We know that the website is outdated and want to update it or make a new one but we haven't had the time. We would greatly appreciate your help!

But also, a little no. It's a little tricky to say yes right now because we want to start with v2.0 soon and will likely greatly restructure the project in the process. I don't want you to start working on this only for us to change the whole project causing you additional work.

Could you please open an issue for this and tag all Prism maintainers?


Finally, this is very unrelated, but is 'font-matter' in line 35 supposed to be 'front-matter'?

Yep, I made a typo. Thanks for noticing! I'll fix it.

hoonweiting commented 3 years ago

Thanks @RunDevelopment!

I have started writing some 'definitions' down, and started adding/copying some examples too. I think the definitions need a lot of work, you'll probably see what I mean when I submit a PR! In terms of examples, I'm mostly grabbing them from JS, Python, HTML, CSS, and Markdown, though the odd example will need to be taken from another language, like Diff and LESS. I'm not sure if this is the way to go. I mean, it reduces the number of languages that will need to be loaded, but could this be too minimalistic?

Also, I'm having a bit of trouble with char. I thought it was a little more common, but I could only find it in about seven languages (Ada, Eiffel, Elm, Haskell, Idris, PureScript, Rust, maybe there are more). However, I did find a few more languages with a character token, some aliased it with string (Racket, Reason, Scheme, Smalltalk, might have missed some), but I think there's at least one (Rip) that doesn't alias character to anything. Is there some sort of difference between char and character? I kind of assumed they all refer to unicode characters or something (I googled a little), but I've not had to deal with characters myself yet, only strings, so I could be wrong.

RunDevelopment commented 3 years ago

I'm not sure if this is the way to go. I mean, it reduces the number of languages that will need to be loaded, but could this be too minimalistic?

Don't worry about that. Load as many languages as you need.

I recommend using Autoloader, it will load all necessary languages for you. Also, Autoloader won't block the page when loading, so you really don't have to worry about load times.

I'm having a bit of trouble with char

Yeah, we combine strings and chars in a lot of languages.

I think the best way going forward is to document char as the token for characters and then change all languages that don't follow this later. Consistently using char is the way to go IMO.

A lot of people (myself included) probably didn't use char because they didn't know that it was a standard token, so they used string or character.

(If you need an example language for char, I suggest using Rust.)

I kind of assumed they all refer to unicode characters or something

Depends on the language. Many modern languages (e.g. Rust, Swift, Go) use Unicode characters (21 bits), languages with UTF-16 strings (e.g. Java, C#) typically use UTF-16 char codes (16 bits), and in many older languages (e.g. C) a char is simply a byte (8 bits).

That being said, they all try to implement the same concept (more or less), so the language-specific implementation shouldn't matter in terms of highlighting.
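To make the size difference concrete, here is a small JavaScript sketch. JavaScript strings are UTF-16, so a character outside the Basic Multilingual Plane takes two 16-bit code units while still being a single code point. Nothing here is Prism-specific; the variable names are just for illustration.

```javascript
// U+1F600 lies outside the Basic Multilingual Plane.
const smiley = '\u{1F600}';

// UTF-16 view: two 16-bit code units (a surrogate pair).
const utf16Units = smiley.length;

// Code point view: iterating a string yields whole code points.
const codePoints = [...smiley].length;

// UTF-8 view: the same character takes four bytes.
const utf8Bytes = new TextEncoder().encode(smiley).length;

console.log(utf16Units, codePoints, utf8Bytes); // 2 1 4
```

All three views describe the same "character", which is why the highlighter can treat them as one concept.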

hoonweiting commented 3 years ago

Also, Autoloader won't block the page when loading, so you really don't have to worry about load times.

Ahh okay thanks! I was more worried about the client needing to download more information for each language I use, so I was trying to min-max, in a way. Then again, perhaps I have lost sight of what a 'large' webpage is, and that a couple of kB is really nothing! Plus, I am fortunate to live in a country with relatively fast Internet speeds, so it's even harder to tell. But I am assured now!

I think the best way going forward is to document char as the token for characters and then change all languages that don't follow this later.

Got it! I'm currently using Elm, but I'm sure it doesn't hurt to add more examples later on.

Also, I see that Elm supports '\u{0000}' and similar as a valid Char, but Prism doesn't. I think this is straightforward enough for me to add in the regex, would it be alright if I send in a PR for that?
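Roughly what I have in mind, as a sketch only: the pattern below is hypothetical and is not the regex currently in prism-elm.js, and the list of escapes it accepts is my assumption about Elm's syntax.

```javascript
// Hypothetical pattern for an Elm character literal; NOT the regex in prism-elm.js.
// Assumed escapes: \n \r \t \\ \' \" and \u{...} with 1-6 hex digits.
const elmChar = /'(?:[^\\'\r\n]|\\(?:[nrt\\'"]|u\{[0-9a-fA-F]{1,6}\}))'/;

// Anchor the pattern so that the whole input must be one literal.
const isCharLiteral = (source) =>
	new RegExp('^' + elmChar.source + '$').test(source);

console.log(isCharLiteral("'a'"));         // true
console.log(isCharLiteral("'\\n'"));       // true
console.log(isCharLiteral("'\\u{0000}'")); // true
console.log(isCharLiteral("'ab'"));        // false
```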

Depends on the language. Many modern languages (e.g. Rust, Swift, Go) use Unicode characters (21 bits), languages with UTF-16 strings (e.g. Java, C#) typically use UTF-16 char codes (16 bits), and in many older languages (e.g. C) a char is simply a byte (8 bits).

Wow! TIL. Thank you!

RunDevelopment commented 3 years ago

would it be alright if I send in a PR for that?

Of course! Thank you!

hoonweiting commented 2 years ago

Hey @RunDevelopment! I was wondering how I could help out with the second half of this issue. For starters I could probably swap out character for char, but that's probably just the tip of the iceberg.

  1. Prism has a lot of languages, so I guess that the most efficient way of doing this is to go through languages one by one? (And perhaps for consistency, one PR per language that needs changes?)

  2. Suppose a token name gets swapped out for an equivalent standard token. Should the old token be left as an alias, or removed completely? I'm thinking more along the lines of swapping out character for char for example, which are basically the same thing.

  3. If a language uses a specific term that isn't a standard token name (eg. function-definition), but there is a standard token name that is semantically similar (eg. function), should the standard token name be added as an alias if not already included?

Huh, I guess that's all the questions I have for now, maybe I'll think of more eventually. And oh yeah, this is probably going to take a while, so how should we go about tracking the progress? This would be especially helpful if more people want to pitch in too!

RunDevelopment commented 2 years ago

Thank you for the offer!

  1. One PR per language would be nice.

  2. That is a very good question. I'm generally okay with swapping out names. E.g. character → char is okay. However, there might be cases where it isn't as clear-cut. We can talk about those in the tracking issue as we find them.

  3. It depends on the language. In most cases (I think?), standard-token-name aliases should be included, yes. However, some languages use non-standard names without aliases to allow opt-in styles by themes.

    Due to Prism's technical limitations, there are some cases where we get the semantic meaning of a token right 80% of the time. This can cause very inconsistent highlighting in some languages, so we mostly choose to not highlight these cases and there are 2 ways to get no highlighting:

    1. Don't tokenize it. Nobody gets highlighting.
    2. Give it a non-standard token name with no standard aliases. People that are ok with false positives/negatives can customize their themes and opt-in to highlighting these non-standard tokens.

    The problem is finding out whether aliases were forgotten or intentionally omitted.

    You probably have to go through the file history and read the PR and commit comments... Ouch. You can also ask me. I was (and am) involved in many of Prism's languages, so I will likely know the history for some.
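To make the alias question concrete, here is a rough sketch of how an alias decides which CSS classes a token ends up with. The grammar fragment, its token name `maybe-type`, and both patterns are made up for illustration; `classNames` only imitates the way the class list is put together from `token`, the type, and any aliases.

```javascript
// Made-up grammar fragment; token names and patterns are for illustration only.
const grammar = {
	// Non-standard name with a standard alias: themes style it as a function.
	'function-definition': { pattern: /\b\w+(?=\s*=)/, alias: 'function' },
	// Non-standard name without an alias: unstyled unless a theme opts in.
	'maybe-type': { pattern: /\b[A-Z]\w*\b/ }
};

// Rough imitation of how a token's class list is assembled.
function classNames(type, alias) {
	const aliases = alias === undefined ? [] : [].concat(alias);
	return ['token', type, ...aliases].join(' ');
}

for (const [type, { alias }] of Object.entries(grammar)) {
	console.log(classNames(type, alias));
}
// token function-definition function
// token maybe-type
```

A theme that only defines `.token.function` colors the first token and leaves the second alone, which is the opt-in behavior described above.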

And oh yeah, this is probably going to take a while, so how should we go about tracking the progress?

Good idea. Could you make a tracking issue? Just a simple task list like this should be enough.

```md
- [ ] abap
- [ ] abnf
- [ ] actionscript
- [ ] ada
- [ ] agda
- [ ] al
- [ ] antlr4
- [ ] apacheconf
- [ ] apex
- [ ] apl
- [ ] applescript
- [ ] aql
- [ ] arduino
- [ ] arff
- [ ] asciidoc
- [ ] asm6502
- [ ] asmatmel
- [ ] aspnet
- [ ] autohotkey
- [ ] autoit
- [ ] avisynth
- [ ] avro-idl
- [ ] bash
- [ ] basic
- [ ] batch
- [ ] bbcode
- [ ] bicep
- [ ] birb
- [ ] bison
- [ ] bnf
- [ ] brainfuck
- [ ] brightscript
- [ ] bro
- [ ] bsl
- [ ] c
- [ ] cfscript
- [ ] chaiscript
- [ ] cil
- [ ] clike
- [ ] clojure
- [ ] cmake
- [ ] cobol
- [ ] coffeescript
- [ ] concurnas
- [ ] coq
- [ ] cpp
- [ ] crystal
- [ ] csharp
- [ ] cshtml
- [ ] csp
- [ ] css
- [ ] css-extras
- [ ] csv
- [ ] cypher
- [ ] d
- [ ] dart
- [ ] dataweave
- [ ] dax
- [ ] dhall
- [ ] diff
- [ ] django
- [ ] dns-zone-file
- [ ] docker
- [ ] dot
- [ ] ebnf
- [ ] editorconfig
- [ ] eiffel
- [ ] ejs
- [ ] elixir
- [ ] elm
- [ ] erb
- [ ] erlang
- [ ] etlua
- [ ] excel-formula
- [ ] factor
- [ ] false
- [ ] firestore-security-rules
- [ ] flow
- [ ] fortran
- [ ] fsharp
- [ ] ftl
- [ ] gap
- [ ] gcode
- [ ] gdscript
- [ ] gedcom
- [ ] gherkin
- [ ] git
- [ ] glsl
- [ ] gml
- [ ] gn
- [ ] go
- [ ] graphql
- [ ] groovy
- [ ] haml
- [ ] handlebars
- [ ] haskell
- [ ] haxe
- [ ] hcl
- [ ] hlsl
- [ ] hoon
- [ ] hpkp
- [ ] hsts
- [ ] http
- [ ] ichigojam
- [ ] icon
- [ ] icu-message-format
- [ ] idris
- [ ] iecst
- [ ] ignore
- [ ] inform7
- [ ] ini
- [ ] io
- [ ] j
- [ ] java
- [ ] javadoc
- [ ] javadoclike
- [ ] javascript
- [ ] javastacktrace
- [ ] jexl
- [ ] jolie
- [ ] jq
- [ ] js-extras
- [ ] js-templates
- [ ] jsdoc
- [ ] json
- [ ] json5
- [ ] jsonp
- [ ] jsstacktrace
- [ ] jsx
- [ ] julia
- [ ] keepalived
- [ ] keyman
- [ ] kotlin
- [ ] kumir
- [ ] kusto
- [ ] latex
- [ ] latte
- [ ] less
- [ ] lilypond
- [ ] liquid
- [ ] lisp
- [ ] livescript
- [ ] llvm
- [ ] log
- [ ] lolcode
- [ ] lua
- [ ] magma
- [ ] makefile
- [ ] markdown
- [ ] markup
- [ ] markup-templating
- [ ] matlab
- [ ] maxscript
- [ ] mel
- [ ] mermaid
- [ ] mizar
- [ ] mongodb
- [ ] monkey
- [ ] moonscript
- [ ] n1ql
- [ ] n4js
- [ ] nand2tetris-hdl
- [ ] naniscript
- [ ] nasm
- [ ] neon
- [ ] nevod
- [ ] nginx
- [ ] nim
- [ ] nix
- [ ] nsis
- [ ] objectivec
- [ ] ocaml
- [ ] opencl
- [ ] openqasm
- [ ] oz
- [ ] parigp
- [ ] parser
- [ ] pascal
- [ ] pascaligo
- [ ] pcaxis
- [ ] peoplecode
- [ ] perl
- [ ] php
- [ ] php-extras
- [ ] phpdoc
- [ ] plsql
- [ ] powerquery
- [ ] powershell
- [ ] processing
- [ ] prolog
- [ ] promql
- [ ] properties
- [ ] protobuf
- [ ] psl
- [ ] pug
- [ ] puppet
- [ ] pure
- [ ] purebasic
- [ ] purescript
- [ ] python
- [ ] q
- [ ] qml
- [ ] qore
- [ ] qsharp
- [ ] r
- [ ] racket
- [ ] reason
- [ ] regex
- [ ] rego
- [ ] renpy
- [ ] rest
- [ ] rip
- [ ] roboconf
- [ ] robotframework
- [ ] ruby
- [ ] rust
- [ ] sas
- [ ] sass
- [ ] scala
- [ ] scheme
- [ ] scss
- [ ] shell-session
- [ ] smali
- [ ] smalltalk
- [ ] smarty
- [ ] sml
- [ ] solidity
- [ ] solution-file
- [ ] soy
- [ ] sparql
- [ ] splunk-spl
- [ ] sqf
- [ ] sql
- [ ] squirrel
- [ ] stan
- [ ] stylus
- [ ] swift
- [ ] systemd
- [ ] t4-cs
- [ ] t4-templating
- [ ] t4-vb
- [ ] tap
- [ ] tcl
- [ ] textile
- [ ] toml
- [ ] tremor
- [ ] tsx
- [ ] tt2
- [ ] turtle
- [ ] twig
- [ ] typescript
- [ ] typoscript
- [ ] unrealscript
- [ ] uri
- [ ] v
- [ ] vala
- [ ] vbnet
- [ ] velocity
- [ ] verilog
- [ ] vhdl
- [ ] vim
- [ ] visual-basic
- [ ] warpscript
- [ ] wasm
- [ ] web-idl
- [ ] wiki
- [ ] wolfram
- [ ] wren
- [ ] xeora
- [ ] xml-doc
- [ ] xojo
- [ ] xquery
- [ ] yaml
- [ ] yang
- [ ] zig
```

I hope you like scrolling :)

Since it will probably be mostly the two of us working on the issue, we should both be able to edit the issue (i.e. mark items in the task list as checked). You wouldn't be able to do that if I created the issue/comment.

hoonweiting commented 2 years ago

Sounds good! I'll create a tracking issue later, though I don't expect to work on it until the weekend comes around.

I haven't looked at the source files/file history/PRs/commits yet, but I'm wondering if we should get all that information in one place, like maybe inline comments with a note as to why it has no standard tokens or whatever?

Oh and I guess a related question is, how can we better communicate the lack of highlighting (with CSS) for these non-standard tokens? I see that there were some issues raised in the past wondering why a certain language wasn't getting highlighted, or why it had so little highlighting, stuff like that. I think the only indication of it on the website is in the FAQ...and hmm, I guess this should be part of the website discussion (soon™) and not here. Blah, I'll keep that in mind for that discussion in the future.

Last question for now, perhaps would this be a good opportunity to start some work on #2850 as well?

RunDevelopment commented 2 years ago

I'm wondering if we should get all that information in one place, like maybe inline comments with a note as to why it has no standard tokens or whatever?

I wonder whether we even need to? Non-standard tokens without standard aliases are somewhat rare. A comment explaining their function (if non-obvious) would be nice but I don't think that we need to require comments.

So add comments if you think they are needed, I guess.

perhaps would this be a good opportunity to start some work on #2850 as well?

Except for char, no. #2850 is about adding new standard tokens which is a breaking change if implemented by languages. (I simply wasn't aware that char is a standard token at the time I opened the issue.)

hoonweiting commented 2 years ago

Ah, sorry for being MIA the past few weeks, but I'm back! (Well, maybe when I wake up in the afternoon.) I'll follow your lead with what you've done so far, thanks for doing so much!!!

hoonweiting commented 2 years ago

Okay hi I'm actually alive again!!

I've decided to look at LOLCODE first (lol), and hmm, I'm having a little trouble with it. A variable in LOLCODE may be defined as such:

I HAS A <variable name> ITZ <value>

ITZ itself is a keyword that Prism already captures, but I'm not sure what the following code is supposed to capture:

https://github.com/PrismJS/prism/blob/adcc8784f2f47cb893e344c4b690bcb5132f1652/components/prism-lolcode.js#L46-L49

It's really been a while since I looked at regex (or Prism...), but is this supposed to capture whatever comes after IT, or IT itself, in Prism's context?

I'm asking this because I can't tell whether it's a typo (IT or ITZ) that needs to be fixed. IT is mentioned in the spec under statements and flow control as "the implicit IT variable", but not in the sample code in the spec. Maybe I'm also misunderstanding what an implicit variable is! I did not major in comsci and would gladly defer to the experts :")

RunDevelopment commented 2 years ago

Welcome back :)

Not an expert, but AFAIK, implicit variables are all variables that you did not define yourself. So I would also read the spec as "there is a variable called IT defined by the language."

So the variable token seems to be correct.
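As for the regex half of the question: a rule with `lookbehind: true` treats its first capture group as context and drops it from the token, so what gets tokenized is IT itself. Below is a sketch of that behavior. The rule is a stand-in that I am assuming is close to the linked lines, not a copy of them, and `tokenText` is a simplified imitation rather than Prism's actual matcher.

```javascript
// Stand-in rule; the real one lives in components/prism-lolcode.js.
const rule = { pattern: /(^|\s)IT(?=\s)/, lookbehind: true };

// Simplified imitation of lookbehind handling: the first capture group is
// matched as context but cut off the front of the token text.
function tokenText(rule, text) {
	const match = rule.pattern.exec(text);
	if (!match) return null;
	const skip = rule.lookbehind && match[1] ? match[1].length : 0;
	return match[0].slice(skip);
}

console.log(tokenText(rule, 'VISIBLE IT\n'));        // IT
console.log(tokenText(rule, 'I HAS A VAR ITZ 3\n')); // null
```

The second call shows that ITZ is not picked up by this rule, since the lookahead requires whitespace right after IT.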

hoonweiting commented 2 years ago

Ah okay, thank you for the explanation! LOLCODE is good to go then!

hoonweiting commented 2 years ago

I've got another question, I'm looking at editorconfig right now, and it looks pretty good (other than the broken link to the docs), I'm just wondering if it would be more accurate to use attr-value instead of string to alias value?

https://github.com/PrismJS/prism/blob/8476a9abed2080b99d4f1e39a8ef7defacd2bea4/components/prism-editorconfig.js#L18-L23

RunDevelopment commented 2 years ago

I was wondering the same. The problem is that we don't have token names for key-value pairs. attr-name + attr-value is awkward because we defined them as markup-specific. Some use property + string (e.g. JSON).

INI, systemd, and some other simple key-value config formats have the same issue. Even config formats like JSON, YAML, and TOML are affected to a lesser extent.

Frankly, I don't have a good general solution for this.

hoonweiting commented 2 years ago

Ah, I had a different interpretation of the 'categories' on tokens.html actually; literally they're just categories, to break the table into more digestible portions, and to a smaller extent, sort of to show where those tokens are likely to be found. (Personally it took me a while to find some of the tokens when creating my first theme!) I do not find the categories to restrict anything.

Examples of tokens not being restricted to their category (besides namespace) are inserted and deleted, which are in the Diff category, but can also be found in Brainfuck:

https://github.com/PrismJS/prism/blob/220bc40fb820273d62366279ba9403fe79c2e26b/components/prism-brainfuck.js#L6-L13

In the case of JSON, property + string makes sense, because JavaScript.

I'm not familiar with INI and systemd, but I just took a look and it seems like they use key and value, aliased to attr-name and attr-value respectively. The editorconfig docs calls them "Key" and "Supported values" as well, so perhaps we can bring editorconfig in line with INI and systemd?

As for a general solution...I'm really not sure! I tried looking at YAML and TOML a few hours ago, but hooo I'm not prepared to look at those regexes yet! :")

RunDevelopment commented 2 years ago

I was referring to the description of attr-name/attr-value which links them to markup tags. (I agree that the categories are just for readability.)

That being said, giving them attr-name/attr-value aliases consistently is fine IMO.

Regarding brainfuck: Standard tokens can be used as aliases to give styles to non-standard tokens. It's unusual to use inserted and deleted but still within what we allow.

hooo I'm not prepared to look at those regexes yet! :")

Yeah, YAML especially... YAML is a really complex language, so maybe ignore those regexes for now. If you absolutely needed to understand YAML's regexes, I would recommend reading through YAML's spec first.

hoonweiting commented 2 years ago

I see! I wanted to suggest revising the description, but I can't think of a better one at the moment; and not to mention, maybe I shouldn't be 'bending the rules' to begin with, haha.

Standard tokens can be used as aliases to give styles to non-standard tokens.

Yeah! The whole aliasing semantics thing (https://github.com/PrismJS/prism/issues/354#issuecomment-731504292). Really makes me wonder if the descriptions in tokens.html can be improved... It doesn't feel like there's a lot of room for semantic aliasing at the moment, but I don't know, I'm not a literature/language student, and have no business writing dictionaries! 🤪🤪🤪

Also, I'm looking at git right now, I wanted to add some aliases to some of the non-standard tokens (taking inspiration from the comment linked above), especially since there's been at least three separate issues filed for the seeming lack of highlighting for git blocks. However, I don't really know whether I should proceed right now, given the above discussion, and Prism's intention of leaving the theming to the user (https://github.com/PrismJS/prism/issues/1615#issuecomment-439546390)?

I will, however, submit a PR for editorconfig!

*pretends to not see YAML*

RunDevelopment commented 2 years ago

I wanted to suggest revising the description

Good idea. It might be enough to say that attr-* is primarily but not exclusively used in Markup. Or were you talking about revising something else?

I'm looking at git right now

We should probably hold that off for another day. Right now, git is a strange mix between Diff and Shell sessions. I don't know whether we should keep it like that.

Maybe we could restructure git to add diff support for Shell sessions instead of being its own language? Idk.

hoonweiting commented 2 years ago

Good idea. It might be enough to say that attr-* is primarily but not exclusively used in Markup. Or were you talking about revising something else?

You've got it! In addition to that, it'd be nice to look over the other descriptions too.

Something that's been bugging me is how restrictive the descriptions feel, like only one definition is locked in. Some words have multiple definitions, and maybe we can try having alternative descriptions too? I don't have any solid suggestions right now, so maybe this will not bear any fruit after all!

Right now, git is a strange mix between Diff and Shell sessions.

Got it!

I might not be very helpful here, because I very much prefer using graphical interfaces instead of the command line where possible. But now that you mention it, what about git and Bash?

RunDevelopment commented 2 years ago

Something that's been bugging me is how restrictive the descriptions feel

That's actually a good thing, IMO. Tokens have semantic meaning, so it's better for them to be narrow than to encompass many different concepts.

We already have a token that is semantically ambiguous: namespace. You can see the problem with this ambiguity on your token page:

[Two screenshots from the token page]

foo.bar. and std::sync:: are both namespaces, so why do they look so different? Because styling ambiguous tokens is hard. namespace was originally intended to be used in markup tags, and we just reused it in other languages.

Sure, we still get a pretty color, but it's most likely not what the artist that created the theme intended.

Some words have multiple definitions

Yep, and it's a problem. E.g.: There are multiple languages that have concepts called "tags" that are not markup tags.

hoonweiting commented 2 years ago

We already have a token that is semantically ambiguous: namespace.

Ah hmm... To be really honest I've not used a namespace in HTML before, but looking at the Wikipedia article, I suppose we can move the namespace token out of the markup languages category! And maybe we can rephrase the description slightly, like maybe "A set of names used to identify and refer to objects of various kinds. In XML documents, it provides uniquely named elements and attributes. In non-markup languages, it is used to tokenize the package/namespace part of identifiers."

As for how it looks, maybe this is a ridiculous suggestion, but perhaps we could set the opacity of .namespace to be 0.7 (or whatever value) only for markup languages? (I've been using Tomorrow Night on Prism's site, and it just so happens to be the only official Prism theme that sets a colour and not opacity on .namespace, so I haven't really noticed!) But this is an issue of theming which I'm not sure is within scope?

There are multiple languages that have concepts called "tags" that are not markup tags.

I see, perhaps we can move it out of the markup languages category too. And for tags specifically, that's where I had the problem of restrictive definitions, because, to me at least, when I think of 'tags', I think of labels, like price tags. So aliasing sha_commit to tag made sense to me, since it's kind of a label.

There is also symbol, but I've not used a language with a symbol concept so there's that...

I can submit a PR to make these changes to tokens.html if you're in agreement!

RunDevelopment commented 2 years ago

set the opacity of .namespace to be 0.7 (or whatever value) only for markup languages

You assume that there is a good way to do that :)

Unfortunately, there isn't. There are 2 problems that prevent this from being a simple CSS trick:

  1. Embedded languages.

    A simple language-specific style might look like this: .language-markup .token.namespace { /* styles */ }. This causes the supposedly markup-specific styles to be applied to all languages embedded in markup (e.g. CSS, JS, any language embedded in JS).

    I am not aware of any way to work around this problem.

  2. Languages based on Markup.

    We have many languages that copy tokens from Markup, e.g. JSX, PHP, and all Markup templating languages. These languages only copy markup tokens but not the "markup" name. This makes it difficult to identify them as markup tokens in CSS, since we don't have a language-markup class anywhere.


The best way around this problem might be to narrow down the semantic meaning of standard tokens using aliases. So instead of using just namespace, we could use namespace + markup (alias) to identify a markup namespace.

This is pretty much what we are doing already with semantic aliasing (e.g. function-definition + function, date + number), except that we would officially support these combinations of standard token + non-standard token. (We need a good name for these combinations. Ideas? Maybe "extended standard token"?)

This would allow standard tokens to be more flexible/vague by making the descriptions/meanings of combinations very specific/narrow.

The best thing about these combinations is that they include a standard token, so they are an opt-in mechanism for themes to support more granular highlighting. This means that we can add as many combinations as we like without breaking any themes.
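Here is a small sketch of why this doesn't break themes. It models compound class selectors in plain JavaScript instead of real CSS; the `matches` helper is made up for the illustration, and `markup` as an alias on namespace tokens is the proposal, not something Prism emits today.

```javascript
// A compound class selector such as ".token.namespace.markup" matches an
// element when every class in the selector is present on the element.
function matches(selector, classAttr) {
	const wanted = selector.split('.').filter(Boolean);
	const present = new Set(classAttr.split(/\s+/));
	return wanted.every((name) => present.has(name));
}

// Class lists a token would carry; the "markup" alias is the proposal.
const xmlNamespace = 'token namespace markup';
const rustNamespace = 'token namespace';

// A theme that only knows the standard token styles both.
console.log(matches('.token.namespace', xmlNamespace));  // true
console.log(matches('.token.namespace', rustNamespace)); // true

// A theme that opts in can single out the markup namespace.
console.log(matches('.token.namespace.markup', xmlNamespace));  // true
console.log(matches('.token.namespace.markup', rustNamespace)); // false
```

Unlike the .language-markup approach, this keeps working for embedded languages and for languages that copy Markup's tokens, because the information travels with the token itself.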

Thoughts?

So aliasing sha_commit to tag made sense to me, since it's kind of a label.

Really? I would have said that a sha_commit is a SHA-1 hash, a number.

hoonweiting commented 2 years ago

You assume that there is a good way to do that :)

Damn, you're right! I would not have noticed the fallacies in my suggestion haha

narrow down the semantic meaning of standard tokens using aliases

Your "extended standard tokens" (gotta sleep on that name) makes a lot of sense, and feels like something (similar to?) you and Josh discussed in April, or at least, what I can grasp of it! It does seem to tie a bit into #2850 as well? I mean, maybe I'm misunderstanding somewhere, but it looks like a way to ease into #2850, and killing two birds with one stone without breaking other things sounds great!

Also, capturing and listing (some of) these non-standard tokens would be useful too, so there would be some consistency/re-use of non-standard tokens across languages in a way.

Besides that, I think I'm really hitting the limits of the level I can think at right now. I'm happy to help with the code/docs to whatever extent I can, but I'm sorry I can't quite contribute to the conversation! :")

I would have said that a sha_commit is SHA-1 hash, a number.

Right, it's a hash. As a layperson, both work for me!

RunDevelopment commented 2 years ago

It does seem to tie a bit into #2850 as well?

You're right, it does. These "extended standard tokens"/token combinations (let's hope this name doesn't stick) are pretty much what I wanted and (if implemented) would resolve #2850.

Also, capturing and listing (some of) these non-standard tokens would be useful too, so there would be some consistency/re-use of non-standard tokens across languages in a way.

Good point.

I'm happy to help with the code/docs to whatever extent I can

Thank you so much!

Besides that, I think I'm really hitting the limits of the level I can think at right now. [...] I'm sorry I can't quite contribute to the conversation! :")

Oh, I think you contributed quite a lot to it already. Thank you!

hoonweiting commented 2 years ago

I don't have a better name for these token pairs yet, but I was wondering if we could avoid naming it altogether. Currently I'm picturing tokens.html to contain these sections (very, very rough descriptions):

This way, it might be possible to avoid naming the "extended standard tokens"! 🤪

Although, giving these "extended standard tokens" an actual name would help a lot in discussions!

joshgoebel commented 2 years ago

What Highlight.js ended up doing:

By having the 1st party/3rd party split we can say "hey lets keep it official in core" and with 3rd party grammars say "do whatever you think is best for your grammar, no rules at all"... I don't think Prism has that luxury though.

RunDevelopment commented 2 years ago

I was wondering if we could avoid naming it altogether.

Naming things is powerful. They aren't just non-standard tokens. A non-standard token is just that, a single token name that isn't a standard token name. However, these combinations form something new, so they need a name.

As @joshgoebel pointed out, these combinations are somewhat similar in function to Highlight.js' sub scopes. So "sub scopes" might be a good inspiration for the name of these combinations, though I'd rather avoid calling them "substandard tokens".


As for the structure of tokens.html: I was imagining something like this: