PrismJS / prism

Lightweight, robust, elegant syntax highlighting.
https://prismjs.com
MIT License
12.33k stars 1.29k forks source link

[Tracking issue] Token types guarantee #3170

Open hoonweiting opened 3 years ago

hoonweiting commented 3 years ago

This is a tracking issue for the second half of #2849!

Long long list of languages ahead:

hoonweiting commented 2 years ago

I'm looking at systemd right now, and I'm noticing some inconsistencies between systemd, INI, and editorconfig.

These three file types generally have key = value. INI and editorconfig considers the = to be punctuation, while systemd considers it an operator. I'm not sure what = should be for all three languages, what do you think?

These three file types also have [SectionName], but the tokens are a little different:

https://github.com/PrismJS/prism/blob/441a14221fca2f2d081e1f44e8cfd0e40bfddb7c/components/prism-editorconfig.js#L4-L13

https://github.com/PrismJS/prism/blob/441a14221fca2f2d081e1f44e8cfd0e40bfddb7c/components/prism-ini.js#L13-L24

https://github.com/PrismJS/prism/blob/441a14221fca2f2d081e1f44e8cfd0e40bfddb7c/components/prism-systemd.js#L15-L25

Officially, editorconfig calls them Section Headers, INI calls them stanza names, and systemd calls them sections. Regardless of the main token name, they do not have consistent aliases. I'm not very sure what to do here!

This is what I'm thinking:

RunDevelopment commented 2 years ago

I'm not sure what = should be for all three languages

Hard to say. Syntactically speaking, = is just a separator token, so punctuation. However, highlighting it as an operator would probably look better in some themes, so I can understand why someone would use operator.

Honestly, I think both operator and punctuation are acceptable here. Just go with which you like best. ¯\_(ツ)_/¯

Officially, editorconfig calls them Section Headers, INI calls them stanza names, and systemd calls them sections.

That depends on who you ask. To my knowledge, there is no official spec for INI, so I don't think we have to worry about names here.

Give all of them their official names (including the [ and ]).

Hmmm. Since the headers and sections all have the same function, I would favor giving them the same name. I would call of all of them "section".

These languages aren't so complex or popular that we have to offer very granular token names, so I think it's ok to go for more generic names.

Alias them to either selector (what INI and systemd use) or keyword (what editorconfig uses), though retaining the existing patterns (so this alias applies only to the SectionName part for INI and systemd, while it applies to the whole [SectionName] for editorconfig).

Good idea.

hoonweiting commented 2 years ago

I'm looking at AutoIt right now (docs), and I'm not very sure about this one:

https://github.com/PrismJS/prism/blob/081d515aa5493ac54bb77071e9f7e387acf6c2ee/components/prism-autoit.js#L21-L25

This will highlight stuff like #include and #NoTrayIcon, but it will miss out #include-once (which is the only directive with a hyphen in its name). Since this is about the regex rather than the token itself, I'm not very sure how to address it.

The rest of it looks fine to me though!

RunDevelopment commented 2 years ago

Huh, that's just a good old bug that needs some fixing. Do you want to make a PR, or shall I?

However, regex bugs don't have anything to do with token types, so you can mark AutoIt as done.

hoonweiting commented 2 years ago

Hmm, I'd like to give it a shot, is this right?

pattern: /(^[\t ]*)#[\w-]+/m
RunDevelopment commented 2 years ago

Yup, that works. Thank you for taking this on.

hoonweiting commented 2 years ago

Hmm, I'm not super sure about this one:

https://github.com/PrismJS/prism/blob/c30b736f53e1fa59ed2f02891a3a9b7feecb5cc3/components/prism-ada.js#L12

Ada simply calls it an attribute, and it is defined in the 2012 Language Reference Manual (page 98 on the version without change bars):

Is attr-name suitable? Or should attr-name (or some other standard token) be used as an alias instead?

RunDevelopment commented 2 years ago

You're right. Some name like attribute + attr-name as its alias would probably be better.

hoonweiting commented 2 years ago

@RunDevelopment I see! I'll make a PR for it.

Also, I just quickly looked through the other languages, I don't think there's any more config-type languages that don't use key and value, except maybe .ignore?

https://github.com/PrismJS/prism/blob/c30b736f53e1fa59ed2f02891a3a9b7feecb5cc3/components/prism-ignore.js#L5-L16

Though it's a slightly different question: is string the most suitable alias?

RunDevelopment commented 2 years ago

Well, ignore (think .gitignore) isn't really a config language like INI or .properties. Ignore just contains file (glob) patterns (plus some comments), so the key-value semantics of other languages don't apply to it.

is string the most suitable alias?

I'm not sure about "most" suitable, but it's a pretty good IMO. Entries in ignore file will be file path and file path patterns, so string seems like a decent choice to me. Since we are talking about patterns, regex would also be fitting.

Apart from string and regex, I can't think of any standard tokens that are semantically close to the entries in ignore files. Got any suggestions?

hoonweiting commented 2 years ago

I guess selector makes some sense in that these file paths and file path patterns are being chosen to be ignored (or not), but I agree that string works well too! I don't think regex is a good idea, but only because the token already allows for regex to be contained inside.

Overall, I think keeping it as it is is alright!

hoonweiting commented 2 years ago

I'm having some doubts about AutoHotkey 😅

First of all, with respect to the token types guarantee, I'm not sure about the following tokens and how they are classified (constant, symbol, and important make sense in terms of classification though). I'll admit, I've never used AutoHotkey myself, so I might not be informed enough here, perhaps there are nuances that I am not seeing:

https://github.com/PrismJS/prism/blob/c30b736f53e1fa59ed2f02891a3a9b7feecb5cc3/components/prism-autohotkey.js#L25-L35

Secondly, based on this Alphabetical Command and Function index, it seems that there are bunch that are missing, such as ComObject() and IsByRef(). I don't mind going through the list and adding them, but I am not sure if they were intentionally left out or something! Do you have any idea?

RunDevelopment commented 2 years ago

I'm not sure about the following tokens and how they are classified

I've never used AutoHotkey myself

Same, but I highly doubt that AutoTrim is a selector... It's probably safe to assume that the tokens you mentioned don't follow the rules for standard tokens.

it seems that there are bunch that are missing

They were probably added to the language after our AutoIt grammar was last updated by someone. This is the reason we typically try to avoid long list of function/type/command names.

Regarding directives (currently important), I think we could just use /#[a-z]+\b/i and save ourselves the work of maintaining another list.

hoonweiting commented 2 years ago

I highly doubt that AutoTrim is a selector...

Same! I really have no idea which token we can use for it. I wanted to call it a function, but it's already used. Then I thought of builtin, but it doesn't quite fit with the existing names in builtin. Maybe they can be combined with keyword? (Maybe I should read the docs more to see if there's a distinction between these two groups though.)

This is the reason we typically try to avoid long list of function/type/command names.

I see your point, do you think it's possible to write another regex that can replace (some of) the lists? I'm not familiar enough with regex to answer this 😅

Regarding directives (currently important), I think we could just use /#[a-z]+\b/i and save ourselves the work of maintaining another list.

I appreciate your wisdom! Would you like to make the PR for it? It would be nice to use directive as the token name instead too, though I'm not sure what standard token it can be aliased to...

hoonweiting commented 2 years ago

Sorry to bring up another topic! I'm looking at Erlang, and it seems pretty alright, but I have a question about the operator token:

https://github.com/PrismJS/prism/blob/cde0b5b25f15465456b6e2ee6049a433b9f9d634/components/prism-erlang.js#L28-L29

I'm not sure whether the keywords there should be classified as operator or keyword. I remember this was mentioned early on in #2849, but I'm not sure what the conclusion is. I see that there's no change to SQL (the original example that was brought up), so I suppose I don't have to update Erlang here either? What do you think?

RunDevelopment commented 2 years ago

Regarding Autohotkey:

I think I'll make a PR with that. Not sure if all of those changes are that great, but they're probably a little better than what we have now. Let's continue the discussion about this topic there.

Would you like to make the PR for it?

Done.

RunDevelopment commented 2 years ago

Sorry to bring up another topic!

Don't worry. You're a great help, so don't hesitate to ask.

I'm not sure whether the keywords there should be classified as operator or keyword.

Honestly, I'm not sure what to do about them myself... They fit into both operator and keyword but using both classes together doesn't work with standard tokens. I think it's okay to just leave Erlang operators as they are right now until we find a good solution.

hoonweiting commented 2 years ago

I think it's okay to just leave Erlang operators as they are right now until we find a good solution.

Sounds great, thank you 😄

hoonweiting commented 2 years ago

I'm currently looking at MEL (Maya Embedded Language), it seems mostly fine, but I have some questions:

MEL has flags, and they are currently aliased to operator in Prism:

https://github.com/PrismJS/prism/blob/31a38d0ca02e6399637a6c693f65f1b007c71c63/components/prism-mel.js#L21-L24

I'm not really sure if operator is the best token for aliasing, I feel like attr-name or property or builtin would be a better fit in terms of meaning. What do you think?

Next, there's code, which highlights code that's contained within backticks. The purpose is defined here (essentially evaluates the code within it and returns a value).

https://github.com/PrismJS/prism/blob/cf38d0590e2a2688aa75ea735f191e75bcab0950/components/prism-mel.js#L3-L14

I don't think italics is a meaningful alias, although I can see how having the code italicized helps a lot in readability. As such, using italics as an alias would allow the code to be italicized without requiring the user to have to have to modify their Prism CSS file at all. But since we're on the topic of using standard tokens appropriately, I wonder if we can give this a pass?

Finally there are built-in commands (very long list! Probably even longer than AutoHotKey!).

https://github.com/PrismJS/prism/blob/cf38d0590e2a2688aa75ea735f191e75bcab0950/components/prism-mel.js#L26

Prism currently calls them functions but I think builtin is a better fit as it is possible for users to define their own functions (or procedures, as MEL calls it). And speaking of user-defined functions, I wonder if it's possible to capture them (at least when they're being defined) with regex?

hoonweiting commented 2 years ago

I'm looking at Flow, and I'm not sure if tag is a good alias for type.

https://github.com/PrismJS/prism/blob/cf38d0590e2a2688aa75ea735f191e75bcab0950/components/prism-flow.js#L5-L10

I'm thinking maybe class-name would be more suitable?

Flow also has function-variable like JavaScript:

https://github.com/PrismJS/prism/blob/cf38d0590e2a2688aa75ea735f191e75bcab0950/components/prism-flow.js#L12

In JS, it's aliased to function. I suppose we can do the same for Flow?

hoonweiting commented 2 years ago

I'm looking at Excel formula:

I wonder if function or builtin would be a better fit than function-name?

https://github.com/PrismJS/prism/blob/cf38d0590e2a2688aa75ea735f191e75bcab0950/components/prism-excel-formula.js#L40-L43

keyword also makes some sense though.

I also see why range and cell have the same alias, property:

https://github.com/PrismJS/prism/blob/cf38d0590e2a2688aa75ea735f191e75bcab0950/components/prism-excel-formula.js#L44-L59

But perhaps selector or variable would be a better fit?

Also, as a note, I haven't really been paying attention to the colours, so there's a chance that the tokens will end up having the same colours in some themes 😅

RunDevelopment commented 2 years ago

I'm not really sure if operator is the best token for aliasing, I feel like attr-name or property or builtin would be a better fit in terms of meaning. What do you think?

I agree, attr-name or property would be better.

Next, there's code

Yeah, italics is weird. I mean, it does highlight code sections well, but I wonder if we could achieve the same result with more convential aliases.

Finally there are built-in commands (very long list! Probably even longer than AutoHotKey!).

Oof. 16kb of built-in commands... Yes, builtin is a better name.

And speaking of user-defined functions, I wonder if it's possible to capture them (at least when they're being defined) with regex?

Procedure calls will be very difficult to detect because the parenthesis are optional. Procedure definition might be doable, depends on the return type. Assuming that the (optional) return type is only a single word, { pattern: /(\bproc\s+(?:\w+\s+)?)[a-zA-Z_]\w*(?=\s*\()/, lookbehind: true } will work.

I'm looking at Flow, and I'm not sure if tag is a good alias for type.

Definitely class-name, you're right.

Flow also has function-variable like JavaScript:

Flow extends JS and changes the regex. So it also inherits the function alias from JS.

image

I'm looking at Excel formula: I wonder if function or builtin would be a better fit than function-name?

I think I did this because Excel only has built-in functions and keyword as still available. You're right, builtin is better.

But perhaps selector or variable would be a better fit?

They are essentially just the names/indices of cells in a table. So I don't think variable fits. I do like selector though.

Also, as a note, I haven't really been paying attention to the colours

Don't worry about it. You win some, you lose some. Having a consistent standard is more important that some languages not looking their best in some themes.

hoonweiting commented 2 years ago

I wonder if we could achieve the same result with more convential aliases.

I see! I'm not sure what's a good one though, I think italics works because it's the token that is closest to guaranteeing font-style: italic. Maybe we'll think of something soon!

16kb of built-in commands...

I was close to offering to update the list, but I'm not sure now :P I was wondering if we could use a pattern instead, but looking at the snippets earlier, I highly doubt that it's possible.

Procedure calls will be very difficult to detect...

I see! I think the function definition would be good enough, consistent with other languages. I presume you'll make a PR for the new regex?

Flow extends JS and changes the regex. So it also inherits the function alias from JS.

Oh right, thanks for catching that! :)

I'll make the PR for Excel and Flow shortly!

RunDevelopment commented 2 years ago

I was wondering if we could use a pattern instead, but looking at the snippets earlier, I highly doubt that it's possible.

Hmmm, maybe. I'll look into it. Since built-in commands/functions and user-defined procs use the same calling syntax, it will detect both, though.

Edit: Yup, worth it. Mel got 90% lighter. #3393

hoonweiting commented 2 years ago

Wow, that's amazing!!! I am once again impressed by your regex skills :D