PrismJS / prism

Lightweight, robust, elegant syntax highlighting.
https://prismjs.com
MIT License

[Brainstorming] Grammars with custom tokenization #3539

Open RunDevelopment opened 2 years ago

RunDevelopment commented 2 years ago

Motivation

I'm currently working on migrating JS templates for v2. The language is really useful, but it is broken because embedded languages are fundamentally broken right now:

Any language that relies on hooks to do tokenization cannot be embedded. This includes all languages that use Markup Templating (e.g. PHP).

Description

The general feature I want to request/brainstorm here is a way for grammars to provide a custom tokenizer on a grammar level.

The "grammar level" part is important. We have to be able to use them with inside and rest in existing Prism grammars. This is the key distinction from the current hook-based approach, where custom tokenization happens on a highlighting level (literally within Prism.highlight).
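To make the distinction concrete, here is a toy model of the problem. It is not Prism's real implementation: `ToyGrammar`, `hooks`, `tokenize`, and `highlight` are all hypothetical stand-ins. It only illustrates why a tokenizer that lives at the highlighting level is never reached when its language is embedded via inside.

```typescript
// Toy model only; none of these names are Prism's actual API.
type Token = { type: string; content: string | TokenStream };
type TokenStream = (string | Token)[];

interface Pattern { pattern: RegExp; inside?: ToyGrammar }
interface ToyGrammar { [tokenName: string]: Pattern }

// Grammar-level tokenization: this is the only place that recurses into `inside`.
function tokenize(code: string, grammar: ToyGrammar): TokenStream {
  for (const [type, { pattern, inside }] of Object.entries(grammar)) {
    const match = pattern.exec(code); // patterns are assumed to be non-global
    if (!match) continue;
    const before = code.slice(0, match.index);
    const after = code.slice(match.index + match[0].length);
    const content = inside ? tokenize(match[0], inside) : match[0];
    return [...tokenize(before, grammar), { type, content }, ...tokenize(after, grammar)];
  }
  return code ? [code] : [];
}

// Highlight-level customization: the hook is consulted only here.
const hooks: Record<string, (code: string) => TokenStream> = {};

function highlight(code: string, grammar: ToyGrammar, language: string): TokenStream {
  const custom = hooks[language];
  return custom ? custom(code) : tokenize(code, grammar);
}

// A language that relies entirely on its hook for tokenization.
const tpl: ToyGrammar = {};
hooks["tpl"] = (code) => [{ type: "tpl-block", content: code }];

// A host language that embeds `tpl` through `inside`.
const host: ToyGrammar = {
  embedded: { pattern: /<%[\s\S]*?%>/, inside: tpl },
};

const direct = highlight("<% x %>", tpl, "tpl");        // hook runs
const embedded = highlight("a <% x %> b", host, "host"); // hook is never reached
```

In this model, `direct` contains a `tpl-block` token, while the embedded region in `embedded` is left as plain text, because the recursion in `tokenize` only sees the (empty) grammar object and knows nothing about the hook.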

Ideas

  1. We could redefine grammars like this: type Grammar = PrismGrammar | CustomGrammar, where Prism grammars are what we currently use and custom grammars are just functions that take some code and return a token stream (type CustomGrammar = (code: string) => TokenStream).

    The main advantage of this approach is that it is very simple. However, it doesn't work well with existing practices like extending languages. To fix this, we could:

  2. Make custom tokenizers a property of grammars. Similar to rest, we could have a tokenize property/method for grammars that defines the custom tokenizer. So the type of grammars would be this (pseudo code):

    interface Grammar {
      // the actual tokens
      rest?: Grammar;
      tokenize?: (this: Grammar, code: string) => TokenStream;
      // Note: the grammar might also be passed via a regular argument instead of `this`, not sure yet.
    }

    The main advantage of this approach is that it adds to what existing grammars can do.

    Languages that extend a grammar with a custom tokenizer can also change the custom tokenizer by changing its grammar token. This is very important, because the custom tokenizer might reference a language and we might have to change this reference.

    E.g. JS templates reference both the embedded language (for obvious reasons) and JS (for interpolation expressions in template strings). The TypeScript grammar extends JS, so we need to be able to change the reference to JS in template strings to TS. With this approach, our current extend function will automatically change the JS references to TS references, which is nice.

    However, the main downside of this approach is that it complicates what grammars are. Grammars used to be purely declarative, but with this idea, grammars become more imperative, object-like entities. Of course, this only applies to grammars with custom tokenizers.
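A rough sketch of how number 2 could behave with extended languages. Everything here is an assumption for illustration: `SketchGrammar`, the `name` and `interpolation` fields, `tokenizeWith`, and this simplified `extend` are hypothetical and not taken from Prism or from the actual implementation. The point is only that a tokenizer which reads its language reference through `this` picks up the new reference when the grammar is extended.

```typescript
// Hypothetical sketch of idea 2; not Prism's actual types or functions.
type Token = { type: string; content: string | TokenStream };
type TokenStream = (string | Token)[];

interface SketchGrammar {
  name: string;                  // demo-only label
  interpolation?: SketchGrammar; // stands in for a grammar token that references a language
  tokenize?: (this: SketchGrammar, code: string) => TokenStream;
}

const js: SketchGrammar = { name: "javascript" };
js.interpolation = js; // template strings interpolate JS expressions

js.tokenize = function (code) {
  // Look up the referenced language via `this`, not via a captured variable.
  const target = this.interpolation ?? this;
  // Split on ${...} and tag each interpolation with the referenced language.
  return code
    .split(/(\$\{[^}]*\})/)
    .filter(Boolean)
    .map((part) =>
      part.startsWith("${") ? { type: `language-${target.name}`, content: part } : part
    );
};

// Simplified stand-in for an extend function: shallow-copies the grammar
// and redirects references to the base grammar to the new grammar.
function extend(base: SketchGrammar, name: string): SketchGrammar {
  const copy: SketchGrammar = { ...base, name };
  if (copy.interpolation === base) copy.interpolation = copy;
  return copy;
}

// Uses the custom tokenizer if the grammar has one; otherwise returns plain text.
function tokenizeWith(grammar: SketchGrammar, code: string): TokenStream {
  return grammar.tokenize ? grammar.tokenize(code) : [code];
}

const ts = extend(js, "typescript");
const jsTokens = tokenizeWith(js, "a ${x} b"); // interpolation tagged language-javascript
const tsTokens = tokenizeWith(ts, "a ${x} b"); // interpolation tagged language-typescript
```

The same `tokenize` function is shared by both grammars; only the reference it finds through `this` differs. A plain function grammar as in number 1 would have to capture the JS reference directly, so extending it would not change that reference.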

Number 2 is a lot more useful than number 1, but it also makes grammars object-like, which I'm not sure I like.


This is a pretty major addition, so I would really like to hear your thoughts on this.

@LeaVerou @mAAdhaTTah @Golmote @JaKXz

RunDevelopment commented 2 years ago

So I went ahead and implemented number 2 (#3541). It works pretty well, so I'm probably going to go with this. However, there's still some time until we release v2, so please share your thoughts on this.