Chevrotain / chevrotain

Parser Building Toolkit for JavaScript
https://chevrotain.io
Apache License 2.0

Creating a "catch all" token #2017

Closed knpwrs closed 5 months ago

knpwrs commented 5 months ago

I am using Chevrotain to try and make a lexer for the liquid templating language. Consider the following template (including the comment):

<!-- if array = [1,2,3,4,5,6] -->
{% for item in array limit:2 %}
  {{ item }}
{% endfor %}

As a first step, I am making a multi-mode lexer. The first mode, main, has three tokens:

export const ObjectStart = createToken({
  name: 'ObjectStart',
  pattern: /{{-?/,
  push_mode: MODE_OBJECT,
})

export const TagStart = createToken({
  name: 'TagStart',
  pattern: /{%-?/,
  push_mode: MODE_TAG,
})

export const Text = createToken({
  name: 'Text',
  pattern: /[\s\S]+/,
  line_breaks: true,
})

const lexer = new Lexer({
  modes: {
    'main': [ObjectStart, TagStart, Text],
    'object': [/* not relevant to issue */],
    'tag': [/* not relevant to issue */],
  },
  defaultMode: 'main',
})

Already you can see a problem with my Text token in that it will consume everything since ObjectStart and TagStart don't match. Essentially I want to match everything up until either {{ opens a liquid object or {% opens a liquid tag. I've tried /(?!{{|{%)+/ but this pattern matches empty strings. /(.+)(?:{{|{%)?/ appears to work, but in every case, including /[\s\S]+/, I am hitting something that I simply do not understand.

My lexer returns the following errors:

{
  "errors": [
    {
      "column": 1,
      "length": 1,
      "line": 1,
      "message": "unexpected character: -><<- at offset: 0, skipped 1 characters.",
      "offset": 0,
    },
    {
      "column": 2,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->!<- at offset: 1, skipped 1 characters.",
      "offset": 1,
    },
    {
      "column": 3,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->-<- at offset: 2, skipped 1 characters.",
      "offset": 2,
    },
    {
      "column": 4,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->-<- at offset: 3, skipped 1 characters.",
      "offset": 3,
    },
  ]
}

The initial <!-- does not match, and then the tokens start at if array. With pattern set to /(.+)(?:{{|{%)?/, I get the following errors:

  "errors": [
    {
      "column": 34,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->
<- at offset: 33, skipped 1 characters.",
      "offset": 33,
    },
    {
      "column": 66,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->
<- at offset: 65, skipped 1 characters.",
      "offset": 65,
    },
    {
      "column": 79,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->
<- at offset: 78, skipped 1 characters.",
      "offset": 78,
    },
    {
      "column": 92,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->
<- at offset: 91, skipped 1 characters.",
      "offset": 91,
    },
  ]

Basically every new line is unexpected.
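
The newline errors are consistent with how `.` behaves in plain JavaScript regular expressions: without the `s` (dotAll) flag it never matches a line terminator, so a `.+` based Text pattern stops at the end of each line. The sketch below shows this outside of Chevrotain, with the braces escaped for readability. It only illustrates regexp behavior and makes no claim about the lexer's internals.

```javascript
// The template from the issue, ending with a trailing newline.
const template = [
  "<!-- if array = [1,2,3,4,5,6] -->",
  "{% for item in array limit:2 %}",
  "  {{ item }}",
  "{% endfor %}",
  "",
].join("\n");

// `.` does not match "\n", so the match ends at the first line break.
const dotPattern = /(.+)(?:\{\{|\{%)?/;
const dotMatch = dotPattern.exec(template)[0];

// `[\s\S]` matches any character, so the greedy `+` runs to the end of the input.
const anyPattern = /[\s\S]+/;
const anyMatch = anyPattern.exec(template)[0];

console.log(JSON.stringify(dotMatch));            // first line only, no "\n"
console.log(anyMatch.length === template.length); // true
```

Neither variant stops at `{{` or `{%`: the first stops too early (at the line break), the second never stops.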

I've also tried a variation on moving the /(.+)(?:{{|{%)?/ pattern to the front of the mode, but that's producing errors of its own.

What is the best way to create a "catch all" token that captures everything up until another token in the current mode would be valid?

Semantically, in a liquid template everything that is outside of an object ({{ }}) or a tag ({% %}) is just text.

EDIT: I've also tried /([\s\S]+)(?:{{|{%)?/ and this appears to also produce the same errors as originally.

bd82 commented 5 months ago

Hello @knpwrs

What is the best way to create a "catch all" token that captures everything up until another token in the current mode would be valid?

Your approach: Greedy Matching and Lookahead

Many of the regexp patterns you have tried still allow the main part of the pattern to match these two-character sequences ({{ and {%), and only follow it with an optional group or lookahead.

I have not tested this, so there may be other issues, but afaik regexp quantifiers are greedy by default (they attempt the longest match). The pattern would therefore match the longest sub-string of the input that fits it, rather than the shortest sub-string up to the next {{ or {%.

Using non-greedy quantifiers (+? *?) may help, but I'm still wary of combining them with an optional trailing group or lookahead.

Suggestion (try this)

My default approach in this case would be to not allow the pattern to match the two-character sequences which mark the beginning of the "meaningful" part of the template. So I would define the "free text" part as a repetition of either a character that is not {, or a { followed by a character that is neither % nor {:

e.g: /([^{]|({[^%{]))+/
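
A minimal sketch of how the suggested pattern behaves on the template from the issue, using plain JavaScript only. The `matchAt` helper is hypothetical: it uses the sticky (`y`) flag to imitate a lexer trying a pattern at a given offset, and is not part of Chevrotain.

```javascript
const template = [
  "<!-- if array = [1,2,3,4,5,6] -->",
  "{% for item in array limit:2 %}",
  "  {{ item }}",
  "{% endfor %}",
  "",
].join("\n");

// The suggested pattern, with the literal "{" escaped and the sticky flag added.
const freeText = /([^{]|(\{[^%{]))+/y;

// Hypothetical helper: try `pattern` exactly at `offset`, return the matched text or null.
function matchAt(pattern, text, offset) {
  pattern.lastIndex = offset;
  const m = pattern.exec(text);
  return m === null ? null : m[0];
}

const first = matchAt(freeText, template, 0);     // comment line plus its line break
const atTag = matchAt(freeText, template, 34);    // offset 34 is "{%": no Text match
const afterTag = matchAt(freeText, template, 65); // line break and indentation before "{{"

console.log(JSON.stringify(first));    // "<!-- if array = [1,2,3,4,5,6] -->\n"
console.log(atTag);                    // null
console.log(JSON.stringify(afterTag)); // "\n  "
```

Because the line break is an ordinary non-{ character, it ends up inside the Text match instead of being reported as unexpected.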

Edge Case

There is still an edge case where the last token in the input is "free text" that ends with a single { character. I don't think Chevrotain allows you to include the end-of-input anchor ($) in the pattern regexps, but that could potentially be handled by simple pre-lexing input processing (appending another character if the input ends with {).
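
The edge case can be reproduced with plain JavaScript regexps. The sketch below compares the suggested pattern, the padding workaround described above, and a variant that swaps the `[^%{]` character class for a negative lookahead. The lookahead variant is an alternative, not something proposed in this thread, and whether Chevrotain's lexer accepts it has not been verified here. `matchAt` is a hypothetical helper.

```javascript
// Hypothetical helper: try `pattern` exactly at `offset`, return the matched text or null.
function matchAt(pattern, text, offset) {
  pattern.lastIndex = offset;
  const m = pattern.exec(text);
  return m === null ? null : m[0];
}

const suggested = /([^{]|(\{[^%{]))+/y;
// Alternative: a "{" counts as text unless it is followed by "{" or "%".
// The lookahead consumes nothing, so it also succeeds at the end of the input.
const withLookahead = /(?:[^{]|\{(?![{%]))+/y;

const input = "price {";

const plain = matchAt(suggested, input, 0);          // trailing "{" is left unmatched
const lookahead = matchAt(withLookahead, input, 0);  // trailing "{" is included

// Pre-lexing workaround: pad the input when it ends with "{".
// Note that the padding character becomes part of the last token's image.
const padded = input.endsWith("{") ? input + " " : input;
const afterPadding = matchAt(suggested, padded, 0);

console.log(JSON.stringify(plain));        // "price "
console.log(JSON.stringify(lookahead));    // "price {"
console.log(JSON.stringify(afterPadding)); // "price { "
```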

knpwrs commented 5 months ago

I wound up trying a custom token:

export const Text = createToken({
  name: 'Text',
  line_breaks: true,
  pattern: {
    exec: (text, startOffset) => {
      let endOffset = startOffset
      let charCode = text.charCodeAt(endOffset)
      let nextCharCode = text.charCodeAt(endOffset + 1)

      while (
        !Number.isNaN(charCode) &&
        !Number.isNaN(nextCharCode) &&
        charCode !== OpenBrace &&
        nextCharCode !== OpenBrace &&
        nextCharCode !== PercentSign
      ) {
        endOffset += 1
        charCode = text.charCodeAt(endOffset)
        nextCharCode = text.charCodeAt(endOffset + 1)
      }

      if (endOffset === startOffset) {
        return null
      }

      const match = text.substring(startOffset, endOffset)
      return [match]
    },
  },
})

And I am very confused by this output:

{
  "errors": [
    {
      "column": 34,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->
<- at offset: 33, skipped 1 characters.",
      "offset": 33,
    },
    {
      "column": 2,
      "length": 1,
      "line": 2,
      "message": "unexpected character: -> <- at offset: 67, skipped 1 characters.",
      "offset": 67,
    },
    {
      "column": 13,
      "length": 1,
      "line": 2,
      "message": "unexpected character: ->
<- at offset: 78, skipped 1 characters.",
      "offset": 78,
    },
    {
      "column": 26,
      "length": 1,
      "line": 2,
      "message": "unexpected character: ->
<- at offset: 91, skipped 1 characters.",
      "offset": 91,
    },
  ],
  "groups": {},
  "tokens": [
    {
      "endColumn": 33,
      "endLine": 1,
      "endOffset": 32,
      "image": "<!-- if array = [1,2,3,4,5,6] -->",
      "startColumn": 1,
      "startLine": 1,
      "startOffset": 0,
      "tokenType": {
        "CATEGORIES": [],
        "LINE_BREAKS": true,
        "PATTERN": {
          "exec": [Function],
        },
        "categoryMatches": [],
        "categoryMatchesMap": {},
        "isParent": false,
        "name": "Text",
        "tokenTypeIdx": 11,
      },
      "tokenTypeIdx": 11,
    },
    {
      "endColumn": 36,
      "endLine": 1,
      "endOffset": 35,
      "image": "{%",
      "startColumn": 35,
      "startLine": 1,
      "startOffset": 34,
      "tokenType": {
        "CATEGORIES": [],
        "PATTERN": /\\{%-\\?/,
        "PUSH_MODE": "tag",
        "categoryMatches": [],
        "categoryMatchesMap": {},
        "isParent": false,
        "name": "TagStart",
        "tokenTypeIdx": 10,
      },
      "tokenTypeIdx": 10,
    },

Why would the line breaks be unexpected? I have line_breaks: true.

I'm also thinking perhaps it would be beneficial for Chevrotain to ship an official Mustache Template Syntax lexer/parser. Mustache is the simplest language I'm aware of for this style of templates and it would demonstrate how to work around this problem for all similar languages.

bd82 commented 5 months ago

line_breaks: true does not make the token able to include line breaks. Instead it tells Chevrotain that the token may include line breaks, so it should update the line/column trackers after matching it.

If you want your Text token to handle the new lines, you have to explicitly implement it in your custom token code. Although I suspect your code does handle it.

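I suspect you may have a logical bug where your loop halts one index before the expected position: it stops as soon as the next character is { or %, so the character just before a {{ or {% (here the line break) is never included in the Text token.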

You should also test the edge case of a Text token which is the last token in the input.

<!-- if array = [1,2,3,4,5,6] -->
{% for item in array limit:2 %}
  {{ item }}
{% endfor %}
123456
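
A minimal sketch of a scanning function without that off-by-one, written as plain JavaScript so it can be tested on its own. The name `matchText` is hypothetical, and `OpenBrace` and `PercentSign` are assumed to be the character codes of `{` and `%`, as the custom token above suggests. It stops only when the current character is `{` and the next one is `{` or `%`, and it runs to the end of the input otherwise.

```javascript
// Assumed definitions of the constants used by the custom token above.
const OpenBrace = "{".charCodeAt(0);
const PercentSign = "%".charCodeAt(0);

// Returns [matchedText] or null, the same shape the custom pattern above returns.
function matchText(text, startOffset) {
  let endOffset = startOffset;
  while (endOffset < text.length) {
    if (text.charCodeAt(endOffset) === OpenBrace) {
      // charCodeAt returns NaN past the end, which matches neither delimiter.
      const next = text.charCodeAt(endOffset + 1);
      if (next === OpenBrace || next === PercentSign) break;
    }
    endOffset += 1;
  }
  if (endOffset === startOffset) return null; // nothing consumed: not a Text token
  return [text.substring(startOffset, endOffset)];
}

const sample = [
  "<!-- if array = [1,2,3,4,5,6] -->",
  "{% for item in array limit:2 %}",
  "  {{ item }}",
  "{% endfor %}",
  "123456",
].join("\n");

console.log(JSON.stringify(matchText(sample, 0)));  // ["<!-- if array = [1,2,3,4,5,6] -->\n"]
console.log(matchText(sample, 34));                 // null, offset 34 is "{%"
console.log(JSON.stringify(matchText(sample, 91))); // ["\n123456"]
```

The last call covers the edge case mentioned above: a Text token that is the final token in the input.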

bd82 commented 5 months ago

I'm also thinking perhaps it would be beneficial for Chevrotain to ship an official Mustache Template Syntax lexer/parser. Mustache is the simplest language I'm aware of for this style of templates and it would demonstrate how to work around this problem for all similar languages.

"Official" and "ship" are beyond the scope of the provided examples, as most of those are not production-quality examples...

But a PR with a smaller (more focused) example of a "catch all" token would be positively reviewed if you are interested in contributing it.

knpwrs commented 5 months ago

My custom pattern wound up being problematic, so I used your suggested pattern and it's working well so far.

I'd love to contribute an example, maybe after this project wraps and I gain some confidence in how it's all working together.

Thank you for your help!