Chevrotain / chevrotain

Parser Building Toolkit for JavaScript
https://chevrotain.io
Apache License 2.0

Keywords not being chosen over Identifiers #2029

Closed ccastillo232 closed 1 month ago

ccastillo232 commented 1 month ago

I am seeing an issue where I cannot get the Lexer to recognize a keyword over an identifier. I am working from the example here: https://github.com/chevrotain/chevrotain/blob/master/examples/lexer/keywords_vs_identifiers/keywords_vs_identifiers.js

My test case is this:

it.only('Reproduce chevrotain example', () => {
  const Identifier = createToken({
    name: "Identifier",
    pattern: /[a-zA-z]\w+/,
  });

  const While = createToken({
    name: "While",
    pattern: /while/,
    // longer_alt: Identifier,  // prefer the While token over the Identifier token
  });

  const keywordsVsIdentifiersLexer = new Lexer([While, Identifier], {});
  const tokenResult = keywordsVsIdentifiersLexer.tokenize('textwhiletest');

  expect(tokenResult.errors.length, `Got at least 1 error: ${tokenResult.errors[0]?.message}`).toBe(0);
  expect(tokenResult.tokens.length).toBe(3);
  expect(tokenResult.tokens[0].tokenType.name).toBe(Identifier.name);
  expect(tokenResult.tokens[1].tokenType.name).toBe(While.name);
  expect(tokenResult.tokens[2].tokenType.name).toBe(Identifier.name);
})

This fails because the lexer recognizes only a single Identifier token, whereas I expect the token vector to be ['text','while','test'].

I am using Chevrotain version 10.5.0 so that I can test it with Jest.

Am I missing something?

msujew commented 1 month ago

Hey @ccastillo232,

The regex used, /[a-zA-z]\w+/, is greedy: as long as it doesn't encounter a delimiter (i.e. as long as there are more \w characters to read), it keeps consuming input, up to the end of the text if need be. The While pattern does not match at the start of 'textwhiletest', so the Identifier pattern is tried there and swallows the whole string. Internally, Chevrotain is just using the regex engine of the runtime and behaves just like any regex would on the input.

As such, this behavior is expected. You will either need to limit your regex to be less greedy or use delimiters in your input. Most languages just use whitespace for that ;)
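
To make the two suggestions concrete, here is a rough plain-JavaScript sketch. It does not use Chevrotain at all: naiveTokenize is a hypothetical, heavily simplified stand-in for a lexer that tries each pattern in order at the current offset, and the keyword-aware restrictedIdentifier pattern is an assumption made purely for illustration, not something taken from the Chevrotain docs. (The identifier range is written as A-Z here, on the assumption that A-z in the test case was a typo.)

```javascript
// Simplified model only: try each sticky regex, in order, at the current offset.
// This is NOT Chevrotain's implementation, just the same basic idea.
function naiveTokenize(text, patterns) {
  const tokens = [];
  let offset = 0;
  while (offset < text.length) {
    let matched = false;
    for (const { name, regex } of patterns) {
      regex.lastIndex = offset; // the sticky flag (y) anchors the match here
      const m = regex.exec(text);
      if (m !== null) {
        tokens.push({ name, image: m[0] });
        offset += m[0].length;
        matched = true;
        break; // first pattern that matches wins
      }
    }
    if (!matched) offset += 1; // skip whitespace / unknown characters
  }
  return tokens;
}

const While = { name: "While", regex: /while/y };

// Greedy identifier, as in the test case above.
const greedyIdentifier = { name: "Identifier", regex: /[a-zA-Z]\w+/y };

// Hypothetical "less greedy" identifier: stops right before the text "while".
const restrictedIdentifier = {
  name: "Identifier",
  regex: /(?:(?!while)[a-zA-Z_])+/y,
};

// 1) No delimiters, greedy identifier: one big Identifier token.
const glued = naiveTokenize("textwhiletest", [While, greedyIdentifier]);

// 2) Whitespace delimiters: Identifier, While, Identifier.
const spaced = naiveTokenize("text while test", [While, greedyIdentifier]);

// 3) No delimiters, restricted identifier: Identifier, While, Identifier.
const restricted = naiveTokenize("textwhiletest", [While, restrictedIdentifier]);

console.log(glued.map((t) => t.image));      // [ 'textwhiletest' ]
console.log(spaced.map((t) => t.image));     // [ 'text', 'while', 'test' ]
console.log(restricted.map((t) => t.image)); // [ 'text', 'while', 'test' ]
```

Note the trade-off in the third variant: an identifier can then never contain the keyword text, so something like 'meanwhile' would be split into an Identifier and a While token. That is why delimiters are usually the simpler route.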

ccastillo232 commented 1 month ago

Thank you for addressing this. It does make sense, although it presents me with a problem for my particular use case. I'll have to get creative.

ccastillo232 commented 1 month ago

Closing