Keywords not being chosen over Identifiers

ccastillo232 commented 1 month ago

I am seeing an issue where I cannot get the Lexer to recognize a keyword over an identifier. I am working from the example here: https://github.com/chevrotain/chevrotain/blob/master/examples/lexer/keywords_vs_identifiers/keywords_vs_identifiers.js

My test case is this:

it.only('Reproduce chevrotain example', () => {
  const Identifier = createToken({
    name: "Identifier",
    pattern: /[a-zA-z]\w+/,
  });

  const While = createToken({
    name: "While",
    pattern: /while/,
    // longer_alt: Identifier,  // prefer the While token over the Identifier token
  });

  const keywordsVsIdentifiersLexer = new Lexer([While, Identifier], {});
  const tokenResult = keywordsVsIdentifiersLexer.tokenize('textwhiletest');

  expect(tokenResult.errors.length, `Got at least 1 error: ${tokenResult.errors[0]?.message}`).toBe(0);
  expect(tokenResult.tokens.length).toBe(3);
  expect(tokenResult.tokens[0].tokenType.name).toBe(Identifier.name);
  expect(tokenResult.tokens[1].tokenType.name).toBe(While.name);
  expect(tokenResult.tokens[2].tokenType.name).toBe(Identifier.name);
})

This fails because it recognizes only 1 Identifier, when I expect the token vector to be ['text', 'while', 'test'].

I am using chevrotain version 10.5.0 so that I can test it with Jest.

Am I missing something?

msujew commented 1 month ago

Hey @ccastillo232,

the regex used, /[a-zA-z]\w+/, is eager (greedy): as long as it doesn't encounter a delimiter (i.e. as long as there are more \w characters to read), it keeps consuming input, up to the end of the text if necessary. Internally, Chevrotain just uses the regex engine of the runtime, so the pattern behaves exactly as it would on the same input outside the lexer.
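
A quick way to see this outside of Chevrotain is to run the same pattern through the runtime's RegExp directly. This is only a sketch; the variable names are made up for illustration.

```javascript
// Pattern copied verbatim from the test case in this thread.
const identifierPattern = /[a-zA-z]\w+/;

// No delimiter in the input: \w+ keeps consuming, so "while" is swallowed.
const undelimitedMatch = identifierPattern.exec("textwhiletest")[0];

// Whitespace is not a \w character, so the match stops in front of it.
const delimitedMatch = identifierPattern.exec("text while test")[0];

console.log(undelimitedMatch); // "textwhiletest"
console.log(delimitedMatch); // "text"
```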

As such, this behavior is exactly as expected. You will either need to make your regex less eager or use delimiters in your input. Most languages just use whitespace for that ;)
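
Both options can be sketched without Chevrotain, using a small first-match-wins loop over sticky regexes. Everything here is hypothetical: `tokenize`, `summarize`, and the `temperedIdentifier` pattern are illustrative names rather than Chevrotain APIs, and the loop only approximates what a regex-based lexer does. The identifier patterns use `A-Z`; the `A-z` range in the original test also matches a few punctuation characters.

```javascript
// Minimal first-match-wins tokenizer; unmatched characters (e.g. spaces) are skipped.
function tokenize(text, tokenTypes) {
  const tokens = [];
  let offset = 0;
  while (offset < text.length) {
    let matched = false;
    for (const { name, pattern } of tokenTypes) {
      pattern.lastIndex = offset; // sticky (y) flag: match only at this offset
      const m = pattern.exec(text);
      if (m !== null && m[0].length > 0) {
        tokens.push({ name, image: m[0] });
        offset += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) offset += 1;
  }
  return tokens;
}

const While = { name: "While", pattern: /while/y };
const greedyIdentifier = { name: "Identifier", pattern: /[a-zA-Z]\w+/y };
// Assumption: a "less eager" identifier that refuses to read into the keyword.
const temperedIdentifier = {
  name: "Identifier",
  pattern: /(?!while)[a-zA-Z](?:(?!while)\w)*/y,
};

const summarize = (tokens) => tokens.map((t) => `${t.name}:${t.image}`);

// Original setup: a single Identifier covering the whole input.
const undelimited = summarize(tokenize("textwhiletest", [While, greedyIdentifier]));
// Option 1: delimiters in the input.
const delimited = summarize(tokenize("text while test", [While, greedyIdentifier]));
// Option 2: a less eager identifier pattern.
const tempered = summarize(tokenize("textwhiletest", [While, temperedIdentifier]));

console.log(undelimited); // [ 'Identifier:textwhiletest' ]
console.log(delimited); // [ 'Identifier:text', 'While:while', 'Identifier:test' ]
console.log(tempered); // [ 'Identifier:text', 'While:while', 'Identifier:test' ]
```

The trade-off with the second option is that an identifier can then never contain the keyword: an input such as "meanwhile" would be split into an Identifier followed by a While token.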

ccastillo232 commented 1 month ago

Thank you for addressing this. It does make sense, although it presents me with a problem for my particular use case. I'll have to get creative.

ccastillo232 commented 1 month ago

Closing

Chevrotain / chevrotain

Keywords not being chosen over Identifiers #2029