I'm not sure that this would be the right approach to this issue. Just to abort lexing, you can probably build a lexer like this:
import { createToken, Lexer } from "chevrotain"

const AnyOtherToken = createToken({
  name: "AnyOtherToken",
  // Custom pattern that aborts lexing on any character the other tokens missed
  pattern: () => { throw new Error("unexpected character") },
  // Custom patterns cannot be inspected, so line_breaks must be set explicitly
  line_breaks: false
})
const Digit = createToken({ name: "Digit", pattern: /[0-9]/ })
const Whitespace = createToken({
  name: "Whitespace",
  pattern: /\s+/,
  group: Lexer.SKIPPED
})
const customPatternLexer = new Lexer([Whitespace, Digit, AnyOtherToken])
This will throw an error during lexing if it encounters any character which doesn't match either the Whitespace or Digit token.
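For example, something like this (an untested sketch with a made-up input) should abort on the stray "x", since the error thrown inside the custom pattern propagates out of tokenize:

// Any character other than digits/whitespace triggers AnyOtherToken and aborts
try {
  const result = customPatternLexer.tokenize("12 34 x 56")
  console.log(result.tokens.map((t) => t.image))
} catch (e) {
  console.error("lexing aborted:", e)
}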
Hey, thanks for the reply! My thought is that the solution proposed by the PR attached to this issue would be preferable.
It's not clear to me why having a throwing token would be the better approach, but I'd love to be enlightened!
Alright, sounds reasonable. I'll be looking into it.
Hello @jonestristand
This feature request sounds logical and possible. Can you help me understand your use case? Are you dealing with many (large?) inputs, most of which are invalid, where the current behavior causes time to be wasted on inputs that have already been identified as irrelevant?
@bd82 I've included an example use case in the attached PR (#1839) - but yes, I have very large files (accounting ledgers of several thousand lines) and would prefer not to recover if the input isn't strictly valid.
Hi @jonestristand
@msujew approved and merged your PR.
I will release a new version later this week or over the weekend...
Cheers. Shahar.
re-opening this until a new version is released
Thanks guys, appreciate your considering this change!
released in 10.2.0
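For anyone finding this later, here is a minimal sketch of what disabling recovery looks like. I'm going from memory of the merged PR, so treat the exact recoveryEnabled option name and the stopping behavior as assumptions and double-check the 10.2.0 release notes:

import { createToken, Lexer } from "chevrotain"

const Digit = createToken({ name: "Digit", pattern: /[0-9]/ })
const Whitespace = createToken({
  name: "Whitespace",
  pattern: /\s+/,
  group: Lexer.SKIPPED
})

// Assumed config flag from the merged PR; recovery remains enabled by default
const strictLexer = new Lexer([Whitespace, Digit], { recoveryEnabled: false })

// With recovery disabled, lexing should stop at the first character that
// matches no token instead of skipping ahead and resyncing.
const { tokens, errors } = strictLexer.tokenize("12 x 34")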
It would be useful to be able to disable recovery for the lexer. Currently, when no token matches, it skips input characters until it finds an offset that matches a token again, reports an error, and continues tokenizing. In some applications it would be desirable to stop lexing entirely if no suitable token is found at the current offset.
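To illustrate the default (recovering) behavior described above, here is a quick sketch reusing the Whitespace and Digit tokens from earlier:

// Default behavior: the stray "x" is skipped, a lexing error is recorded,
// and tokenizing continues with the remainder of the input.
const lenientLexer = new Lexer([Whitespace, Digit])
const { tokens, errors } = lenientLexer.tokenize("12 x 34")
// tokens -> Digit "1", Digit "2", Digit "3", Digit "4"
// errors -> one ILexingError describing the unexpected character at offset 3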