Custom Token Patterns too inefficient

dhlolo commented 2 years ago

Using a RegExp as the token pattern seems to be fast, but not when I use a custom_payload function such as: function matchCustomToken(text, startOffset) { return REG.exec(text.substring(startOffset)); }. With that function it takes about 20 s to process 500 lines, and about a minute and a quarter to process 1000 lines.
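For readers comparing approaches: one possible contributor, not confirmed anywhere in this thread, is that the function above builds a new substring on every call, and an unanchored REG may then scan the whole remaining text for a match. A minimal sketch of an alternative, assuming a sticky (`y` flag) regular expression is acceptable, is shown below. `REG_STICKY` and its digit pattern are hypothetical stand-ins for the REG in the post.

```typescript
// Hypothetical pattern standing in for the REG from the post above.
// The "y" (sticky) flag makes exec() match only at exactly lastIndex.
const REG_STICKY = /[0-9]+/y;

// Same (text, startOffset) shape as the function in the post,
// but without creating a substring on each call.
function matchCustomToken(
  text: string,
  startOffset: number
): RegExpExecArray | null {
  // Tell the regexp where to start; a sticky regexp will not scan ahead.
  REG_STICKY.lastIndex = startOffset;
  return REG_STICKY.exec(text);
}

const sample = "abc 12345 def";
// Offset 4 points at the first digit, so this matches.
const hit = matchCustomToken(sample, 4);
// Offset 0 points at "a", so a sticky match fails immediately.
const miss = matchCustomToken(sample, 0);
```

Whether this accounts for the timings reported here is an open question; measuring the lexing phase on its own would be needed to tell.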

bd82 commented 2 years ago

Hello @dhlolo

There are some optimizations that are only performed when no custom tokens are used. However, these should not cause such a large performance difference.

https://github.com/Chevrotain/chevrotain/blob/bd5c2a20d27df3786b8c748f06cadf0658ab2e65/packages/chevrotain/src/scan/lexer_public.ts#L333-L339

The main thing that could affect the performance in this case is that the "starting character optimization" is not applied automatically when a custom token is used. See the documentation below on how to resolve this:

https://chevrotain.io/docs/guide/performance.html#ensuring-lexer-optimizations
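As a rough sketch of what that guide describes, a custom token pattern can declare the characters it may start with. The snippet below is an illustration under assumptions, not code from this thread: `TokenConfigSketch` is a local stand-in type so the example runs without the library, the `IntegerLiteral` token is hypothetical, and the field names `start_chars_hint` and `line_breaks` are assumed to match the token configuration described in the linked guide.

```typescript
// Local stand-in for the subset of token configuration fields used here.
// Assumption: these field names mirror what the linked guide documents.
interface TokenConfigSketch {
  name: string;
  pattern: (text: string, startOffset: number) => RegExpExecArray | null;
  start_chars_hint: string[];
  line_breaks: boolean;
}

// Hypothetical sticky pattern for an integer literal.
const INT_PATTERN = /[0-9]+/y;

const integerTokenConfig: TokenConfigSketch = {
  name: "IntegerLiteral",
  // Custom matcher: match only at startOffset, no substring copy.
  pattern: (text, startOffset) => {
    INT_PATTERN.lastIndex = startOffset;
    return INT_PATTERN.exec(text);
  },
  // Every character this token can begin with: "0" through "9".
  start_chars_hint: "0123456789".split(""),
  // This token never contains a line terminator.
  line_breaks: false,
};

// In a real project this object would be passed to the library's
// token factory instead of being used directly.
const probe = integerTokenConfig.pattern("x42", 1);
```

The hint list must cover every possible first character of the token; an incomplete list could cause valid input to be skipped.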

In general, even without these optimizations, the numbers you posted seem really high.

Are these numbers for the lexing phase only, or for both lexing and parsing?
What are the numbers like if you use a pure RegExp?

bd82 commented 2 years ago

switching to discussion

Chevrotain / chevrotain

Custom Token Patterns too inefficient #1783