elixir-makeup / makeup_erlang

Erlang lexer for Makeup

No need for lexeme #7

Open tmbb opened 5 years ago

tmbb commented 5 years ago

https://github.com/tmbb/makeup_erlang/blob/04751d8ce7dba517dfa9bc6cf6a0076071e8cd41/lib/makeup/lexers/erlang_lexer.ex#L125-L131

The only reason we might need the lexeme is if we need to pattern match on variable names later, for grouping purposes, for example. I believe I do that in the Elixir lexer (I haven't looked at the source in some time).

tmbb commented 5 years ago

There is a way around using the lexeme() combinator. In the Elixir lexer, for example, lexeme() is used in some places so that we can pattern match on binaries later: every lowercase name is parsed as a variable name and later turned into a keyword if the binary matches an entry in the keyword list.
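That postprocess step might look roughly like the sketch below. The module name, the token shape ({ttype, meta, text}), and the keyword list are all hypothetical stand-ins, not the actual makeup code; the point is just that matching only works this directly when text is guaranteed to be a binary.

```elixir
# Hypothetical sketch: upgrade variable-name tokens to keyword tokens
# by pattern matching on their binary text in a later pass.
defmodule KeywordPostprocess do
  # Illustrative subset of Erlang keywords.
  @keywords ["case", "of", "when", "end", "fun", "receive", "after"]

  # Works because `text` is assumed to be a binary here.
  def postprocess({:name, meta, text}) when text in @keywords do
    {:keyword, meta, text}
  end

  def postprocess(token), do: token
end
```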

This doesn't work in a simple way if the text part of the token ({type, meta, text}) is not guaranteed to be a binary. We'd have to keep in the source code a list of the iolists into which the keyword names are parsed. For example, case might be parsed into something like [?c, ?a, ?s, ?e]. This would have to be entered into the source code manually, which is error-prone and boring.

The interesting way of solving this is to parse the keywords at compile time. For example:

@keywords Internal.parse_words_into_iolist(["do", "end", "def", ...])

That would require moving the actual parser into an ElixirLexer.Internal module so that it could be used in the ElixirLexer module. While it does complicate the code a little, I believe the performance benefits would be considerable, because it avoids building many binaries during the lexing. We don't have to believe, though. We can measure it directly. My Schism library makes it particularly easy, and I'll benchmark it as soon as possible.
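A minimal sketch of that idea, with parse_word_into_iolist/1 standing in for running the real combinator-based parser (which would live in the Internal module) on each keyword binary:

```elixir
defmodule ErlangLexer.Internal do
  # Hypothetical stand-in for the real parser: returns the iolist the
  # lexer would produce for a given word, e.g. "case" -> [?c, ?a, ?s, ?e].
  def parse_word_into_iolist(word) when is_binary(word) do
    for <<c <- word>>, do: c
  end

  def parse_words_into_iolists(words), do: Enum.map(words, &parse_word_into_iolist/1)
end

defmodule ErlangLexer do
  # Evaluated at compile time, so runtime code can match tokens against
  # exactly the iolists the lexer emits, without building binaries.
  @keywords ErlangLexer.Internal.parse_words_into_iolists(["case", "of", "end"])

  def keyword?(text), do: text in @keywords
end
```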

tmbb commented 4 years ago

@josevalim and @mracos what do you think of this? Would you be in favor of making the implementation slightly more complex if it resulted in speed gains?

mracos commented 4 years ago

Will take a look on it 😄

tmbb commented 4 years ago

> Will take a look on it 😄

You don't have to. I can do it myself when I have the time. But if you understood what I meant from the description above you can try to implement it.

mracos commented 4 years ago

Actually, I should be more explicit: I will take a look at the problem and see if I can understand 😅

tmbb commented 4 years ago

Basically the problem is that for some tokens you want something you can match on. You can match on iolists just fine (just like you can match on binaries) for keywords, builtins, etc. The problem is that it's hard to see which iolist will be generated by the lexer for a given token. The easiest way to see which iolist will be parsed from a token is to run the lexer on the keyword itself.

Let's say you want to parse the case keyword in Erlang or Elixir. Maybe your lexer returns something like {ttype, meta, [?c, ?a, ?s, ?e]} instead of {ttype, meta, "case"}. But in the source code, you want to write "case" instead of the complex alternative. The way to transform "case" into whatever is returned by the lexer is to "lex" the "case" binary and ignore the ttype and meta parts. That way, you can build the function heads at compile time. Because the function heads need to be built at compile time, you need to separate the actual lexer into another module, so that the public-facing module can call the lexer on the lists of keywords at compile time.
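The compile-time function heads could be built with unquote fragments, roughly as below. Again, lex_text is a hypothetical stand-in for calling the real lexer on a binary and keeping only the text part of the token; the module and keyword list are illustrative only.

```elixir
defmodule ErlangLexer.Keywords do
  # Hypothetical stand-in for: run the real lexer on `word`, drop the
  # ttype and meta parts, keep the text as an iolist.
  lex_text = fn word -> for <<c <- word>>, do: c end

  # One function head per keyword, generated at compile time, matching
  # the iolist form the lexer would emit rather than a binary.
  for word <- ["case", "receive", "after"] do
    def ttype_for(unquote(lex_text.(word))), do: :keyword
  end

  # Anything else stays a plain name.
  def ttype_for(_other), do: :name
end
```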

This might improve performance because you don't need to build binaries while lexing anymore: you can keep everything as an iolist instead of turning it into binaries just to make postprocessing easier. I can't guarantee it will be faster, but it's worth trying. It might even be slower, because matching on iolists might be slower than matching on binaries (I have no idea; I really have to benchmark it).