elixir-makeup / makeup_elixir

Elixir lexer for Makeup
BSD 2-Clause "Simplified" License

Add support for unicode in variable names, atoms and module names #13

Closed tmbb closed 4 years ago

tmbb commented 4 years ago

The naïve way of adding unicode support almost doubles the lexer's compilation time. Depending on the concrete implementation, lexer compilation goes from 20s (which is already a lot) to 32 or 38s.

Name                                             ips        average  deviation         median         99th %
Lexer compilation time - ascii                0.0499        20.05 s     ±0.00%        20.05 s        20.05 s
Lexer compilation time - utf8 modified        0.0304        32.85 s     ±0.00%        32.85 s        32.85 s
Lexer compilation time - utf8                 0.0265        37.76 s     ±0.00%        37.76 s        37.76 s

Comparison:
Lexer compilation time - ascii                0.0499
Lexer compilation time - utf8 modified        0.0304 - 1.64x slower
Lexer compilation time - utf8                 0.0265 - 1.88x slower

Is this an acceptable price for a more correct implementation? It's possible that there exists a more efficient way and I've just not found it.

josevalim commented 4 years ago

Is the compilation issue caused by repeating the same patterns over and over again? NimbleParsec does give the trade-off of increasing runtime by compiling subpatterns into defparsecs. Would that help?

tmbb commented 4 years ago

The "utf8 modified" benchmark is the one with all the tricks to make it faster... Including dividing the enormous character patterns into several parsecs.

josevalim commented 4 years ago

I see :( Honestly, it feels like 40 seconds is a bit too much. Since we can assume mostly valid Elixir code, perhaps it makes more sense to parse everything until we hit something like a dot, parens, space, plus, etc.?

tmbb commented 4 years ago

> I see :( Honestly, it feels like 40 seconds is a bit too much.

It's 95 seconds when I add proper unicode support for atoms. I can't even reuse the same parsecs because they are subtly different (i.e. an atom supports a slightly different set of characters from variables).

> Since we can assume mostly valid Elixir code, perhaps it makes more sense to parse everything until we hit something like a dot, parens, space, plus, etc.?

I don't know, we could just as well assume that most identifiers will have ASCII characters and revert to the previous simpler implementation.

josevalim commented 4 years ago

Yeah, I would either go with ASCII or exclude the other valid characters, as mentioned above. Excluding is fine because Elixir will reject anything else anyway. I think having to wait 2 minutes to publish the docs of your first project is definitely too much. Another idea is to implement regex support, as you mentioned in the forum, and use regexes for this particular task.
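For reference, here is a rough sketch of what the "parse until a terminator" idea could look like with NimbleParsec. This is only an illustration: the module name and the terminator list are made up, and the real lexer would need a more careful set of terminators.

```elixir
defmodule LooseIdentifierSketch do
  import NimbleParsec

  # Illustrative (not exhaustive) set of characters that can never be part of
  # an identifier: whitespace, dots, parens, operators, etc.
  @terminators String.to_charlist(" \t\n.,;:()[]{}+-*/=<>\"'")
  @not_terminator Enum.map(@terminators, &{:not, &1})

  # Consume any run of non-terminator characters instead of enumerating the
  # exact Unicode ranges Elixir allows; Elixir itself rejects invalid code anyway.
  defparsec :loose_identifier,
            utf8_char(@not_terminator)
            |> repeat(utf8_char(@not_terminator))
            |> reduce({List, :to_string, []})
end
```

Because it never enumerates the Unicode ranges, a combinator like this compiles to only a handful of clauses, at the price of accepting some identifiers that Elixir itself would reject.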

tmbb commented 4 years ago

Ok, it turns out that compilation time is ~35s and not 90s (I have no idea why I got that freak result before). I've benchmarked three implementations: ascii only, parsec, and regex.

Benchmarks below (courtesy of my schism library, benchee and the benchee markdown format):

Benchmark

Benchmark run from 2020-09-30 11:07:35.179159Z UTC

System

Benchmark suite executing on the following system:

Operating System Linux
CPU Information Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Number of Available Cores 8
Available Memory 7.87 GB
Elixir Version 1.10.4
Erlang Version 23.0.3

Configuration

Benchmark suite executing with the following configuration:

:time 5 s
:parallel 1
:warmup 2 s

Statistics - Lexer Performance

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|---|---|---|---|---|---|
| Lexer performance - ascii only | 137.10 | 7.29 ms | ±7.37% | 7.23 ms | 9.17 ms |
| Lexer performance - parsec | 132.52 | 7.55 ms | ±8.50% | 7.55 ms | 8.98 ms |
| Lexer performance - regex | 0.0717 | 13949.59 ms | ±0.00% | 13949.59 ms | 13949.59 ms |

Comparison

| Name | IPS | Slower |
|---|---|---|
| Lexer performance - ascii only | 137.10 | |
| Lexer performance - parsec | 132.52 | 1.03x |
| Lexer performance - regex | 0.0717 | 1912.48x |

Statistics - Lexer Compilation

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|---|---|---|---|---|---|
| Lexer compilation time - ascii only | 0.0586 | 17.07 s | ±0.00% | 17.07 s | 17.07 s |
| Lexer compilation time - regex | 0.0549 | 18.21 s | ±0.00% | 18.21 s | 18.21 s |
| Lexer compilation time - parsec | 0.0271 | 36.90 s | ±0.00% | 36.90 s | 36.90 s |

Comparison

| Name | IPS | Slower |
|---|---|---|
| Lexer compilation time - ascii only | 0.0586 | |
| Lexer compilation time - regex | 0.0549 | 1.07x |
| Lexer compilation time - parsec | 0.0271 | 2.16x |

The benchmarks use my schism library to easily recompile the code at runtime according to the version we want.

In the new branch, conditional compilation is used in two places.

Let's dissect the first of these places:

  # Different implementations of the `variable_name` combinator
  schism "parsec vs regex" do
    dogma "parsec" do
      defparsecp(:variable_start_chars, utf8_char(Utf8Utils.variable_start_chars()))
      defparsecp(:variable_continue_chars, utf8_char(Utf8Utils.variable_continue_chars()))

      variable_name =
        parsec(:variable_start_chars)
        |> repeat(parsec(:variable_continue_chars))
        |> optional(utf8_char(Utf8Utils.variable_ending_chars()))
    end

    heresy "regex" do
      defregexparsec(
        :variable_name,
        # Not 100% correct, but a pretty good approximation
        ~R/[_\p{Ll}\p{Lm}\p{Nl}][_\p{L}\p{Mn}\p{Mc}\p{Nd}\p{Pc}]*[\?!]?/
      )

      variable_name = parsec(:variable_name)
    end

    heresy "ascii only" do
      variable_name =
        ascii_string([?a..?z, ?_], 1)
        |> optional(ascii_string([?a..?z, ?_, ?0..?9, ?A..?Z], min: 1))
        |> optional(ascii_string([??, ?!], 1))
    end
  end

In the "parsec" dogma (the default version, that's why it's the dogma), we use the Utf8Utils.variable_start_chars/1 and Utf8Utils.variable_continue_chars/1 to return the exact set of characters supported by Elixir (these functions return a static list of characters; they are generated by a script that queries the unicode_set package but they don't actually depend on this package at runtime). They are encapsulated in a parsec so that NimbleParsec doesn't waste any time trying to optimize them. The combinator is simply a literal conversion of the BNF grammar in the Elixir syntax description.

In the "regex" heresy (a heresy means that it will only be compiled if we set the appropriate configuration option - in schism parlance we convert/1 to a belief). It uses my custom defregexparsec macro to generate a NimbleParsec-compatible function that behaves like a parsec. Is you pay attention you'll see that the character classes are not exactly the same as the one Elixir supports, but it«s close enough for something like Makeup. I believe the defregexparsec is as optimized as possible without writing my own regex engine, but still, it turns out that the erlang :re module is quite slow, which is understandable, as it's not optimized for the kind of context-free-parsing with limited lookahead and no lookbehind that NimbleParsec excels at.

In the "ascii only" I implement the current version of the lexer, which supports only a very limited set of characters. It's fast in terms of both compilation and lexer performance.

It's quite interesting that although the "parsec" version takes a long time to compile, it's not actually any slower than the version that only recognizes ASCII characters. This probably means that the BEAM pattern matcher is very smart and very fast, and matching a huge number of patterns takes no longer than matching a very limited set. It also means that NimbleParsec is very good at leveraging the BEAM's pattern matcher to make code run fast.

I've tried all kinds of tricks to reduce compilation time for the "parsec" version, but it's not been easy... It looks like the compilation time is proportional to the number of function heads. I'll try to write a custom parsec that defers some of these tests to runtime, but it won't be easy and we might find it has important performance penalties.

Conclusion

The "regex" parser's performance is unacceptable. Unless I can make it frastically faster it's not usable for makeup.

The "parsec" parser is fast, but if we decide that 35s for compilation is too much, it's stil not usable and we must keep the current version.

josevalim commented 4 years ago

This code takes 12 seconds to compile for me:

  defparsecp(:variable_start_chars, utf8_char(Utf8Utils.variable_start_chars()))
  defparsecp(:variable_continue_chars, utf8_char(Utf8Utils.variable_continue_chars()))

  defparsec :variable,
    parsec(:variable_start_chars)
    |> repeat(parsec(:variable_continue_chars))
    |> optional(utf8_char(Utf8Utils.variable_ending_chars()))

  defparsecp(:atom_start_chars, utf8_char(Utf8Utils.atom_start_chars()))
  defparsecp(:atom_continue_chars, utf8_char(Utf8Utils.atom_continue_chars()))

  defparsec :atom,
    parsec(:atom_start_chars)
    |> repeat(parsec(:atom_continue_chars))
    |> optional(utf8_char(Utf8Utils.atom_ending_chars()))

I have an idea. What if instead of using parsec to build these large utf8 combinators, we use repeat_while to see if we should continue parsing? It would be something like this:

repeat_while(utf8_char([]), {:atom_continue, []})

defp atom_continue(rest, context, _, _) do
  if Utf8Utils.atom_continue?(rest), do: {:cont, context}, else: {:halt, context}
end

And Utf8Utils.atom_continue? would compile to a bunch of clauses like this:

def atom_continue?(<<?a::utf8, _::binary>>), do: true
def atom_continue?(<<?b::utf8, _::binary>>), do: true
...
def atom_continue?(_), do: false

Which should be faster to compile. We can use the repeat_while approach for everything except variable_start. Thoughts?
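A minimal, self-contained version of this idea might look like the sketch below. The module name, the tiny guard, and the example input are illustrative only; in the real lexer the clauses of the predicate would be generated from the Unicode data.

```elixir
defmodule AtomSketch do
  import NimbleParsec

  # Stand-in for the generated clauses: decide, by looking at the next
  # character of the remaining input, whether it still belongs to the atom.
  defp atom_continue(<<c::utf8, _::binary>>, context, _line, _offset)
       when c in ?a..?z or c in ?A..?Z or c in ?0..?9 or c == ?_ or c == ?@,
       do: {:cont, context}

  defp atom_continue(_rest, context, _line, _offset), do: {:halt, context}

  # Consume one valid starting character, then keep consuming any utf8 char
  # while the predicate above says the next character is still part of the atom.
  defparsec :atom_body,
            utf8_char([?a..?z, ?_])
            |> repeat_while(utf8_char([]), {:atom_continue, []})
            |> reduce({List, :to_string, []})
end

# AtomSketch.atom_body("foo_bar! == 1") consumes "foo_bar" and leaves "! == 1" as the rest.
```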

josevalim commented 4 years ago

I have also noticed sigils are expensive to compile, partly due to generating a clause for each possible sigil delimiter. Should we instead generate sigils like this:

start = choice(["\"", "(", ...])
stop = choice(["\"", ")", ...])

And then use something like post_traverse to actually verify that they match?
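The verification step could be as simple as a lookup from the opening delimiter to the delimiter that must close it; a hypothetical helper (not the actual implementation) could look like this:

```elixir
defmodule SigilDelimiterSketch do
  # Paired delimiters map to their counterpart; all others (", ', |, /, ...) close themselves.
  @sigil_pairs %{?( => ?), ?[ => ?], ?{ => ?}, ?< => ?>}

  def closing_delimiter(char), do: Map.get(@sigil_pairs, char, char)
end

# SigilDelimiterSketch.closing_delimiter(?()  #=> ?)
# SigilDelimiterSketch.closing_delimiter(?/)  #=> ?/
```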

tmbb commented 4 years ago

> I have an idea. What if instead of using parsec to build these large utf8 combinators, we use repeat_while to see if we should continue parsing? It would be something like this:

I'm exploring that as well. But I don't get why I'm getting such randomly different compile times in different test runs... Sometimes I get compile times around 17s (in the same ballpark as what you get), and sometimes as high as 80s. What are the specs of your machine?

tmbb commented 4 years ago

> And then use something like post_traverse to actually verify that they match?

I've thought of that already, but I haven't tried it yet.

tmbb commented 4 years ago

> And Utf8Utils.atom_continue? would compile to a bunch of clauses like this:

> def atom_continue?(<<?a::utf8, _::binary>>), do: true
> def atom_continue?(<<?b::utf8, _::binary>>), do: true
> ...
> def atom_continue?(_), do: false

> Which should be faster to compile. We can use the repeat_while approach for everything except variable_start. Thoughts?

I'm exploring ways of matching the character in "user code" using a binary search tree instead of multiple function heads. It's probably faster to compile and maybe it won't be much slower.
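To make the idea concrete, a runtime version of that binary search over sorted codepoint ranges could look like the sketch below (names and the range encoding are assumptions, not the eventual implementation):

```elixir
defmodule RangeSearchSketch do
  # `ranges` is a tuple of sorted, non-overlapping {first, last} codepoint pairs,
  # e.g. {{?0, ?9}, {?A, ?Z}, {?a, ?z}}.
  def member?(ranges, codepoint), do: search(ranges, codepoint, 0, tuple_size(ranges) - 1)

  defp search(_ranges, _cp, low, high) when low > high, do: false

  defp search(ranges, cp, low, high) do
    mid = div(low + high, 2)
    {first, last} = elem(ranges, mid)

    cond do
      cp < first -> search(ranges, cp, low, mid - 1)
      cp > last -> search(ranges, cp, mid + 1, high)
      true -> true
    end
  end
end

# RangeSearchSketch.member?({{?0, ?9}, {?A, ?Z}, {?a, ?z}}, ?q) #=> true
```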

josevalim commented 4 years ago

> I'm exploring ways of matching the character in "user code" using a binary search tree instead of multiple function heads. It's probably faster to compile and maybe it won't be much slower.

That's what the Erlang VM does and it does it really well. So I would just go with the multiple clauses (or a large case which will be considerably faster to compile).

EDIT:

def atom_continue?(binary) do
  case binary do
    <<?a::utf8, _::binary>> -> true
    <<?b::utf8, _::binary>> -> true
    _ -> false
  end
end
tmbb commented 4 years ago

Are you working on my new branch? The one that uses Schism? Could you please run the following command: mix run benchee/schism.exs and post the console output here? One of the benchmarks recompiles the lexer file and measures it.

So that we are measuring the compilation time the same way. Maybe you're just using a faster machine, or maybe it's the fact that I'm running Elixir on WSL (Windows Subsystem for Linux).

josevalim commented 4 years ago

Benchmark

Benchmark run from 2020-10-01 14:32:36.727146Z UTC

System

Benchmark suite executing on the following system:

Operating System macOS
CPU Information Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
Number of Available Cores 8
Available Memory 16 GB
Elixir Version 1.12.0-dev
Erlang Version 22.3.4.1

Configuration

Benchmark suite executing with the following configuration:

:time 5 s
:parallel 1
:warmup 2 s

Statistics

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|---|---|---|---|---|---|
| Lexer compilation time - ascii only | 0.0749 | 13.35 s | ±0.00% | 13.35 s | 13.35 s |
| Lexer compilation time - regex | 0.0683 | 14.64 s | ±0.00% | 14.64 s | 14.64 s |
| Lexer compilation time - parsec | 0.0346 | 28.87 s | ±0.00% | 28.87 s | 28.87 s |

Comparison

| Name | IPS | Slower |
|---|---|---|
| Lexer compilation time - ascii only | 0.0749 | |
| Lexer compilation time - regex | 0.0683 | 1.1x |
| Lexer compilation time - parsec | 0.0346 | 2.16x |

tmbb commented 4 years ago

That's still a lot slower than 12s and more similar to my times. Did you have many other programs running at the same time? I've found that it makes compilation slower on my laptop.

josevalim commented 4 years ago

@tmbb oh, to be clear, I meant to say that only the code snippet I posted above takes 12 seconds and that is 40% of the total time. So if we can take 7 seconds out of it, then that will be a major improvement altogether. :)

josevalim commented 4 years ago

@tmbb so I decided to try my approach out and once I transformed utf8_util into a bunch of conditionals, it took 10s to compile while the lexer took about 15 seconds. That's when I realized we can speed things up simply by splitting the code into multiple modules so they compile in parallel!

I also noticed that your generation script can be written with nimble_parsec and we can bypass the intermediate module altogether by using nimble_parsec.compile. I pushed all of this work here:

https://github.com/elixir-makeup/makeup_elixir/compare/utf8-identifier-benchmarks...jv-speed-up-unicode?expand=1

On my machine it takes about 16 seconds (down from 27 on your unicode branch), and that will hold true if you have 2+ cores. It depends on NimbleParsec master, which I pushed in a rush. I have to go back there, add some tests, and do a new release.

If we improve the sigil handling, we may make this go under 10 seconds. Alternatively we can move the sigil handling to a separate module too, using the features from NimbleParsec master.

tmbb commented 4 years ago

> @tmbb so I decided to try my approach out and once I transformed utf8_util into a bunch of conditionals, it took 10s to compile while the lexer took about 15 seconds.

I confirm that it's faster that way.

> I also noticed that your generation script can be written with nimble_parsec and we can bypass the intermediate module altogether by using nimble_parsec.compile.

This is very cool. I had totally forgotten about nimble_parsec.compile. It's probably the way to go.

> Alternatively we can move the sigil handling to a separate module too, using the features from NimbleParsec master.

Why do you need NimbleParsec master?

EDIT: there's a mistake in the atom unicode patterns which I'll fix. Don't merge your branch right now.

josevalim commented 4 years ago

You can only call parsec for functions defined in the same module. Master allows public combinators and remote combinators.
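For illustration, the public/remote combinator feature could be used roughly like this; the module and combinator names are made up, and this assumes the `defcombinator` plus `parsec({module, function})` API from NimbleParsec master (1.1):

```elixir
defmodule AtomsSketch do
  import NimbleParsec

  # defcombinator (as opposed to defcombinatorp) exposes the combinator so
  # that other modules can reference it.
  defcombinator :atom_name,
                ascii_char([?a..?z, ?_])
                |> repeat(ascii_char([?a..?z, ?A..?Z, ?0..?9, ?_]))
end

defmodule LexerSketch do
  import NimbleParsec

  # The remote reference only matters at runtime, so this module and
  # AtomsSketch can be compiled in parallel.
  defparsec :atom,
            ascii_char([?:])
            |> parsec({AtomsSketch, :atom_name})
            |> reduce({List, :to_string, []})
end
```

Because the remote reference is resolved at runtime, two modules can even reference each other's combinators, which is what the Sigils/ElixirLexer split described further down relies on.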

josevalim commented 4 years ago

@tmbb please do any improvements you want and merge my branch at your discretion!

josevalim commented 4 years ago

PR sent here: #15. Feel free to do your changes and merge. :)

tmbb commented 4 years ago

Latest benchmarks

For these benchmarks I've made a number of changes (sorry for mixing them all in the same commit, but I just went along with the changes and only noticed when I was done). The main change under comparison is whether the sigils are compiled inside the main ElixirLexer module ("internal") or in their own module ("external").

System

Benchmark suite executing on the following system:

Operating System Linux
CPU Information Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Number of Available Cores 8
Available Memory 7.87 GB
Elixir Version 1.10.4
Erlang Version 23.0.3

Configuration

Benchmark suite executing with the following configuration:

:time 5 s
:parallel 1
:warmup 2 s

Lexer Performance

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|---|---|---|---|---|---|
| Lexer performance - external | 146.39 | 6.83 ms | ±6.48% | 6.82 ms | 8.26 ms |
| Lexer performance - internal | 141.31 | 7.08 ms | ±4.55% | 7.02 ms | 8.39 ms |

Comparison

| Name | IPS | Slower |
|---|---|---|
| Lexer performance - external | 146.39 | |
| Lexer performance - internal | 141.31 | 1.04x |

Project Compilation (all files, not only the lexer)

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|---|---|---|---|---|---|
| Project compilation time - external | 0.159 | 6.28 s | ±0.00% | 6.28 s | 6.28 s |
| Project compilation time - internal | 0.0738 | 13.55 s | ±0.00% | 13.55 s | 13.55 s |

Comparison

| Name | IPS | Slower |
|---|---|---|
| Project compilation time - external | 0.159 | |
| Project compilation time - internal | 0.0738 | 2.16x |

Discussion

In the "internal" case the sigils are compiled in the ElixirLexer module. In the "external" case they are compiled in a different module, which allows the parallel compiler to distribute the compilation work between several cores. This leads to a very large peformance impact during compilation (compilation time is reduced by 50%)

There is no performance penalty for splitting the sigils into another module (I start to think that for complex parsers there isn't such a great performance advantage of using combinators instead of parsecs).

Splitting the sigils comes with some code organization challenges because, due to interpolation, a sigil can contain pretty much any element from the lexer. To handle that I had to create some public combinators in the ElixirLexer module which are invoked inside the Sigils module. At the same time, the ElixirLexer invokes a combinator from the Sigils module. Because these dependencies only matter at runtime and not at compile time, this is not such a big deal, but it's inelegant. Despite the lack of elegance, the performance impact is so big that I chose to keep it.

I didn't split the Atom and Variable modules further because it didn't result in further compilation speed improvements (the relevant schisms have been deleted, otherwise the source would have become really confusing).

These benchmarks were done with Schism but for the final release and before merging into master I will remove the dependency on Schism. Sadly I haven't found a way of making Schism a :dev-only dependency.

EDIT: just to give credit where it's due for those that might be following, the big performance improvements all come from @josevalim's idea of splitting the code into several modules to leverage the Elixir parallel compiler.

josevalim commented 4 years ago

Awesome @tmbb! After my PR #14 was merged, I don't think it is worth splitting the sigils because they compile very quickly, so I would keep atoms, variables, and the main lexer as the modules.

Given that the slowest of them is the variables module, which takes 7 seconds, it will be the one driving the compilation time. So unless we break the variables module apart into two modules (and the atoms module into two as well), there is no benefit to breaking up the main lexer, as compilation will still take 7 seconds.

So what do you think about merging #15 and then pushing your improvements on top?

> (I start to think that for complex parsers there isn't such a great performance advantage of using combinators instead of parsecs).

There is a benefit, which is less memory allocation, but unlikely to matter unless we are handling dozens of MBs per tokenization, which doesn't matter for makeup.

tmbb commented 4 years ago

> Awesome @tmbb! After my PR #14 was merged, I don't think it is worth splitting the sigils because they compile very quickly, so I would keep atoms, variables, and the main lexer as the modules.

I don't get it. I'm just showing you a benchmark that compares compilation time between splitting the sigils out and keeping them in the main lexer module, and I get performance benefits on my machine.

I don't think you can compare the 7s on your machine to the 7s on my machine, right?

If you give me some time I can set up some Schism benchmarks comparing splitting the Variable and Atom modules, but I don't see why you think splitting the sigils has no effect.

josevalim commented 4 years ago

> I don't get it. I'm just showing you a benchmark that compares compilation time between splitting the sigils out and keeping them in the main lexer module, and I get performance benefits on my machine.

I am saying it because I tried this approach on my machine locally and it didn't yield any results. At least for me, the variables module takes the longest to compile, so optimizing any other module will not make the overall compilation faster, because everyone ends up waiting on the variables module anyway. :)

In any case, what do you think about merging the current PR, and we do further benchmarks and improvements on top of it? Unless you think the current version is not acceptable and we need to improve it before merge. I will also release a new nimble_parsec. WDYT?

tmbb commented 4 years ago

> what do you think about merging the current PR, and we do further benchmarks and improvements on top of it?

Are you talking about merging the version with the split sigils or the one with the sigils in the main lexer module?

josevalim commented 4 years ago

@tmbb I am talking about PR #15 :)

josevalim commented 4 years ago

Btw, do you have a branch with your sigils schism?

tmbb commented 4 years ago

> Btw, do you have a branch with your sigils schism?

Yes, this one: https://github.com/elixir-makeup/makeup_elixir/tree/origin/jv-speed-up-unicode

Sorry for the weird name, it was a mistake when setting the upstream branch and I didn't want to mess with the history too much. You can reproduce the benchmarks using that branch.

josevalim commented 4 years ago

Cool! Since it is built on top of #15, I will go ahead and merge #15 and we can continue exploring in other branches. :D

josevalim commented 4 years ago

Here are some tests that I did.

$ time mix run lib/makeup/lexers/elixir_lexer.ex takes 4.8 seconds. If I comment out the sigils line, it takes 2.9 seconds. So that's a good improvement, but it doesn't speed up compilation because variables.ex takes 7 seconds. If I break variables apart, then splitting the sigils out helps, because "variable_continue" takes 4 seconds, but this will only be visible on machines with 4+ cores.

One thing to note is that you are running on Erlang 23.0 and it had some regressions in compilation times. Those have been fixed in 23.1. In particular, trying out the OTP master branch (with the JIT), the variables module compiles in 4.5 seconds, elixir_lexer.ex compiles in 4 seconds, and the whole project takes 4.5 seconds.

With all this new data, my honest opinion is to ship as is. I don't think it is necessary to split either variables.ex or sigils apart. But I will be glad to go with whatever you prefer. :) I will release a new nimble_parsec so we can ship a new makeup_elixir release anytime you want.

tmbb commented 4 years ago

I find it very strange that splitting the sigils doesn't help compilation performance on your machine. On my machine it's the only change that actually speeds up compilation. Splitting the Variables and Atom modules hasn't actually sped up anything. I've set up a Schism benchmark that tries all 4 possible combinations (sigils in the lexer module or in their own module, and identifiers - variables and atoms - split into 2 or 4 modules). I have to add the possibility of testing all possible schism/belief combinations, because the code that generates all the combinations is not exactly trivial.

I think the performance differences we're seeing are related to the different OTP versions; I can't think of any other probable cause. Clearly in my setup the ElixirLexer is the bottleneck (and thus splitting the sigils into their own module makes a big difference), while in yours it's the Variable module (and that's why splitting the sigils doesn't lead to a performance increase in your case). I have to set up Erlang 23.1 and try it for myself.

Meanwhile, I think you can ship as is. The complexity of splitting the sigils is probably not worth it if it doesn't lead to a consistent performance improvement. I'll continue to try to find ways of optimizing it further (and in the meantime update my Schism library with some utilities I've found useful during our exploration).

Benchmarks here:

Benchmark

Benchmark run from 2020-10-02 11:56:33.665866Z UTC

System

Benchmark suite executing on the following system:

Operating System Linux
CPU Information Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Number of Available Cores 8
Available Memory 7.87 GB
Elixir Version 1.10.4
Erlang Version 23.0.3

Configuration

Benchmark suite executing with the following configuration:

:time 5 s
:parallel 1
:warmup 2 s

Statistics

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|---|---|---|---|---|---|
| Project compilation time %{"sigils" => "external", "variables and atoms" => "together"} | 0.149 | 6.69 s | ±0.00% | 6.69 s | 6.69 s |
| Project compilation time %{"sigils" => "external", "variables and atoms" => "split"} | 0.147 | 6.80 s | ±0.00% | 6.80 s | 6.80 s |
| Project compilation time %{"sigils" => "internal", "variables and atoms" => "together"} | 0.0685 | 14.60 s | ±0.00% | 14.60 s | 14.60 s |
| Project compilation time %{"sigils" => "internal", "variables and atoms" => "split"} | 0.0641 | 15.60 s | ±0.00% | 15.60 s | 15.60 s |

Comparison

| Name | IPS | Slower |
|---|---|---|
| Project compilation time %{"sigils" => "external", "variables and atoms" => "together"} | 0.149 | |
| Project compilation time %{"sigils" => "external", "variables and atoms" => "split"} | 0.147 | 1.02x |
| Project compilation time %{"sigils" => "internal", "variables and atoms" => "together"} | 0.0685 | 2.18x |
| Project compilation time %{"sigils" => "internal", "variables and atoms" => "split"} | 0.0641 | 2.33x |

josevalim commented 4 years ago

> Meanwhile, I think you can ship as is.

Do you want me to ship it? I think you should ship it. :)

josevalim commented 4 years ago

FWIW, I just tried 23.0.1 and you are correct that elixir_lexer takes more time there (8s), so the benefits of removing the sigils are a bit clearer, but variables.ex still takes longer (8.5s). Given everything is much better anyway on the next OTP, I would keep it as is. :)

tmbb commented 4 years ago

Now that we depend on NimbleParsec 1.1, should we depend on it directly, or release a new makeup version that depends on NimbleParsec 1.1 and then depend on that new makeup version? I think we should depend directly on NimbleParsec 1.1 (as we're doing now).

josevalim commented 4 years ago

I don't think we need a new makeup release, as it is ok for us to impose a stronger requirement than our parent (i.e. makeup). :)

tmbb commented 4 years ago

While adding tests for unicode identifiers I found a bug in makeup. Fortunately, the bug is undone when converting tokens into HTML. The bug leads to an illegal unicode string in tokens that use the lexeme combinator. This illegal unicode string is converted into a legal one when outputting HTML. The bug was caught by my tests, though, because it causes an error when I try to convert the token values into strings for better visual inspection.

So, I've released a new version of Makeup with the bug fixed and published this MakeupElixir version on Hex.