Add isascii to normalize_identifier for 25% reduction in parse time

ndinsmore commented 1 year ago

I was trying to use JuliaSyntax is some string benchmarking and noticed a lot of calls to utf8_normalization.

In master isascii is now blazing fast, so I thought that checking if a string was ascii before any UTF8 bs would speed things up. Julia code is mostly ascii

I benchmarked by parsing base/abstractarray.jl which is the largest file in the codebase.

Before pr:

julia> @benchmark t= JuliaSyntax.parseall(JuliaSyntax.SyntaxNode,$f)
BenchmarkTools.Trial: 291 samples with 1 evaluation.
 Range (min … max):  15.019 ms … 25.913 ms  ┊ GC (min … max): 0.00% … 36.31%
 Time  (median):     16.389 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.184 ms ±  1.963 ms  ┊ GC (mean ± σ):  4.12% ±  7.50%

       ▂▂▂█▂
  ▃▅▆▇▇█████▇▆▅█▄▃▄▁▄▃▄▁▁▂▃▃▁▁▃▁▆▃▃▃▄▁▄▂▃▃▂▃▃▁▁▃▃▁▂▁▁▂▁▁▁▂▁▁▂ ▃
  15 ms           Histogram: frequency by time        23.9 ms <

 Memory estimate: 7.97 MiB, allocs estimate: 171744.

After PR

julia> @benchmark t= JuliaSyntax.parseall(JuliaSyntax.SyntaxNode,$f)
BenchmarkTools.Trial: 373 samples with 1 evaluation.
 Range (min … max):  11.547 ms … 32.907 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.722 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.422 ms ±  2.064 ms  ┊ GC (mean ± σ):  3.20% ± 6.65%

          ▇█▃
  ▂▂▂▃▅█▇████▆▄▄▄▃▃▄▃▃▂▂▂▃▂▃▃▃▂▄▃▄▂▂▃▂▃▂▃▃▁▁▁▂▂▂▂▁▁▁▁▁▁▁▁▂▁▂▂ ▃
  11.5 ms         Histogram: frequency by time        18.6 ms <

 Memory estimate: 6.48 MiB, allocs estimate: 97785.

c42f commented 1 year ago

Nice!! I assume you refer to JuliaLang/julia#48568

Does this also help with the older isascii in Julia 1.9?

ndinsmore commented 1 year ago

The old isascii could only do about 2 bytes a cycle in <1.10, The new version will do 20+bytes/cycle. The UTF8 code was slow so I am assuming the old version should still offer a good benefit.

KristofferC commented 1 year ago

The old isascii could only do about 2 bytes a cycle in <1.10, The new version will do 20+bytes/cycle.

Most identifiers are quite short though, and the chunk size in that PR is 1024, so I am not sure you will get out the full speed from it on your average Julia identifier. Would be interesting to benchmark.

ndinsmore commented 1 year ago

I had meant to write on larger strings, none the less it does still speed up things by ~25%. We could likely craft a isascii that is tailored for smaller strings.

I should note that the benchmarking was also done with https://github.com/JuliaLang/julia/pull/48887 applied. But that shouldn't effect the back to back.

The thing is that when I profiled this utf8proc_decompose was taking up the majority of the TreeNode processing time, now it doesn't even show up in the profile.

KristofferC commented 1 year ago

I totally agree that this thing is the right thing to do. I am just not sure the performance win is due to the improvements of isascii on master. There might be the same win in 1.9 for example.

c42f commented 1 year ago

I agree this probably works and is beneficial on 1.9. I was just hoping you'd go ahead and do the benchmark for me ;-)

I ask because JuliaSyntax isn't just for Julia 1.10+. It's tested and works all the way back to julia 1.0 as needed to support use in the VSCode tooling.

Anyway, I checked and it's also a big improvement on 1.9:

julia> f = read(joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "base", "abstractarray.jl"), String);

# before
julia> @benchmark JuliaSyntax.parseall(JuliaSyntax.SyntaxNode,$f)
BenchmarkTools.Trial: 360 samples with 1 evaluation.
 Range (min … max):  12.632 ms … 22.863 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.998 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.906 ms ±  1.894 ms  ┊ GC (mean ± σ):  5.84% ± 9.83%

  ▅██▇▅▂   ▂                                                   
  ██████▇█▇███▄▄▁▁▁▄▁▁▁▁▁▁▄▁▁▁▁▁▄▇▇▇▇▇▇▇█▆▆▄▆▇█▆▁▆█▁▁▆▆▄▄▁▁▁▄ ▇
  12.6 ms      Histogram: log(frequency) by time      19.6 ms <

 Memory estimate: 8.03 MiB, allocs estimate: 172102.

# after
julia> @benchmark JuliaSyntax.parseall(JuliaSyntax.SyntaxNode,$f)
BenchmarkTools.Trial: 496 samples with 1 evaluation.
 Range (min … max):   9.157 ms … 15.936 ms  ┊ GC (min … max): 0.00% … 35.99%
 Time  (median):      9.444 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.075 ms ±  1.461 ms  ┊ GC (mean ± σ):  5.73% ± 10.30%

  ▄▇█▇▅▃▁                                                      
  ███████▇▇█▅▇▅▅▇▅▄▁▁▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▆█▅▇██▆▇▇▁▇▆▁▇▇▆▄▆▄▅▄ ▇
  9.16 ms      Histogram: log(frequency) by time      14.4 ms <

 Memory estimate: 6.53 MiB, allocs estimate: 98538.

codecov[bot] commented 1 year ago

Codecov Report

Merging #228 (320738a) into main (56c33b2) will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #228   +/-   ##
=======================================
  Coverage   96.31%   96.31%           
=======================================
  Files          15       15           
  Lines        3913     3913           
=======================================
  Hits         3769     3769           
  Misses        144      144

Impacted Files	Coverage Δ
src/literal_parsing.jl	`97.87% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

PallHaraldsson commented 1 year ago

Do you know if the same improvement (automatically) applies in the default Julia parser? Can (and should) it otherwise be added there easily, until yours becomes the default, and maybe even be backported?

c42f commented 1 year ago

Julia's reference parser already uses an equivalent optimization internally (see the C code in - fl_accum_julia_symbol)

JuliaLang / JuliaSyntax.jl

Add isascii to normalize_identifier for 25% reduction in parse time #228

Codecov Report