JuliaLang / Tokenize.jl

Tokenization for Julia source code

Use Base.hash #192

Closed: pfitzseb closed this 2 years ago

pfitzseb commented 2 years ago

This improves performance a bit when tokenizing Base and reduces code complexity:

Tokenize on master [$!+] 
λ julia --project=. test/profile.jl
First run took 0.51364849 seconds with 8.73827 MB allocated
Tokenized 1556 files in 0.8023270789999989 seconds with 923.379736 MB allocated

Tokenize on sp/better-hash [$] 
λ julia --project=. test/profile.jl      
First run took 0.485027563 seconds with 4.93732 MB allocated
Tokenized 1556 files in 0.6423206330000001 seconds with 923.379736 MB allocated
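For context, here is a minimal sketch of the idea as I read the PR title; the names str_hash, KW_HASH, and lookup_keyword and the keyword list are illustrative, not the actual lexer code. The identifier's characters are folded through Base.hash as they are read, and the result is looked up in a table of precomputed keyword hashes:

# Hypothetical sketch of Base.hash-based keyword lookup; the names and
# keyword list here are illustrative, not the actual lexer code.
const KEYWORDS = ("begin", "end", "function", "if", "else", "while")

# Fold Base.hash over a string's characters, mirroring what a lexer can
# do incrementally while it reads an identifier.
str_hash(s::AbstractString) = foldl((h, c) -> hash(c, h), s; init = zero(UInt))

const KW_HASH = Dict(str_hash(kw) => kw for kw in KEYWORDS)

function lookup_keyword(ident::AbstractString)
    h = zero(UInt)
    for c in ident
        h = hash(c, h)        # one Base.hash call per character
    end
    get(KW_HASH, h, nothing)  # nothing means: plain identifier
end

lookup_keyword("function")    # "function"
lookup_keyword("foo")         # nothing
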
codecov[bot] commented 2 years ago

Codecov Report

Merging #192 (65f8b74) into master (860bdc6) will increase coverage by 0.12%. The diff coverage is 96.42%.

@@            Coverage Diff             @@
##           master     #192      +/-   ##
==========================================
+ Coverage   82.80%   82.93%   +0.12%     
==========================================
  Files           4        4              
  Lines         826      832       +6     
==========================================
+ Hits          684      690       +6     
  Misses        142      142              
Impacted Files   Coverage Δ
src/lexer.jl     93.80% <96.42%> (+0.05%) ↑

KristofferC commented 2 years ago

Any idea where all the allocations are from?

pfitzseb commented 2 years ago

--track-allocation attributes 3.7 MB to https://github.com/JuliaLang/Tokenize.jl/blob/bed7f32b8abee424d060358cd361a1c0a9d95076/src/lexer.jl#L853 and 1 MB to https://github.com/JuliaLang/Tokenize.jl/blob/bed7f32b8abee424d060358cd361a1c0a9d95076/src/lexer.jl#L100

Still looking for the rest...
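
For anyone reproducing this kind of measurement, a hedged sketch of the workflow (the file choice and warm-up are assumptions, not the contents of test/profile.jl):

# Run as: julia --track-allocation=user --project=. trackalloc.jl
# Per-line allocation counts end up in *.jl.mem files next to the sources.
using Tokenize, Profile

src = read("src/lexer.jl", String)  # any sizable Julia file works
collect(tokenize(src))              # warm up so compilation isn't counted
Profile.clear_malloc_data()         # reset the counters after warm-up
collect(tokenize(src))              # the run whose allocations are recorded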

KristofferC commented 2 years ago

https://github.com/JuliaLang/Tokenize.jl/blob/bed7f32b8abee424d060358cd361a1c0a9d95076/src/lexer.jl#L853

looks kinda crazy to me. Why would we need to copy the whole lexer?

pfitzseb commented 2 years ago

Because interpolation resets everything?

Anyways, the remaining allocs were due to me reading the whole file into a string while profiling; without that, I get

λ julia --project=. test/profile.jl
First run took 0.561283966 seconds with 73.583024 MB allocated
Tokenized 1556 files in 3.632539304999996 seconds with 2.368488 MB allocated
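In other words, the big number came from the measurement harness rather than the lexer. A hedged illustration of the difference (assuming tokenize accepts both strings and IO, as the Lexer constructors do):

using Tokenize

# Reading the file into a String allocates the whole contents up front:
toks_str = collect(tokenize(read("src/lexer.jl", String)))
# Handing tokenize an open IO stream avoids that intermediate String:
toks_io = open(io -> collect(tokenize(io)), "src/lexer.jl")
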
pfitzseb commented 2 years ago

[profile screenshot] That's what the profile looks like. So yes, allocating the new IOBuffer() in copy looks expensive, but that's mostly because the GC happens to run at that point.

KristofferC commented 2 years ago

Because interpolation resets everything?

Hmm, well, not everything? You are still reading characters into the same char buffer, so copying that one seems unnecessary, no?

pfitzseb commented 2 years ago

Right, but crucially we can't reuse l.charstore. The latest commit removes the copy; it is faster in the RawToken case but slightly worse for normal Tokens. I'll leave the decision on which implementation to use up to you :)
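
A simplified sketch of the tradeoff under discussion; MiniLexer and its two fields are stand-ins for the real Lexer, which has many more fields. RawToken only records positions, so the scratch buffer's contents never matter for it, whereas a normal Token needs the text accumulated in charstore:

# Stand-in for the real Lexer type, reduced to the two relevant fields.
mutable struct MiniLexer
    io::IOBuffer         # the input being read
    charstore::IOBuffer  # scratch buffer for the current token's text
end

# "Copy" approach: a fresh scratch buffer for each nested lexer spawned
# by string interpolation. Safe for Tokens that keep text, but allocates.
fresh_buffer(l::MiniLexer) = MiniLexer(l.io, IOBuffer())

# "No copy" approach: share the buffer. Fine when only positions are
# needed (the RawToken case), worse when token text must survive.
shared_buffer(l::MiniLexer) = MiniLexer(l.io, l.charstore)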

KristofferC commented 2 years ago

RawToken seems like what you would use if you want the best performance, so I think it makes sense to optimize for that.

KristofferC commented 2 years ago

@pfitzseb, is this ready to go?

pfitzseb commented 2 years ago

I think so, but it doesn't actually do much :P

KristofferC commented 2 years ago

The benchmark in the first post looks pretty good, no?

c42f commented 2 years ago

FWIW I based my copy/fork in JuliaSyntax.Tokenize off this branch. I'd be happy to see it merged.