Closed: pfitzseb closed this pull request 2 years ago.
Merging #192 (65f8b74) into master (860bdc6) will increase coverage by 0.12%. The diff coverage is 96.42%.
@@ Coverage Diff @@
## master #192 +/- ##
==========================================
+ Coverage 82.80% 82.93% +0.12%
==========================================
Files 4 4
Lines 826 832 +6
==========================================
+ Hits 684 690 +6
Misses 142 142
| Impacted Files | Coverage Δ |
|---|---|
| src/lexer.jl | 93.80% <96.42%> (+0.05%) :arrow_up: |
Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update 860bdc6...65f8b74. Read the comment docs.
Any idea where all the allocations are from?
`--track-allocation` attributes 3.7 MB to
https://github.com/JuliaLang/Tokenize.jl/blob/bed7f32b8abee424d060358cd361a1c0a9d95076/src/lexer.jl#L853
and 1 MB to
https://github.com/JuliaLang/Tokenize.jl/blob/bed7f32b8abee424d060358cd361a1c0a9d95076/src/lexer.jl#L100
Still looking for the rest...
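For anyone unfamiliar with the workflow: `julia --track-allocation=user` writes `.mem` files next to each source file, where every line is prefixed with the bytes allocated there. A quick way to rank hot lines is a numeric sort; the `.mem` excerpt below is fabricated for illustration (the real files for this PR would point at the `src/lexer.jl` lines linked above):

```shell
# Write a fabricated .mem excerpt in the format julia --track-allocation=user
# produces: each line is prefixed with the bytes allocated at that source
# line ("-" means the line was never executed).
cat > lexer.jl.mem <<'EOF'
        - function Base.copy(l::Lexer)
  3700000     Lexer(copy(l.io), l.pos)
        0 end
EOF

# Sort numerically by the leading byte count to surface the hottest line.
sort -rn lexer.jl.mem | head -n 1
```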
That looks kinda crazy to me. Why would we need to copy the whole lexer?
Because interpolation resets everything?
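To make the concern concrete, here is a minimal sketch (the struct, its fields, and the `copy` method are invented for illustration, not Tokenize.jl's actual definitions) of why copying the whole lexer allocates: the copy has to set up a fresh `IOBuffer` char store, and that heap allocation is paid on every interpolation:

```julia
# Hypothetical, stripped-down lexer state; the real Lexer in src/lexer.jl
# carries many more fields, but the allocation pattern is the same.
mutable struct MiniLexer
    pos::Int
    charstore::IOBuffer   # accumulates the characters of the current token
end

# The fresh IOBuffer (and its backing byte array) is the allocation that
# --track-allocation would attribute to this line.
Base.copy(l::MiniLexer) = MiniLexer(l.pos, IOBuffer())

l = MiniLexer(1, IOBuffer())
copy(l)                          # warm up so compilation isn't measured
bytes = @allocated copy(l)
println(bytes > 0)               # each copy pays for a new buffer
```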
Anyways, the remaining allocs were due to me reading the whole file into a string while profiling; without that, the profile looks like

```
λ julia --project=. test/profile.jl
First run took 0.561283966 seconds with 73.583024 MB allocated
Tokenized 1556 files in 3.632539304999996 seconds with 2.368488 MB allocated
```

So yes, allocating the new `IOBuffer()` in `copy` looks expensive, but that's mostly because the GC runs then.
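A cheap way to check the "it's mostly GC" explanation is `@timed`, which reports both bytes allocated and time spent in garbage collection. This is a sketch with a synthetic workload, not the PR's benchmark; absolute numbers vary by machine and Julia version:

```julia
# Allocate many short-lived IOBuffers and inspect how the cost splits
# between allocation volume and GC pauses.
function churn(n)
    s = 0
    for _ in 1:n
        s += length(take!(IOBuffer("x")))
    end
    return s
end

churn(1)                       # warm up / compile first
stats = @timed churn(100_000)
println(stats.bytes > 0)       # the allocations are real...
println(stats.gctime >= 0.0)   # ...but the big pauses show up as gctime
```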
> Because interpolation resets everything?

Hmm, well, not everything? You are still reading characters into the same char buffer, so that one seems unnecessary to copy, or?
Right, but crucially we can't reuse `l.charstore`. The latest commit removes the copy and is faster in the `RawToken` case, but slightly worse for normal `Token`s. I'll leave the decision on what implementation to use up to you :)
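The buffer-reuse idea can be sketched as follows (a simplification with an invented helper, not the PR's actual code): a `Token`-style lexer that materializes token text can keep a single `IOBuffer` alive and drain it with `take!`, instead of copying the buffer per interpolation:

```julia
buf = IOBuffer()

# Drain the shared buffer into a String; take! empties it in place, so the
# same buffer can be reused for the next token without allocating a new
# IOBuffer each time.
emit_string!(buf::IOBuffer) = String(take!(buf))

write(buf, "foo")
s1 = emit_string!(buf)
write(buf, "bar")              # buffer is empty and reusable after take!
s2 = emit_string!(buf)
println((s1, s2))              # prints ("foo", "bar")
```

A `RawToken`-style lexer goes further and never materializes the string at all, which is why the copy can be dropped entirely in that case.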
`RawToken` seems like what you would use if you want the best performance, so I think it makes sense to optimize for that.
@pfitzseb, is this ready to go?
I think so, but it doesn't actually do much :P
The benchmark in the first post looks pretty good, or?
FWIW I based my copy/fork in `JuliaSyntax.Tokenize` off this branch. I'd be happy to see it merged.
This improves performance by a bit when tokenizing Base and reduces code complexity: