Open matu3ba opened 3 years ago
I can write up how it works. I started out with a very naive min-edit-distance algorithm, but I was thinking of switching to Myers since you're effectively comparing an in-order traversal of the graph and Myers is extremely efficient in terms of space (profiling showed that the current biggest bottleneck in diffsitter is malloc
). Diffsitter itself has lots of logging timers and I've done some profiling with dtrace so I know where the current hotspots are.
I can also try operating directly on the graph differences and see which approach is the fastest, I've been meaning to improve the performance for a while now.
I'll add documentation with the known problems and non-goals soon (TM).
These are great suggestions, thanks :)
What do the devs here think about considering various aspects mentioned here in a future release?
@afnanenayet Curious what the state of this is in 2024?
Searching the repo, it looks like the Myers algorithm may have been implemented in September 2021, eg.:
I was just trying to diff a large bundled/minified JS file (~8.4MB; ~249,128 lines long), and noticed that majority of the time was spent in the diff::compute_edit_script()
step (737.685108096s
, aka: ~12.29min
); whereas parsing/processing the AST only took less than 2s
each. I also think it might have silently crashed somewhere along the way, but I haven't had a chance to try it again to confirm that yet. You can see more detailed info/specifics in this comment:
Minimising the debug output to just the relevant timing events:
// Start
2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:07:59.791Z INFO TimerFinished > parse::parse_file(), Elapsed=1.50364822s
// ..snip..
2024-01-30T07:08:01.160Z INFO TimerFinished > parse::parse_file(), Elapsed=1.364231801s
2024-01-30T07:08:01.985Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=825.767557ms
2024-01-30T07:08:02.972Z INFO TimerFinished > ast::process(), Elapsed=1.812168007s
2024-01-30T07:08:03.814Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=841.781925ms
2024-01-30T07:08:04.787Z INFO TimerFinished > ast::process(), Elapsed=1.815188386s
2024-01-30T07:20:22.478Z INFO TimerFinished > diff::compute_edit_script(), Elapsed=737.685108096s
// ..snip..
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301
// Finish
As a point of comparison, difftastic
was seemingly able to diff this file in ~6.746sec
total:
⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD -- 2.63s user 0.46s system 47% cpu 6.494 total
subl 0.01s user 0.02s system 0% cpu 6.746 total
Though with a lot of this printed among the diff output:
_app.js --- 1/674 --- Text (8.39 MiB exceeded DFT_BYTE_LIMIT)
Edit: It seems when DFT_BYTE_LIMIT
is exceeded difftastic
falls back to a text diff, so that's not really a fair time comparison:
difftastic
will now use a text diff for large files that are too big to parse in a reasonable amount of time. This threshold is configurable with--byte-limit
andDFT_BYTE_LIMIT
.
I tried overriding that in my .gitconfig
:
# https://github.com/Wilfred/difftastic
[difftool "difftastic"]
cmd = difft --byte-limit 20971520 "$LOCAL" "$REMOTE"
And then running it again, but then I just got a different set of errors:
⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD -- 12.42s user 1.10s system 79% cpu 17.043 total
subl 0.01s user 0.02s system 0% cpu 17.248 total
_app.js --- 1/674 --- Text (2 JavaScript parse errors, exceeded DFT_PARSE_ERROR_LIMIT)
Could you send me the file you were trying to diff? Would love to do some profiling and see if there's anything I can do to make diffsitter faster. Alternatively we could have some timeout for computing the edit script and fallback to a simpler diff algo or error out.
I'm actually working on ditching the myers diff algo in favor of computing a diff on the tree directly
I also think it might have silently crashed somewhere along the way, but I haven't had a chance to try it again to confirm that yet.
Also very intrigued by this. What did you see that led you to suspect this? If you get a chance to investigate feel free to throw me whatever you got.
Thanks for all the work you've put in to raising issues and helping with debugging, it's much appreciated!
Could you send me the file you were trying to diff?
@afnanenayet Was trying to diff these 2 (basically to see the changes between the 2 builds/commits):
Alternatively we could have some timeout for computing the edit script and fallback to a simpler diff algo or error out.
@afnanenayet For my specific use case / interest here, timing out would be suboptimal, as ideally I would like it to run even if it takes a longer time to do it (assuming the output ends up being good). Though a timeout/error for others in more normal use cases might make sense.
I'm actually working on ditching the myers diff algo in favor of computing a diff on the tree directly
@afnanenayet Oo, interesting! Is that for speed reasons, or better outcomes, or?
I also think it might have silently crashed somewhere along the way
What did you see that led you to suspect this? If you get a chance to investigate feel free to throw me whatever you got.
@afnanenayet See below for more details, but it was basically a gut feel based on not seeing the diff output (I might have closed the file it was piping to by accident though), and not seeing the 'end hunk' line in the debug logs.
You can see the full output/etc in this comment, look for this part:
We can then see where the real performance hit is when running this against a MUCH larger file/diff (the file being diffed is ~8.4MB; ~249,128 lines long)
diffsitter
debug output from large example
We can see that
~12.29
minutes was spent indiff::compute_edit_script()
, and then it looks like the script might have silently crashed, as I didn't see the diff output, nor the finalDEBUG libdiffsitter::render::unified > End hunk
line that I would have expected to see.Note: I was piping this command to
subl
(Sublime Text), despite what the code block above was edited to look like; I haven't tried running this again yet printing directly to STDOUT/a file. I also haven't tried running this without debug mode to see if it somehow doesn't crash that way.
https://github.com/0xdevalias/chatgpt-source-watch/issues/3#issuecomment-1916176579
Thanks for all the work you've put in to raising issues and helping with debugging, it's much appreciated!
@afnanenayet Thanks for making an awesome tool! It's exciting to see such quick/positive responses; I had somewhat assumed that it might have been abandonware.
Nope! Not abandonware, I'm just busy and have been working on these things really slowly
Haha, totally understand that feel.
Difftastic has a complete section dedicated what methods it uses, but has the performance not written in the repo but in the inspired projects: https://github.com/Wilfred/difftastic#how-it-works Could you elaborate what algorithm you use for the diff? A and dijkstra for graph difference are the most prominent ones, where A is much more performant on multiple changes (in a function) but has overall worse results.
Performance of examples (ideally with benchmarks), Known problems and Non-Goals would complete the first impression.
I am asking due for an evaluation of neovim what to use, but I am not affiliated or core member (only neovim user) and would like to see things moving forward.