How diffsitter works, How performant it is and What cases are exemplatory good and bad

matu3ba commented 3 years ago

Difftastic has a complete section dedicated what methods it uses, but has the performance not written in the repo but in the inspired projects: https://github.com/Wilfred/difftastic#how-it-works Could you elaborate what algorithm you use for the diff? A and dijkstra for graph difference are the most prominent ones, where A is much more performant on multiple changes (in a function) but has overall worse results.

Performance of examples (ideally with benchmarks), Known problems and Non-Goals would complete the first impression.

I am asking due for an evaluation of neovim what to use, but I am not affiliated or core member (only neovim user) and would like to see things moving forward.

afnanenayet commented 3 years ago

I can write up how it works. I started out with a very naive min-edit-distance algorithm, but I was thinking of switching to Myers since you're effectively comparing an in-order traversal of the graph and Myers is extremely efficient in terms of space (profiling showed that the current biggest bottleneck in diffsitter is malloc). Diffsitter itself has lots of logging timers and I've done some profiling with dtrace so I know where the current hotspots are.

I can also try operating directly on the graph differences and see which approach is the fastest, I've been meaning to improve the performance for a while now.

I'll add documentation with the known problems and non-goals soon (TM).

These are great suggestions, thanks :)

krishnakumarg1984 commented 2 years ago

What do the devs here think about considering various aspects mentioned here in a future release?

0xdevalias commented 9 months ago

@afnanenayet Curious what the state of this is in 2024?

Searching the repo, it looks like the Myers algorithm may have been implemented in September 2021, eg.:

https://github.com/afnanenayet/diffsitter/pull/192

I was just trying to diff a large bundled/minified JS file (~8.4MB; ~249,128 lines long), and noticed that majority of the time was spent in the diff::compute_edit_script() step (737.685108096s, aka: ~12.29min); whereas parsing/processing the AST only took less than 2s each. I also think it might have silently crashed somewhere along the way, but I haven't had a chance to try it again to confirm that yet. You can see more detailed info/specifics in this comment:

https://github.com/0xdevalias/chatgpt-source-watch/issues/3#issuecomment-1916176579

Minimising the debug output to just the relevant timing events:

// Start
2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:07:59.791Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.50364822s
// ..snip..
2024-01-30T07:08:01.160Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.364231801s
2024-01-30T07:08:01.985Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=825.767557ms
2024-01-30T07:08:02.972Z INFO  TimerFinished        > ast::process(), Elapsed=1.812168007s
2024-01-30T07:08:03.814Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=841.781925ms
2024-01-30T07:08:04.787Z INFO  TimerFinished        > ast::process(), Elapsed=1.815188386s
2024-01-30T07:20:22.478Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=737.685108096s
// ..snip..
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301
// Finish

As a point of comparison, difftastic was seemingly able to diff this file in ~6.746sec total:

⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   2.63s user 0.46s system 47% cpu 6.494 total
subl  0.01s user 0.02s system 0% cpu 6.746 total

Though with a lot of this printed among the diff output:

_app.js --- 1/674 --- Text (8.39 MiB exceeded DFT_BYTE_LIMIT)

Edit: It seems when DFT_BYTE_LIMIT is exceeded difftastic falls back to a text diff, so that's not really a fair time comparison:

https://github.com/Wilfred/difftastic/blob/master/CHANGELOG.md#020-released-20th-february-2022
- difftastic will now use a text diff for large files that are too big to parse in a reasonable amount of time. This threshold is configurable with --byte-limit and DFT_BYTE_LIMIT.

I tried overriding that in my .gitconfig:

# https://github.com/Wilfred/difftastic
[difftool "difftastic"]
  cmd = difft --byte-limit 20971520 "$LOCAL" "$REMOTE"

And then running it again, but then I just got a different set of errors:

 ⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   12.42s user 1.10s system 79% cpu 17.043 total
subl  0.01s user 0.02s system 0% cpu 17.248 total

_app.js --- 1/674 --- Text (2 JavaScript parse errors, exceeded DFT_PARSE_ERROR_LIMIT)

afnanenayet commented 9 months ago

Could you send me the file you were trying to diff? Would love to do some profiling and see if there's anything I can do to make diffsitter faster. Alternatively we could have some timeout for computing the edit script and fallback to a simpler diff algo or error out.

I'm actually working on ditching the myers diff algo in favor of computing a diff on the tree directly

afnanenayet commented 9 months ago

I also think it might have silently crashed somewhere along the way, but I haven't had a chance to try it again to confirm that yet.

Also very intrigued by this. What did you see that led you to suspect this? If you get a chance to investigate feel free to throw me whatever you got.

Thanks for all the work you've put in to raising issues and helping with debugging, it's much appreciated!

0xdevalias commented 9 months ago

Could you send me the file you were trying to diff?

@afnanenayet Was trying to diff these 2 (basically to see the changes between the 2 builds/commits):

Alternatively we could have some timeout for computing the edit script and fallback to a simpler diff algo or error out.

@afnanenayet For my specific use case / interest here, timing out would be suboptimal, as ideally I would like it to run even if it takes a longer time to do it (assuming the output ends up being good). Though a timeout/error for others in more normal use cases might make sense.

I'm actually working on ditching the myers diff algo in favor of computing a diff on the tree directly

@afnanenayet Oo, interesting! Is that for speed reasons, or better outcomes, or?

I also think it might have silently crashed somewhere along the way

What did you see that led you to suspect this? If you get a chance to investigate feel free to throw me whatever you got.

@afnanenayet See below for more details, but it was basically a gut feel based on not seeing the diff output (I might have closed the file it was piping to by accident though), and not seeing the 'end hunk' line in the debug logs.

You can see the full output/etc in this comment, look for this part:

We can then see where the real performance hit is when running this against a MUCH larger file/diff (the file being diffed is ~8.4MB; ~249,128 lines long) diffsitter debug output from large example
We can see that ~12.29 minutes was spent in diff::compute_edit_script(), and then it looks like the script might have silently crashed, as I didn't see the diff output, nor the final DEBUG libdiffsitter::render::unified > End hunk line that I would have expected to see.

Note: I was piping this command to subl (Sublime Text), despite what the code block above was edited to look like; I haven't tried running this again yet printing directly to STDOUT/a file. I also haven't tried running this without debug mode to see if it somehow doesn't crash that way.
https://github.com/0xdevalias/chatgpt-source-watch/issues/3#issuecomment-1916176579

Thanks for all the work you've put in to raising issues and helping with debugging, it's much appreciated!

@afnanenayet Thanks for making an awesome tool! It's exciting to see such quick/positive responses; I had somewhat assumed that it might have been abandonware.

afnanenayet commented 9 months ago

Nope! Not abandonware, I'm just busy and have been working on these things really slowly

0xdevalias commented 9 months ago

Haha, totally understand that feel.

afnanenayet / diffsitter

How diffsitter works, How performant it is and What cases are exemplatory good and bad #149