Open 0xdevalias opened 9 months ago
I would recommend difftastic for this! Actually rspack has already used it for checking diff between its output with that of webpack.
I would recommend difftastic for this
rspack has already used it for checking diff between its output with that of webpack
@HerringtonDarkholme Interesting.. do you know if they did so while suppressing the 'noise' of changed variables? Or was it more just generally to ensure they were doing compatible things. I had a quick google, but didn't seem to turn up anything specific beyond the repo/etc:
A fast Rust-based web bundler
Edit: Opened the following issue on difftastic
:
Edit 2: And this one on diffsitter
:
I was also just re-reading through the diffsitter
README and noticed this section that I somehow missed in the past; which sounds like it might be exactly like what I want:
A tree-sitter based AST difftool to get meaningful semantic diffs
You can also filter which tree sitter nodes are considered in the diff through the config file.
You can filter the nodes that are considered in the diff by setting
include_nodes
orexclude_nodes
in the config file.exclude_nodes
always takes precedence overinclude_nodes
, and the type of a node is the kind of a tree-sitter node.This feature currently only applies to leaf nodes, but we could exclude nodes recursively if there's demand for it.
Though playing with diffsitter
just now, it's output format seems to leave a lot to be desired compared to typical git diff
unified output; and it doesn't show any context/etc currently:
eg. On a very minimal example, diffsitter
:
⇒ git difftool --tool diffsitter HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-AOnHKy/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-ahyGIo/_buildManifest.js
===================================================================================
80:
---
- "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],
80:
---
+ "/search": ["static/chunks/pages/search-d835393483b5432a.js"],
138:
----
+ "static/chunks/5054-e2060ddbea2abdb7.js"
138:
----
- "static/chunks/5054-8ad3d13d663a6185.js"
Vs git diff
(with delta
):
⇒ git diff HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
diff --git a/unpacked/_next/static/[buildHash]/_buildManifest.js b/unpacked/_next/static/[buildHash]/_buildManifest.js
index 851a8f0..5004cc7 100644
--- a/unpacked/_next/static/[buildHash]/_buildManifest.js
+++ b/unpacked/_next/static/[buildHash]/_buildManifest.js
@@ -78,7 +78,7 @@
"/payments/success-trial": [
"static/chunks/pages/payments/success-trial-84597e34390c1506.js",
],
- "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],
+ "/search": ["static/chunks/pages/search-d835393483b5432a.js"],
"/share/e/[[...shareParams]]": [
"static/chunks/pages/share/e/[[...shareParams]]-899e50f90dac9ff5.js",
],
@@ -136,6 +136,6 @@
"static/chunks/5017-f7c5e142fc7f0516.js",
"static/chunks/3975-37a9301353b29c5d.js",
"static/chunks/3754-ae5dc2fb759ecfc1.js",
- "static/chunks/5054-8ad3d13d663a6185.js"
+ "static/chunks/5054-e2060ddbea2abdb7.js"
)),
self.__BUILD_MANIFEST_CB && self.__BUILD_MANIFEST_CB();
This is the diffsitter
config I was using:
~/.config/diffsitter/config.json5
// Default: `diffsitter dump-default-config`
// See also: https://github.com/afnanenayet/diffsitter/blob/v0.8.1/assets/sample_config.json5
// Colours: `color256`, `black`, `red`, `green`, `yellow`, `blue`, `magenta`, `cyan`, `white`
{
"formatting": {
"default": "unified",
"unified": {
"addition": {
"highlight": "green",
"regular-foreground": "green",
"emphasized-foreground": "white",
"bold": true,
"underline": false,
"prefix": "+ "
},
"deletion": {
"highlight": "red",
"regular-foreground": "red",
"emphasized-foreground": "white",
"bold": true,
"underline": false,
"prefix": "- "
}
},
"json": {
"pretty_print": false
},
"custom": {}
},
"grammar": {
"dylib-overrides": null,
"file-associations": {
"js": "typescript",
"jsx": "tsx"
},
},
"input-processing": {
"split-graphemes": true,
// You can exclude different tree sitter node types - this rule takes precedence over `include_kinds`.
"exclude-kinds": null,
// "exclude-kinds": ["string"],
// You can specifically allow only certain tree sitter node types
"include-kinds": null
// "include-kinds": ["method_definition"],
},
// Specify a fallback command if diffsitter can't parse the given input
// files. This is invoked by diffsitter as:
//
// ```sh
// ${fallback_cmd} ${old} ${new}
// ```
"fallback-cmd": null,
// "fallback-cmd": "diff",
}
And this is the .gitconfig
I was using to run it as a git difftool
:
# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter"]
cmd = diffsitter "$LOCAL" "$REMOTE"
# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter-debug"]
cmd = diffsitter --debug "$LOCAL" "$REMOTE"
Running it with --debug
allows us to see the steps it takes during processing, and how long they take. On our small example file above, the output looks like this:
diffsitter
debug output from minimal exampleReducing that output to just show the relevant timing events:
// Start
2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:09:30.628Z INFO TimerFinished > parse::parse_file(), Elapsed=968.896µs
// ..snip..
2024-01-30T07:09:30.629Z INFO TimerFinished > parse::parse_file(), Elapsed=590.376µs
2024-01-30T07:09:30.629Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=388.582µs
2024-01-30T07:09:30.630Z INFO TimerFinished > ast::process(), Elapsed=1.083248ms
2024-01-30T07:09:30.630Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=309.698µs
2024-01-30T07:09:30.631Z INFO TimerFinished > ast::process(), Elapsed=997.049µs
2024-01-30T07:09:30.631Z INFO TimerFinished > diff::compute_edit_script(), Elapsed=127.357µs
// ..snip..
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
// Finish
We can then see where the real performance hit is when running this against a MUCH larger file/diff (the file being diffed is ~8.4MB; ~249,128 lines long)
diffsitter
debug output from large exampleReducing that output to just show the relevant timing events:
// Start
2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:07:59.791Z INFO TimerFinished > parse::parse_file(), Elapsed=1.50364822s
// ..snip..
2024-01-30T07:08:01.160Z INFO TimerFinished > parse::parse_file(), Elapsed=1.364231801s
2024-01-30T07:08:01.985Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=825.767557ms
2024-01-30T07:08:02.972Z INFO TimerFinished > ast::process(), Elapsed=1.812168007s
2024-01-30T07:08:03.814Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=841.781925ms
2024-01-30T07:08:04.787Z INFO TimerFinished > ast::process(), Elapsed=1.815188386s
2024-01-30T07:20:22.478Z INFO TimerFinished > diff::compute_edit_script(), Elapsed=737.685108096s
// ..snip..
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301
// Finish
We can see that ~12.29 minutes was spent in diff::compute_edit_script()
, and then it looks like the script might have silently crashed, as I didn't see the diff output, nor the final DEBUG libdiffsitter::render::unified > End hunk
line that I would have expected to see.
Note: I was piping this command to subl
(Sublime Text), despite what the code block above was edited to look like; I haven't tried running this again yet printing directly to STDOUT/a file. I also haven't tried running this without debug mode to see if it somehow doesn't crash that way.
As a point of comparison, difftastic
was seemingly able to diff this file in ~2.46sec total:
⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD -- 2.63s user 0.46s system 47% cpu 6.494 total
subl 0.01s user 0.02s system 0% cpu 6.746 total
Though with a lot of this printed among the diff output:
_app.js --- 1/674 --- Text (8.39 MiB exceeded DFT_BYTE_LIMIT)
If a file exceeds
DFT_BYTE_LIMIT
,difftastic
now displays its size in the header.
Fixed an issue with unwanted underlines with textual diffing when
DFT_BYTE_LIMIT
is reached.
difftastic
will now use a text diff for large files that are too big to parse in a reasonable amount of time. This threshold is configurable with--byte-limit
andDFT_BYTE_LIMIT
.
It seems that when DFT_BYTE_LIMIT
is exceeded, difftastic
falls back to just doing a basic text diff, so those above timings aren't really a fair comparison. We can override this with --byte-limit
and DFT_BYTE_LIMIT
:
⇒ difftastic -h
Difftastic 0.52.0
..snip..
OPTIONS:
..snip..
--byte-limit <LIMIT> Use a text diff if either input file exceeds this size. [env:
DFT_BYTE_LIMIT=] [default: 1000000]
..snip..
--graph-limit <LIMIT> Use a text diff if the structural graph exceed this number of
nodes in memory. [env: DFT_GRAPH_LIMIT=] [default: 3000000]
Let's set this to something much higher than we need for now:
20*1024*1024 = 20971520
Updating our .gitconfig
:
# https://github.com/Wilfred/difftastic
[difftool "difftastic"]
cmd = difft --byte-limit 20971520 "$LOCAL" "$REMOTE"
Running that command again:
⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD -- 12.42s user 1.10s system 79% cpu 17.043 total
subl 0.01s user 0.02s system 0% cpu 17.248 total
It takes a bit longer, but then we hit a different error/limit:
_app.js --- 1/674 --- Text (2 JavaScript parse errors, exceeded DFT_PARSE_ERROR_LIMIT)
From the help:
⇒ difft --help | subl
..snip..
--parse-error-limit <LIMIT>
Use a text diff if the number of parse errors exceeds this value.
[env: DFT_PARSE_ERROR_LIMIT=]
[default: 0]
There seem to be 674
matches of DFT_PARSE_ERROR_LIMIT
in the output, which corresponds to the 674
chunks of output.
I didn't see any specific errors output related to the parsing though.. I wonder how we can see what the actual issue was?
It seems we can dump the AST either from tree-sitter
directly, or the difftastic
filtered version of it as a debug:
⇒ difft --help | subl
..snip..
DEBUG OPTIONS:
--dump-syntax <PATH>
Parse a single file with tree-sitter and display the difftastic syntax tree.
--dump-ts <PATH>
Parse a single file with tree-sitter and display the tree-sitter parse tree.
Full tree-sitter
AST:
⇒ time difft --dump-ts unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic-tree-sitter.ast
difft --dump-ts unpacked/_next/static/chunks/pages/_app.js > 5.08s user 9.49s system 95% cpu 15.312 total
⇒ du -h _app.js-difftastic-tree-sitter.ast
224M _app.js-difftastic-tree-sitter.ast
⇒ cat _app.js-difftastic-tree-sitter.ast| wc -l
2533687
difftastic
'syntax' AST:
⇒ time difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic.ast
difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js > 434.80s user 43.33s system 93% cpu 8:29.31 total
⇒ du -h _app.js-difftastic.ast
2.2G _app.js-difftastic.ast
⇒ cat _app.js-difftastic.ast | wc -l
10571815
After the above explorations, I ended up taking a different tact and started exploring the 'post processing git diff
' approach.
Initial PoC tests seemed to show some merit, so I hacked them together into a script that can filter an existing git diff
passed in from a file, or STDIN:
Some alternative potential workarounds I've considered are: pre-processing the files to standardize their variable/function names; and/or post-processing the diff output to try and detect when the only changes in a chunk are variable/function names; then suppressing that chunk.
After exploring/playing around with a bunch of tools yesterday, and a hyperfocus of hacking out some PoC code this afternoon/evening, I think I have the elements of a workable solution! 🎉
Just need to pull it all together and refine it once the 🧠 is refreshed/rested..
The code is still pretty hacky.. but pulled the PoC's together into a usable script for filtering the noise in the diff output!
Still not perfect.. but I am really liking these stats so far!
For an
8.4mb
_app.js
file (250,022
lines):Original diff:
33,399
Filtered:7,516
There are still a bunch of areas where my partial AST parsing is getting errors, which means more noise will slip through till I can fix those.. but this is already super useful!
Hopefully will get a version of it cleaned up/committed/pushed sooner rather than later.
A few more updates that I posted elsewhere as I was going; from
7100
, to6455
, to4268
, and now down to3913
lines.
Added a few more ‘pre AST parsing’ fixups that get the diff down even further: from the previous
7100
lines to6455
lines
Trialled parsing the AST with a more lenient parser as a pre-step; filtering out unknown bits, then generating that back into code; that I then apply some of my previous ‘preparation hacks’ to, before parsing it with the more strict parser and doing the identifier normalisation.
That’s currently reduced the
6455
lines down to4268
lines; and there are still leftover parsing errors that could potentially get that number lower still that I’m chipping away at.Another alternative could be to completely replace the stricter parser with the more lenient one; but then I would have to rewrite a bunch of other code as well.
I could also potentially try a different lenient parser; or pre-process the code being passed to this one; as I believe it’s failing to parse some statements even; which is potentially leading to more of that noise filtering through..
I added a pre-processing step to the lenient parser which improved things a bit; then I added an actual fix to the AST from the lenient parser to help fix more complex ternaries that were challenging to fix in my simpler string append pre/post fix methods.
With those, we’re now down to
3913
lines.Being able to properly detect and fix object properties that are missing their wrapping object would probably fix a bunch more of the issues.
Though maybe when I look at this next; I’ll just see whether using the lenient parser on its own gives a quicker/better result.
With the current PoC implementations in the diff minimiser, we're grouping by the diff chunks, and then by added/removed lines within those chunks.
Sometimes a section of code will churn and move a large amount of lines from one chunk, to another chunk, which won't get noticed by the minimiser currently.
It would be good if we could process these in a similar way to git diff
's color-moved
, with our 'ignore identifier changes', so that we can suppress both the 'added' and 'removed' chunks if they're otherwise unchanged (or minimise the diff between them to only show what has changed, rather than showing them as large 'disconnected blocks')
Edit: See also:
--
VSCode recently added a new feature called move code detection. The algorithm is surprisingly simple: after normal diff, calculate
similarity(deletion, insertion)
for everydeletion
&insertion
pair. The code can be found here https://github.com/microsoft/vscode/blob/166097a20cbd06d10d255ef561837c439f372de3/src/vs/editor/common/diff/defaultLinesDiffComputer/computeMovedLines.ts#L44-L85.But intuitively demonstrating the moving relationship on TUI might be a challenge.
Originally posted by @QuarticCat in https://github.com/Wilfred/difftastic/issues/508#issuecomment-1734691395
--
Have you looked into
git diff
's--color-moved
at all?Personally I tend to use the
zebra
mode of it. From my git aliases:# https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-movedltmodegt # https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-moved-wsltmodesgt # https://git-scm.com/docs/git-config#Documentation/git-config.txt-color # https://git-scm.com/docs/git-config#Documentation/git-config.txt-colordiffltslotgt diff-refactor = \ -c color.diff.oldMoved='white dim' \ -c color.diff.oldMovedAlternative='white dim' \ -c color.diff.newMoved='white dim' \ -c color.diff.newMovedAlternative='white dim' \ -c color.diff.newMovedDimmed='white dim' \ -c color.diff.newMovedAlternativeDimmed='white dim' \ diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal
Originally posted by @0xdevalias in https://github.com/Wilfred/difftastic/issues/539#issuecomment-1916033305
A lot of this will probably be uninteresting/useless noise, but I just made a repo and pushed all of my hacky WIP/PoC AST exploration scripts to it so they're captured in one place:
- https://github.com/0xdevalias/poc-ast-tools
PoC scripts and tools for working with (primarily JavaScript) ASTs.
What might be of interest though, is that I pushed the current hacky state of my diff minimiser code there too. I haven't checked it for a good few weeks, so can't even remember if it's in a runnable state currently, but if nothing else, there might be some ideas hidden away in there that help explain what I was referring to above:
When I get a chance to get back to them, I plan to finish my research/refinements on the best method(s), and then clean it all up back to a single useful script that I will commit to this repo. But until then, those may be of some interest to someone.
Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2024566283
Here's a fun little hack I just thought up (while the diff minimiser script still breaks
--color-moved
):
- Diff 2 files, run them through the minimiser, save that diff to a file
- Apply the output of the minimised diff to the original file, save that to a file
- Re-run diff with
--color-moved
between the original file, and the original+minimisedDiffPatch fileCurrently this doesn't work fully, as I think I'm not properly updating the line metadata in the diff hunks when I've removed sections from them in the diff minimiser.. but something like this (in theory):
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | \ sed 's|a/unpacked/_next/static/chunks/pages/_app.js|a/_app.original.js|g' | \ sed 's|b/unpacked/_next/static/chunks/pages/_app.js|b/_app.patched.js|g' \ > _app.minimised.diff ⇒ git show 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js > _app.original.js ⇒ git apply _app.minimised.diff ⇒ git -c color.diff.oldMoved='white dim' \ -c color.diff.oldMovedAlternative='white dim' \ -c color.diff.newMoved='white dim' \ -c color.diff.newMovedAlternative='white dim' \ -c color.diff.newMovedDimmed='white dim' \ -c color.diff.newMovedAlternativeDimmed='white dim' \ diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js
Edit: I wrote a new
scripts/fix-diff-headers.js
(https://github.com/0xdevalias/chatgpt-source-watch/commit/f7deec74cca9127bc2f18279d83f7a088f3f35f1) script that seems to get it closer to working..⇒ ./scripts/fix-diff-headers.js _app.minimised.diff _app.fixed.diff Reading diff from _app.minimised.diff and writing corrected diff to _app.fixed.diff Diff successfully corrected and written to _app.fixed.diff ⇒ patch _app.original.js _app.fixed.diff -o _app.patched.js patching file _app.original.js 6 out of 90 hunks failed--saving rejects to _app.patched.js.rej ⇒ git -c color.diff.oldMoved='white dim' \ -c color.diff.oldMovedAlternative='white dim' \ -c color.diff.newMoved='white dim' \ -c color.diff.newMovedAlternative='white dim' \ -c color.diff.newMovedDimmed='white dim' \ -c color.diff.newMovedAlternativeDimmed='white dim' \ diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js
That way, instead of just seeing a giant chunk of 'removed' and a giant chunk of 'added' when this module moved; we can see the dimmed lines which were moved (but otherwise unchanged), and spend less time/effort having to consider them:
An even better version of this would be a
diff --color-moved
implementation that was able to ignore identifier names changing entirely (in which case I suspect it would mark this entire block as 'moved but not changed'); but as far as I am currently aware, the only way to do that would be to basically re-implement the wholediff --color-moved
logic ourselves; which would basically end up with us writing our own AST differ I believe.Though the way I was originally intending on fixing/improving the diff minimiser script was to try just maintaining the ANSI colouring of the input lines, and outputting the lines with the same colouring at the end.
Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2073982692
There can be a lot of 'noise' when diffing minimised bundled code, as the bundler will often change the minified variable names it uses at times between builds (even if the rest of the code hasn't changed)
We can attempt to reduce this by using non-default git diff modes such as
patience
/histogram
/minimal
:Musings
See Also
Additional links to review