0xdevalias / chatgpt-source-watch

Analyzing the evolution of ChatGPT's codebase through time with curated archives and scripts
https://github.com/0xdevalias/chatgpt-source-watch/blob/main/CHANGELOG.md
Other
274 stars 16 forks source link

Explore AST based diff tools #3

Open 0xdevalias opened 9 months ago

0xdevalias commented 9 months ago

There can be a lot of 'noise' when diffing minimised bundled code, as the bundler will often change the minified variable names it uses at times between builds (even if the rest of the code hasn't changed)

We can attempt to reduce this by using non-default git diff modes such as patience / histogram / minimal:

⇒ git diff --diff-algorithm=default -- unpacked/_next/static/chunks/pages/_app.js | wc -l
  116000

⇒ git diff --diff-algorithm=patience -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35826

⇒ git diff --diff-algorithm=histogram -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35835

⇒ git diff --diff-algorithm=minimal -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35844

Musings

⭐ Suggestion

It would be cool if ast-grep was able to show a diff between 2 files, but do it using the AST rather than just a raw text compare. Ideally we would be able to provide options to this, such as ignoring chunks where the only change is to a variable/function name (eg. for diffing minimised JavaScript webpack builds)

Ideally the output would be text still (not the AST tree), but the actually diffing could be done at the AST level.

💻 Use Cases

This would be really useful for minimising the noise when diffing minimised source builds looking for the 'real changes' between the builds (not just minimised variable names churning, etc)

Looking through current diff output formats shows all of the variable name changes as well, which equates to a lot of noise while looking for the relevant changes.

Some alternative potential workarounds I've considered are either pre-processing the files to standardize their variable/function names; and/or post-processing the diff output to try and detect when the only changes in a chunk are variable/function names, and then suppressing that chunk. Currently I'm just relying on git diff --diff-algorithm=minimal -- thefile.js

Originally posted by @0xdevalias in https://github.com/ast-grep/ast-grep/issues/901

See Also

Additional links to review

HerringtonDarkholme commented 9 months ago

I would recommend difftastic for this! Actually rspack has already used it for checking diff between its output with that of webpack.

0xdevalias commented 9 months ago

I would recommend difftastic for this

rspack has already used it for checking diff between its output with that of webpack

@HerringtonDarkholme Interesting.. do you know if they did so while suppressing the 'noise' of changed variables? Or was it more just generally to ensure they were doing compatible things. I had a quick google, but didn't seem to turn up anything specific beyond the repo/etc:

Edit: Opened the following issue on difftastic:

Edit 2: And this one on diffsitter:

I was also just re-reading through the diffsitter README and noticed this section that I somehow missed in the past; which sounds like it might be exactly like what I want:

0xdevalias commented 9 months ago

Though playing with diffsitter just now, it's output format seems to leave a lot to be desired compared to typical git diff unified output; and it doesn't show any context/etc currently:

eg. On a very minimal example, diffsitter:

image

⇒ git difftool --tool diffsitter HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-AOnHKy/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-ahyGIo/_buildManifest.js
===================================================================================

80:
---
-     "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],

80:
---
+     "/search": ["static/chunks/pages/search-d835393483b5432a.js"],

138:
----
+   "static/chunks/5054-e2060ddbea2abdb7.js"

138:
----
-   "static/chunks/5054-8ad3d13d663a6185.js"

Vs git diff (with delta):

image

⇒ git diff HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
diff --git a/unpacked/_next/static/[buildHash]/_buildManifest.js b/unpacked/_next/static/[buildHash]/_buildManifest.js
index 851a8f0..5004cc7 100644
--- a/unpacked/_next/static/[buildHash]/_buildManifest.js
+++ b/unpacked/_next/static/[buildHash]/_buildManifest.js
@@ -78,7 +78,7 @@
     "/payments/success-trial": [
       "static/chunks/pages/payments/success-trial-84597e34390c1506.js",
     ],
-    "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],
+    "/search": ["static/chunks/pages/search-d835393483b5432a.js"],
     "/share/e/[[...shareParams]]": [
       "static/chunks/pages/share/e/[[...shareParams]]-899e50f90dac9ff5.js",
     ],
@@ -136,6 +136,6 @@
   "static/chunks/5017-f7c5e142fc7f0516.js",
   "static/chunks/3975-37a9301353b29c5d.js",
   "static/chunks/3754-ae5dc2fb759ecfc1.js",
-  "static/chunks/5054-8ad3d13d663a6185.js"
+  "static/chunks/5054-e2060ddbea2abdb7.js"
 )),
   self.__BUILD_MANIFEST_CB && self.__BUILD_MANIFEST_CB();

This is the diffsitter config I was using:

// Default: `diffsitter dump-default-config`
// See also: https://github.com/afnanenayet/diffsitter/blob/v0.8.1/assets/sample_config.json5
// Colours: `color256`, `black`, `red`, `green`, `yellow`, `blue`, `magenta`, `cyan`, `white`
{
  "formatting": {
    "default": "unified",
    "unified": {
      "addition": {
        "highlight": "green",
        "regular-foreground": "green",
        "emphasized-foreground": "white",
        "bold": true,
        "underline": false,
        "prefix": "+ "
      },
      "deletion": {
        "highlight": "red",
        "regular-foreground": "red",
        "emphasized-foreground": "white",
        "bold": true,
        "underline": false,
        "prefix": "- "
      }
    },
    "json": {
      "pretty_print": false
    },
    "custom": {}
  },
  "grammar": {
    "dylib-overrides": null,
    "file-associations": {
      "js": "typescript",
      "jsx": "tsx"
    },
  },
  "input-processing": {
    "split-graphemes": true,
    // You can exclude different tree sitter node types - this rule takes precedence over `include_kinds`.
    "exclude-kinds": null,
    // "exclude-kinds": ["string"],
    // You can specifically allow only certain tree sitter node types
    "include-kinds": null
    // "include-kinds": ["method_definition"],
  },
  // Specify a fallback command if diffsitter can't parse the given input
  // files. This is invoked by diffsitter as:
  //
  // ```sh
  // ${fallback_cmd} ${old} ${new}
  // ```
  "fallback-cmd": null,
  // "fallback-cmd": "diff",
}

And this is the .gitconfig I was using to run it as a git difftool:

# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter"]
  cmd = diffsitter "$LOCAL" "$REMOTE"

# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter-debug"]
  cmd = diffsitter --debug "$LOCAL" "$REMOTE"

Running it with --debug allows us to see the steps it takes during processing, and how long they take. On our small example file above, the output looks like this:

diffsitter debug output from minimal example ```bash ⇒ git difftool --tool diffsitter-debug --color=always HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js 2024-01-30T07:09:30.627Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js can be parsed 2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings 2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript 2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript 2024-01-30T07:09:30.627Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js can be parsed 2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings 2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript 2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript 2024-01-30T07:09:30.627Z DEBUG diffsitter > Extensions for both input files are supported 2024-01-30T07:09:30.627Z DEBUG libdiffsitter > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js to string 2024-01-30T07:09:30.627Z INFO libdiffsitter > Will deduce filetype from file extension 2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings 2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript 2024-01-30T07:09:30.627Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript 2024-01-30T07:09:30.627Z DEBUG libdiffsitter::parse > Constructed parser 2024-01-30T07:09:30.628Z DEBUG libdiffsitter::parse > Parsed AST 2024-01-30T07:09:30.628Z INFO TimerFinished > parse::parse_file(), Elapsed=968.896µs 2024-01-30T07:09:30.628Z DEBUG libdiffsitter > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js to string 2024-01-30T07:09:30.628Z INFO libdiffsitter > Will deduce filetype from file extension 2024-01-30T07:09:30.628Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings 2024-01-30T07:09:30.628Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript 2024-01-30T07:09:30.628Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript 2024-01-30T07:09:30.628Z DEBUG libdiffsitter::parse > Constructed parser 2024-01-30T07:09:30.629Z DEBUG libdiffsitter::parse > Parsed AST 2024-01-30T07:09:30.629Z INFO TimerFinished > parse::parse_file(), Elapsed=590.376µs 2024-01-30T07:09:30.629Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=388.582µs 2024-01-30T07:09:30.630Z INFO TimerFinished > ast::process(), Elapsed=1.083248ms 2024-01-30T07:09:30.630Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=309.698µs 2024-01-30T07:09:30.631Z INFO TimerFinished > ast::process(), Elapsed=997.049µs 2024-01-30T07:09:30.631Z INFO TimerFinished > diff::compute_edit_script(), Elapsed=127.357µs 2024-01-30T07:09:30.631Z INFO libdiffsitter::render::unified > Using stack style vertical for title 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 80 - 80) 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 4 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 80 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 80 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 80 - 80) 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 80 - 80) 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 4 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 80 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 80 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 80 - 80) 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 138 - 138) 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 5 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 138 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 138 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138) 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 138 - 138) 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 5 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 138 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 138 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138) /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js =================================================================================== 80: --- - "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"], 80: --- + "/search": ["static/chunks/pages/search-d835393483b5432a.js"], 138: ---- + "static/chunks/5054-e2060ddbea2abdb7.js" 138: ---- - "static/chunks/5054-8ad3d13d663a6185.js" ```

Reducing that output to just show the relevant timing events:

// Start
2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:09:30.628Z INFO  TimerFinished        > parse::parse_file(), Elapsed=968.896µs
// ..snip..
2024-01-30T07:09:30.629Z INFO  TimerFinished        > parse::parse_file(), Elapsed=590.376µs
2024-01-30T07:09:30.629Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=388.582µs
2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::process(), Elapsed=1.083248ms
2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=309.698µs
2024-01-30T07:09:30.631Z INFO  TimerFinished        > ast::process(), Elapsed=997.049µs
2024-01-30T07:09:30.631Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=127.357µs
// ..snip..
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
// Finish

We can then see where the real performance hit is when running this against a MUCH larger file/diff (the file being diffed is ~8.4MB; ~249,128 lines long)

diffsitter debug output from large example ```bash ⇒ du -sh unpacked/_next/static/chunks/pages/_app.js 8.4M unpacked/_next/static/chunks/pages/_app.js ⇒ cat unpacked/_next/static/chunks/pages/_app.js | wc -l 249128 ⇒ git difftool --tool diffsitter-debug HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js 2024-01-30T07:07:58.282Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-wpgXXK/_app.js can be parsed 2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings 2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript 2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript 2024-01-30T07:07:58.282Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-6fi10i/_app.js can be parsed 2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings 2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript 2024-01-30T07:07:58.282Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript 2024-01-30T07:07:58.282Z DEBUG diffsitter > Extensions for both input files are supported 2024-01-30T07:07:58.287Z DEBUG libdiffsitter > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-wpgXXK/_app.js to string 2024-01-30T07:07:58.287Z INFO libdiffsitter > Will deduce filetype from file extension 2024-01-30T07:07:58.291Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings 2024-01-30T07:07:58.291Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript 2024-01-30T07:07:58.291Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript 2024-01-30T07:07:58.291Z DEBUG libdiffsitter::parse > Constructed parser 2024-01-30T07:07:59.790Z DEBUG libdiffsitter::parse > Parsed AST 2024-01-30T07:07:59.791Z INFO TimerFinished > parse::parse_file(), Elapsed=1.50364822s 2024-01-30T07:07:59.795Z DEBUG libdiffsitter > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-6fi10i/_app.js to string 2024-01-30T07:07:59.795Z INFO libdiffsitter > Will deduce filetype from file extension 2024-01-30T07:07:59.800Z INFO libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings 2024-01-30T07:07:59.800Z INFO libdiffsitter::parse > Using tree-sitter parser for language typescript 2024-01-30T07:07:59.800Z INFO libdiffsitter::parse > Succeeded loading grammar for typescript 2024-01-30T07:07:59.800Z DEBUG libdiffsitter::parse > Constructed parser 2024-01-30T07:08:01.160Z DEBUG libdiffsitter::parse > Parsed AST 2024-01-30T07:08:01.160Z INFO TimerFinished > parse::parse_file(), Elapsed=1.364231801s 2024-01-30T07:08:01.985Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=825.767557ms 2024-01-30T07:08:02.972Z INFO TimerFinished > ast::process(), Elapsed=1.812168007s 2024-01-30T07:08:03.814Z INFO TimerFinished > ast::from_ts_tree(), Elapsed=841.781925ms 2024-01-30T07:08:04.787Z INFO TimerFinished > ast::process(), Elapsed=1.815188386s 2024-01-30T07:20:22.478Z INFO TimerFinished > diff::compute_edit_script(), Elapsed=737.685108096s 2024-01-30T07:20:22.502Z INFO libdiffsitter::render::unified > Using stack style horizontal for title 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 52698 - 52698) 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Title string has length of 7 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Printing line 52698 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > End line 52698 // ..snip 17,524 lines.. 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96300 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > End line 96300 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301 // I'm not sure if this actually finished successfully, or just silently crashed at this point... I think it might have crashed, as I would have expected to see a log line like: // DEBUG libdiffsitter::render::unified > End hunk // As well as the actual diff output ```

Reducing that output to just show the relevant timing events:

// Start
2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:07:59.791Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.50364822s
// ..snip..
2024-01-30T07:08:01.160Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.364231801s
2024-01-30T07:08:01.985Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=825.767557ms
2024-01-30T07:08:02.972Z INFO  TimerFinished        > ast::process(), Elapsed=1.812168007s
2024-01-30T07:08:03.814Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=841.781925ms
2024-01-30T07:08:04.787Z INFO  TimerFinished        > ast::process(), Elapsed=1.815188386s
2024-01-30T07:20:22.478Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=737.685108096s
// ..snip..
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301
// Finish

We can see that ~12.29 minutes was spent in diff::compute_edit_script(), and then it looks like the script might have silently crashed, as I didn't see the diff output, nor the final DEBUG libdiffsitter::render::unified > End hunk line that I would have expected to see.

Note: I was piping this command to subl (Sublime Text), despite what the code block above was edited to look like; I haven't tried running this again yet printing directly to STDOUT/a file. I also haven't tried running this without debug mode to see if it somehow doesn't crash that way.

As a point of comparison, difftastic was seemingly able to diff this file in ~2.46sec total:

⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   2.63s user 0.46s system 47% cpu 6.494 total
subl  0.01s user 0.02s system 0% cpu 6.746 total

Though with a lot of this printed among the diff output:

_app.js --- 1/674 --- Text (8.39 MiB exceeded DFT_BYTE_LIMIT)

It seems that when DFT_BYTE_LIMIT is exceeded, difftastic falls back to just doing a basic text diff, so those above timings aren't really a fair comparison. We can override this with --byte-limit and DFT_BYTE_LIMIT:

⇒ difftastic -h

Difftastic 0.52.0

..snip..

OPTIONS:

..snip..

        --byte-limit <LIMIT>           Use a text diff if either input file exceeds this size. [env:
                                       DFT_BYTE_LIMIT=] [default: 1000000]

..snip..

        --graph-limit <LIMIT>          Use a text diff if the structural graph exceed this number of
                                       nodes in memory. [env: DFT_GRAPH_LIMIT=] [default: 3000000]

Let's set this to something much higher than we need for now:

20*1024*1024 = 20971520

Updating our .gitconfig:

# https://github.com/Wilfred/difftastic
[difftool "difftastic"]
  cmd = difft --byte-limit 20971520 "$LOCAL" "$REMOTE"

Running that command again:

 ⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   12.42s user 1.10s system 79% cpu 17.043 total
subl  0.01s user 0.02s system 0% cpu 17.248 total

It takes a bit longer, but then we hit a different error/limit:

_app.js --- 1/674 --- Text (2 JavaScript parse errors, exceeded DFT_PARSE_ERROR_LIMIT)

From the help:

⇒ difft --help | subl

..snip..

        --parse-error-limit <LIMIT>
            Use a text diff if the number of parse errors exceeds this value.

            [env: DFT_PARSE_ERROR_LIMIT=]
            [default: 0]

There seem to be 674 matches of DFT_PARSE_ERROR_LIMIT in the output, which corresponds to the 674 chunks of output.

I didn't see any specific errors output related to the parsing though.. I wonder how we can see what the actual issue was?

It seems we can dump the AST either from tree-sitter directly, or the difftastic filtered version of it as a debug:

⇒ difft --help | subl

..snip..

DEBUG OPTIONS:
        --dump-syntax <PATH>
            Parse a single file with tree-sitter and display the difftastic syntax tree.

        --dump-ts <PATH>
            Parse a single file with tree-sitter and display the tree-sitter parse tree.

Full tree-sitter AST:

⇒ time difft --dump-ts unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic-tree-sitter.ast
difft --dump-ts unpacked/_next/static/chunks/pages/_app.js >   5.08s user 9.49s system 95% cpu 15.312 total

⇒ du -h _app.js-difftastic-tree-sitter.ast
224M    _app.js-difftastic-tree-sitter.ast

⇒ cat _app.js-difftastic-tree-sitter.ast| wc -l
 2533687

difftastic 'syntax' AST:

⇒ time difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic.ast
difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js >   434.80s user 43.33s system 93% cpu 8:29.31 total

⇒ du -h _app.js-difftastic.ast
2.2G    _app.js-difftastic.ast

⇒ cat _app.js-difftastic.ast | wc -l
 10571815
0xdevalias commented 9 months ago

After the above explorations, I ended up taking a different tact and started exploring the 'post processing git diff' approach.

Initial PoC tests seemed to show some merit, so I hacked them together into a script that can filter an existing git diff passed in from a file, or STDIN:

0xdevalias commented 9 months ago

With the current PoC implementations in the diff minimiser, we're grouping by the diff chunks, and then by added/removed lines within those chunks.

Sometimes a section of code will churn and move a large amount of lines from one chunk, to another chunk, which won't get noticed by the minimiser currently.

It would be good if we could process these in a similar way to git diff's color-moved, with our 'ignore identifier changes', so that we can suppress both the 'added' and 'removed' chunks if they're otherwise unchanged (or minimise the diff between them to only show what has changed, rather than showing them as large 'disconnected blocks')


Edit: See also:

--

VSCode recently added a new feature called move code detection. The algorithm is surprisingly simple: after normal diff, calculate similarity(deletion, insertion) for every deletion & insertion pair. The code can be found here https://github.com/microsoft/vscode/blob/166097a20cbd06d10d255ef561837c439f372de3/src/vs/editor/common/diff/defaultLinesDiffComputer/computeMovedLines.ts#L44-L85.

But intuitively demonstrating the moving relationship on TUI might be a challenge.

Originally posted by @QuarticCat in https://github.com/Wilfred/difftastic/issues/508#issuecomment-1734691395

--

Have you looked into git diff's --color-moved at all?

Personally I tend to use the zebra mode of it. From my git aliases:

  # https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-movedltmodegt
  # https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-moved-wsltmodesgt
  # https://git-scm.com/docs/git-config#Documentation/git-config.txt-color
  # https://git-scm.com/docs/git-config#Documentation/git-config.txt-colordiffltslotgt
  diff-refactor = \
    -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal

Originally posted by @0xdevalias in https://github.com/Wilfred/difftastic/issues/539#issuecomment-1916033305

0xdevalias commented 7 months ago

A lot of this will probably be uninteresting/useless noise, but I just made a repo and pushed all of my hacky WIP/PoC AST exploration scripts to it so they're captured in one place:

What might be of interest though, is that I pushed the current hacky state of my diff minimiser code there too. I haven't checked it for a good few weeks, so can't even remember if it's in a runnable state currently, but if nothing else, there might be some ideas hidden away in there that help explain what I was referring to above:

When I get a chance to get back to them, I plan to finish my research/refinements on the best method(s), and then clean it all up back to a single useful script that I will commit to this repo. But until then, those may be of some interest to someone.

Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2024566283

0xdevalias commented 7 months ago

Here's a fun little hack I just thought up (while the diff minimiser script still breaks --color-moved):

  • Diff 2 files, run them through the minimiser, save that diff to a file
  • Apply the output of the minimised diff to the original file, save that to a file
  • Re-run diff with --color-moved between the original file, and the original+minimisedDiffPatch file

Currently this doesn't work fully, as I think I'm not properly updating the line metadata in the diff hunks when I've removed sections from them in the diff minimiser.. but something like this (in theory):

⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | \
sed 's|a/unpacked/_next/static/chunks/pages/_app.js|a/_app.original.js|g' | \
sed 's|b/unpacked/_next/static/chunks/pages/_app.js|b/_app.patched.js|g' \
> _app.minimised.diff

⇒ git show 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js > _app.original.js

⇒ git apply _app.minimised.diff

⇒ git -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js

Edit: I wrote a new scripts/fix-diff-headers.js (https://github.com/0xdevalias/chatgpt-source-watch/commit/f7deec74cca9127bc2f18279d83f7a088f3f35f1) script that seems to get it closer to working..

⇒ ./scripts/fix-diff-headers.js _app.minimised.diff _app.fixed.diff
Reading diff from _app.minimised.diff and writing corrected diff to _app.fixed.diff
Diff successfully corrected and written to _app.fixed.diff

⇒ patch _app.original.js _app.fixed.diff -o _app.patched.js
patching file _app.original.js
6 out of 90 hunks failed--saving rejects to _app.patched.js.rej

⇒ git -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js

That way, instead of just seeing a giant chunk of 'removed' and a giant chunk of 'added' when this module moved; we can see the dimmed lines which were moved (but otherwise unchanged), and spend less time/effort having to consider them:

image

An even better version of this would be a diff --color-moved implementation that was able to ignore identifier names changing entirely (in which case I suspect it would mark this entire block as 'moved but not changed'); but as far as I am currently aware, the only way to do that would be to basically re-implement the whole diff --color-moved logic ourselves; which would basically end up with us writing our own AST differ I believe.

Though the way I was originally intending on fixing/improving the diff minimiser script was to try just maintaining the ANSI colouring of the input lines, and outputting the lines with the same colouring at the end.

Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2073982692