Simple diff analysis based on strings and identifiers

0xdevalias / chatgpt-source-watch

Analyzing the evolution of ChatGPT's codebase through time with curated archives and scripts

https://github.com/0xdevalias/chatgpt-source-watch/blob/main/CHANGELOG.md

Other

271 stars 16 forks source link

Simple diff analysis based on strings and identifiers #10

Open michaelskyba opened 6 months ago

michaelskyba commented 6 months ago

This is an idea for a type of analysis of code diffs. This issue is just for tracking notes and ideas

Example input 1

+            guidance: function (e) {
+              var t = e.isSearchable,
+                n = e.isMulti,
+                r = e.isDisabled,
+                i = e.tabSelectsValue;
+              switch (e.context) {
+                case "menu":
+                  return "Use Up and Down to choose options"
+                    .concat(
+                      r
+                        ? ""
+                        : ", press Enter to select the currently focused option",
+                      ", press Escape to exit the menu",
+                    )
+                    .concat(
+                      i
+                        ? ", press Tab to select the option and exit the menu"
+                        : "",
+                      ".",
+                    );

Here the extraction might be

guidance
e
t
isSearchable
n
isMulti
isDisabled
i
tabSelectsValue
context
"menu"
"Use Up and Down to choose options"
concat
r
""
", press Enter to select the currently focuse doption"
", press Escape to exit the menu"
", press Tab to select the option and exit the menu"
"."

Of course, this gives you far less information than the original, but I think it could be a good trade-off in cases where you want to look at the diff a little bit but don't have time to see everything.

Example input 2

Since common generic ones like e, t, n, "", and "." would show up frequently in any context, they would have already been seen in the past, and therefore filtered out. You'd more focus on the e.g. "Use Up and Down to choose options", with some kind of convenient way to jump back to see it in-context in the code.

For input like the following:

+      var a = n(72843);
+      function s(e, t) {
+        for (var n = 0; n < t.length; n++) {
+          var r = t[n];
+          (r.enumerable = r.enumerable || !1),
+            (r.configurable = !0),
+            "value" in r && (r.writable = !0),
+            Object.defineProperty(e, (0, a.Z)(r.key), r);
+        }
+      }
+      function l(e, t, n) {
+        return (
+          t && s(e.prototype, t),
+          n && s(e, n),
+          Object.defineProperty(e, "prototype", { writable: !1 }),
+          e
+        );
+      }

, none of the names or strings would probably be new, and so you wouldn't see it at all. This is intended, because I can't gleam any conclusions from looking at it, and thus would prefer not to see it

Glenn's comments

https://twitter.com/_devalias/status/1770284997385277554

I think given the size of a lot of the JS files, and the diffs themselves; it would probably end up being a LOT of strings; which might be confusing when removed from the rest of the context of the surrounding code.

For large diffs I think it'd be a lot, but strings and names are a subset of the raw diff, so it should still be less work than a full manual analysis. The idea is to just visually filter through them until you see a name/string that looks interesting on their own, which could lead to something good in-context.

It should be fairly easy to prototype a script using babel parser and babel traverse though. You would add a rule or couple to the traverse so that it matches on whatever strings are called in the AST; and then output them to console or a file or similar.

Haven't worked with Babel but some relevant docs seem to be

Are there other AST parsers too? Would something like TreeSitter work? I'd generally prefer to avoid node.js if it's not required

Then you would just diff that output file of strings between one build and the next. If code moves around between builds it might introduce it’s own form of noise (but maybe git diff —color-moved would handle that still anyway)

I haven't seen enough diffs to exactly anticipate how these would look like but there might be different solutions like color-moved that could work depending on how it goes

I also noticed you liked some of my tweets about my more generalised diff minimiser; which would reduce the noise of things a fair bit overall as well. I still need to polish that and commit/upload it; been super busy lately and haven’t had a chance to yet.

https://github.com/0xdevalias/chatgpt-source-watch/issues/3

Feel free to open an issue on the ChatGPT Source watch repo about the string extractor idea + link back to these tweets/copy the relevant info in. I’d be happy to give some more pointers about it and/or include it in the repo if you wanted to work on it.

Yeah, I want to make a prototype and see if it will kind of work. I'm still not sure on the implementation, though; the most efficient system might be to integrate with a text editor, which makes it harder to be replicable

michaelskyba commented 6 months ago

According to a random post I read on Hacker News, difftastic uses tree-sitter too, so I suppose it would be general enough for this use case https://tree-sitter.github.io/tree-sitter/

0xdevalias commented 6 months ago

Since common generic ones like e, t, n, "", and "." would show up frequently in any context, they would have already been seen in the past, and therefore filtered out. You'd more focus on the e.g. "Use Up and Down to choose options"

@michaelskyba For the variable names like e, t, n, etc; if you're using an AST parser they'll show as a different node type to proper "" strings, so they'll be pretty easy to avoid.

with some kind of convenient way to jump back to see it in-context in the code.

@michaelskyba An easy way to connect from those strings back to the context in the code would definitely be interesting. Almost feels like something you could do with monaco editor (the core of vscode), and symbol links/similar:

I know @j4k0xb has done some cool stuff with using monaco as the web IDE for webcrack:

https://github.com/j4k0xb/webcrack

And I think @pionxzh 's wakaru also uses monaco (or plans to for the newer IDE; I can't remember exactly off the top of my head):

https://github.com/pionxzh/wakaru

As a first exploration/PoC, I would probably consider the contextual linking back to be out of scope though, as it has the ability to quickly become a complicated rabbithole/distraction from the core features.

For large diffs I think it'd be a lot, but strings and names are a subset of the raw diff, so it should still be less work than a full manual analysis. The idea is to just visually filter through them until you see a name/string that looks interesting on their own, which could lead to something good in-context.

@michaelskyba Mm, makes sense.

Are there other AST parsers too? Would something like TreeSitter work? I'd generally prefer to avoid node.js if it's not required

@michaelskyba Yeah, there are definitely lots of AST parsers out there. Technically you could probably use TreeSitter or similar for this.

I'm curious what your reasoning for wanting to avoid using node is though? Pretty much all of the scripts and automation in this repo currently are currently written in node/JS/TS; for many use cases it's a pretty useful tool.

If you want to go deep on this subject, there's heaps of stuff on my gists, eg.:

https://gist.github.com/0xdevalias

According to a random post I read on Hacker News, difftastic uses tree-sitter too, so I suppose it would be general enough for this use case

@michaelskyba Yup, tree-sitter is pretty generic.

Yeah, I want to make a prototype and see if it will kind of work. I'm still not sure on the implementation, though; the most efficient system might be to integrate with a text editor, which makes it harder to be replicable

@michaelskyba If you're happy to use node and babel parser/etc, then I can definitely help you get started with it. There are already some examples within this repo (albeit not using traverse I don't think):

https://github.com/search?q=repo%3A0xdevalias%2Fchatgpt-source-watch%20babel%20parser&type=code
- https://github.com/0xdevalias/chatgpt-source-watch/blob/1c127295d3aa2781b933a56d96bfb386380eb029/scripts/extract-webpack-urls.js

There are a few 'AST Explorer' tools that are super useful when working with AST's as well:

I would personally start by writing it as a standalone script that just writes to the console or a file, that way you can quickly 'prove the concept' and see if it will be beneficial. Then later on, if you wanted to enhance the UX (user experience) of using it by integrating it into a text editor / IDE / GUI / etc, that could be a path for further exploration.

0xdevalias commented 6 months ago

Here's a quick/dirty example that I hacked together with ChatGPT + some patterns from existing code. I haven't deeply tested it works properly, but from a quick skim through + basic test, it looks like it should probably work, or be close to doing so:

Example String Extraction Code

```js #!/usr/bin/env node const fs = require('fs'); const path = require('path'); const parser = require('@babel/parser'); const traverse = require('@babel/traverse').default; let DEBUG = false; // Initial debug state function displayUsage(scriptName) { console.log(`Usage: ${scriptName} Options: --debug Enable debug mode. -h, --help Display this usage information.`); } function parseCommandLineArguments() { const scriptName = path.basename(process.argv[1]); const args = process.argv.slice(2); if (args.includes('-h') || args.includes('--help')) { displayUsage(scriptName); process.exit(0); } const debugFlagIndex = args.indexOf('--debug'); if (debugFlagIndex !== -1) { DEBUG = true; args.splice(debugFlagIndex, 1); } if (args.length !== 1) { displayUsage(scriptName); console.error('\nError: Invalid number of arguments provided.'); process.exit(1); } const filePath = args[0]; let code; try { code = fs.readFileSync(filePath, 'utf8'); } catch (err) { console.error(`Error reading file: ${filePath}\n${err.message}`); process.exit(1); } return { code }; } function parseCodeToAST(code) { return parser.parse(code, { sourceType: 'module', plugins: [ // Add any necessary plugins here ], }); } function extractStrings(ast) { const strings = []; traverse(ast, { StringLiteral: ({ node }) => { strings.push(node.value); }, TemplateLiteral: ({ node }) => { node.quasis.forEach((element) => { strings.push(element.value.cooked); }); }, }); return strings; } function main() { const { code } = parseCommandLineArguments(); const ast = parseCodeToAST(code); const strings = extractStrings(ast); if (DEBUG) { console.log('[DEBUG] Extracted Strings:', strings); } else { console.log(strings.join('\n')); } } main(); ```

The main parts to pay attention to are parseCodeToAST and extractStrings; the logic within them might need some tweaking/etc though; and it could no doubt be extended to better capture the context they're being extracted from/etc to make it easier to jump back to them later, etc.

Example Output

```bash ⇒ ./scripts/extract-js-strings.js unpacked/_next/static/chunks/webpack.js define cannot be used indirect object function object static/chunks/ 1092ac6e 9784a43c queryString c029043f tsub-middleware 30750f44 303a24d8 420d684a schemaFilter auto-track legacyVideos 65fd0ec4 sso 5be645cc remoteMiddleware ajs-destination 0d9e9cd9 b9f778a8 adb5c70d . 660fa56f7d9f8841 f366c884f1a0069d 5c208462c867bb6d 11bb38d51776e36c 3ff460de12e80a25 1695899797a18685 c39541320838d2e1 7f0de228a3562cc0 35c9d0cb5854cadd 1aac68555afd4d6b 035f577202ad7345 fd18bd86faeaa7b0 e9a991b898301078 361ecd58fa0ac9f5 797361ea99a892a2 8596a5c250b2c4f7 b4746c7a1f512b5a 36f791c3f2dbf0ec 29104c892767faf2 415799798aaeeffc 04ab16a240f4ce97 32ab3d24800a1a80 3421eea06995d5df 0f0eee9f9d7a9a3a df174924c6968fe4 c865be216348dd51 2f6ad9ce262d966a 4c4cb925af3442d0 c63200a5c100c707 46b523e60584f44f 739643ec58c60aab 2fad09c9cd727ecc a08151d3efed095c 0c0ee75dab4b3a1d f0c19e9fbc4a2762 1ce750acd0106d6e b80d67713515dff2 d028af8c55d386c9 b41a5c7bc5017269 94b9b8a3298ce71f 352bbad25795ea42 60463b33da33b8b8 6ef5e476dad7ed16 2ae7c3fec20da7cb 73ef26c15091fb86 eecba8f575f628a2 0445a5d12846771b 9dc2787a40cce737 1a42259f0c3d91c1 6c7d9ca4e6244388 70979731ae1a472e c3edeef63a58e362 270e8342cc6087e0 a671d3e99d9a3215 ee7e1452e33739e6 b0eb71c1fc6253b5 4b3fe5232d753e7c 2177f28b937c71b0 583ec3a0c8764140 018b8b2946dc8b20 10400e67574b5a9a abf37f1db8be5c89 3a60974229840ffd c682389becf401d9 f235c81abfebf195 6d3d8ed11b0e9709 367b294ff3d2e4a0 5fa4b187af7d697f 42809fbb8b2f331e 01c9e3f110f368ae c6c0f36a9a82d8dd fa6fc1112abad268 87e726622393959c 84cbf69a0bf382da edae42c864b47324 bc4f77e47caaeac6 cbbca09eae978a72 79c8284e0121e7e4 d018f6e9a9ebea16 16a1a63dd7d6adcb 207481a6c8e707c7 5b368a3b1f211414 32ab39c72c4d4509 3b0e463179d66d5e 6abd0ae9c08306a9 36bb6cb6e7d3b8e4 4ed8f6b71d8bfe92 aead845693cd969c 1f34ff95cbccd78d d9cd8bc88bf780e8 92961eb1ec331066 550cbb4354158933 .js static/css/ f3d0bad461ffb80b 944a9aa5c00c7d0e d92eaad6fd257c73 .css object return this object _N_E: script src data-webpack script utf-8 nonce data-webpack / anonymous timeout undefined Module __esModule undefined nextjs#bundler https://cdn.oaistatic.com/_next/ link stylesheet text/css load load missing Loading CSS chunk failed. ( ) CSS_CHUNK_LOAD_FAILED / anonymous link data-href href stylesheet style data-href load missing Loading chunk failed. ( : ) ChunkLoadError chunk- ```

While the above example shows that the concept works, running it on a larger file ends up with a LOT more output content (eg. 57k lines in this case):

⇒ ./scripts/extract-js-strings.js unpacked/_next/static/chunks/pages/_app.js | wc -l
   57025

Obviously most of that wouldn't change between builds though, so would end up being a lot less all in all.

Edit: Committed the above basic script in 210a87098dfca4d69dfcff9642fbefdb74f8b514 for future reference/usage.

0xdevalias commented 6 months ago

Then you would just diff that output file of strings between one build and the next. If code moves around between builds it might introduce it’s own form of noise (but maybe git diff —color-moved would handle that still anyway)

I haven't seen enough diffs to exactly anticipate how these would look like but there might be different solutions like color-moved that could work depending on how it goes

Something like this could also be useful for reducing the 'noise' of code that moves around:

https://github.com/orgs/community/discussions/9632#discussioncomment-8769709
- As an alternative, you can use VS Code to review while highlighting moved code. It's available behind an experimental flag. It also works in githubs online version. So, simply press . in an open PR and review there.
- https://code.visualstudio.com/updates/v1_82#_moved-code-detection
- This iteration we polished the moved code detection feature. It can be enabled with "diffEditor.experimental.showMoves": true or in the diff editor context menu. When enabled, code blocks that are moved from one location to a different location in the same file are detected and arrows are drawn to indicate where the code blocks moved to.
  
  Code moves are also detected when they are slightly modified. The Compare button can be used to compare the block before and after the move.

michaelskyba commented 6 months ago

An easy way to connect from those strings back to the context in the code would definitely be interesting. Almost feels like something you could do with monaco editor (the core of vscode), and symbol links/similar:

Yeah, possibly; that's the kind of thing I was thinking could be useful as a text editor integration. But I think that would make sense as a wrapper around the main script. The output itself could maybe have the line numbers to the left of each finding, which would be parsed by the wrapper

I'm curious what your reasoning for wanting to avoid using node is though? Pretty much all of the scripts and automation in this repo currently are currently written in node/JS/TS; for many use cases it's a pretty useful tool.

I guess I just personally find the JS/npm ecosystem pretty annoying compared to e.g. Python or Go. It's not a dealbreaker though

If you want to go deep on this subject, there's heaps of stuff on my gists

Yeah, looks useful, I linked to it as part of my notes for the ChatGPT frontend

Here's a quick/dirty example that I hacked together with ChatGPT + some patterns from existing code

Cool, that's pretty good

I still want to look at Tree-sitter in case I like it more, but this kind of API from @babel/traverse looks fine enough

function extractStrings(ast) {
  const strings = [];
  traverse(ast, {
    StringLiteral: ({ node }) => {
      strings.push(node.value);
    },
    TemplateLiteral: ({ node }) => {
      node.quasis.forEach((element) => {
        strings.push(element.value.cooked);
      });
    },
  });
  return strings;
}

michaelskyba commented 6 months ago

Looks like building .so language files from source is deprecated in py-tree-sitter, but there are no official python bindings for the javascript grammar. Welp, that's annoying...

The third-party Go bindings in https://github.com/smacker/go-tree-sitter look more promising maybe

michaelskyba commented 6 months ago

Yeah, go-tree-sitter seems to work. Counting lines should be possible by looking at sourceCode.

michaelskyba commented 6 months ago

From the testing I've done to this point, I think would prefer go-tree-sitter to Babel, so I'll continue developing the formatting here. It'd still be useful to have your extract-js-strings.js for reference, though.

j4k0xb commented 6 months ago

Another approach I used in the past is renaming every variable and its references etc to a so the diff can detect actual changes pretty reliably: and ignore all name changes caused by minifying:

I wonder if its possible to get all the diff positions and then render it using the original names

michaelskyba commented 6 months ago

I wonder if its possible to get all the diff positions and then render it using the original names

That sounds like it could indeed save time on skipping some changes, but I'd expect some minification diffs to be structurally different too, such that the same values are written differently, even if the actual name is the same. But I expect @0xdevalias would have a more informed perspective

In other news, I got a basic byte tracker that checks for newlines, to map each name properly:

1:e
1:t
1:n
2:"use strict"
3:a
4:r
4:n
5:s
6:s
6:Object
6:assign
7:Object
7:assign
7:bind
8:e
9:t
9:t
9:arguments
9:length
9:t

j4k0xb commented 6 months ago

afaik minifiers should produce the exact same structure for unmodified regions of the source

michaelskyba commented 6 months ago

afaik minifiers should produce the exact same structure for unmodified regions of the source

Oh, that's more convenient, then. Does the formatter (biome/prettier) have the same property?

j4k0xb commented 6 months ago

depends, they sometimes introduce additional line breaks, parentheses, " " around all properties if one changes, etc

0xdevalias commented 6 months ago

RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015460661

But I think that would make sense as a wrapper around the main script. The output itself could maybe have the line numbers to the left of each finding, which would be parsed by the wrapper

@michaelskyba Yeah definitely. Having a script that is able to be run standalone, but that can also be wrapped into an IDE/other tooling for a nicer GUI is how I would look at approaching it myself.

I guess I just personally find the JS/npm ecosystem pretty annoying compared to e.g. Python or Go. It's not a dealbreaker though

@michaelskyba Mm, fair enough. I find they all have their own issues and strengths, but don't tend to find one is objectively better than the other in most cases. Each to their own though.

I still want to look at Tree-sitter in case I like it more, but this kind of API from @babel/traverse looks fine enough

@michaelskyba Mm, makes sense.

RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015650726

Yeah, go-tree-sitter seems to work. Counting lines should be possible by looking at sourceCode.

@michaelskyba Nice!

RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015666180

From the testing I've done to this point, I think would prefer go-tree-sitter to Babel, so I'll continue developing the formatting here. It'd still be useful to have your extract-js-strings.js for reference, though.

@michaelskyba That's fair enough. extract-js-strings.js is committed in this repo, and no plans to remove it, so will be there for reference.

Curious, aside from not liking node/etc, what aspects of go-tree-sitter do you find preferable/more useful than Babel? Or is it just that it's not node?

RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015789399

Another approach I used in the past is renaming every variable and its references etc to a so the diff can detect actual changes pretty reliably

@j4k0xb Yeah, that is more or less the same concept I'm following with my diff minimiser; though in the current PoC version of it, I'm 'stabilising' the identifier names after running an initial diff on it, rather than doing so pre-diff.

I still haven't cleaned it up + pushed the code here yet, but you can see some snippets of updates from when I was last working on it here:

https://github.com/0xdevalias/chatgpt-source-watch/issues/3#issuecomment-1925508005

I wonder if its possible to get all the diff positions and then render it using the original names

@j4k0xb If by 'get the diff positions' you mean the sections that are still diff after stabilising the names, then that is more or less how my diff minimiser script is doing things currently.

It basically ingests a diff, processes the added/removed chunks with an AST parser to normalise any identifiers, and then if the 'identifier normalised' code is equal, then it removes that chunk from the diff that is output at the end. That way the identifiers are only ever changed within the minimiser check itself, but you still end up with the original names in the output.

RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015799139

I'd expect some minification diffs to be structurally different too, such that the same values are written differently, even if the actual name is the same.

@michaelskyba A lot of the time the majority of the code remains stable.

Occasionally I'll see webpack chunks move around a bit, and/or change the chunk ID that is referring to them/etc, but I think that usually only happens when related code has actually changed in the build, and then that causes them to be reworked when they are minimised/etc.

I pre-process the code with prettier/similar (switched from prettier to biome for speed, but it's essentially doing the same job) to stabilise the layout/formatting/etc; and I run the formatter twice because sometimes it seems to not fully work when only run once (at least that seemed to be the case with prettier)

In other news, I got a basic byte tracker that checks for newlines, to map each name properly

@michaelskyba Nice!

RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015828448

afaik minifiers should produce the exact same structure for unmodified regions of the source

@j4k0xb That aligns with my thoughts/experience with them too.

RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015843139

Does the formatter (biome/prettier) have the same property?

@michaelskyba Generally speaking, yeah, they tend to be pretty 'stable' in that regard (with the slight caveat that I mentioned above RE: needing to sometimes run them twice)

RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015847277

depends, they sometimes introduce additional line breaks, parentheses, " " around all properties if one changes, etc

@j4k0xb I don't think I've seen this happen, or if I have, it's been so rare that I don't remember it/notice it.

The main times I have noticed an issue is when I need to run it twice because it failed to properly format everything the first time; and when I accidentally change the settings I am passing to it (like from using " to using '), in which case it definitely produces a lot of excess noise.

michaelskyba commented 6 months ago

Curious, aside from not liking node/etc, what aspects of go-tree-sitter do you find preferable/more useful than Babel? Or is it just that it's not node?

I didn't investigate Babel enough to have an even somewhat-informed opinion, but in my head, the interface that I wanted was to write a recursive function that would DFS through the tree, which is what I ended up being able to make easily with go-tree-sitter. From what I can tell just from seeing extract-js-strings.js, Babel traverse already includes its own traverse function, and takes objects that specify which code to run depending on the type of node found. This is less intuitive to me, but maybe there is a way to implement it in the former style in Babel too. All else equal, I'd prefer Go to node, though, yeah.

Something I've realized is that diffs of JavaScript files are further from valid JavaScript than I thought, such that it's impossible for tree-sitter to parse them out of the box. I think the way it would have to work is to extract the strings/identifiers of the new file and grep --invert-match against all previous findings.

But I'd like to view them in the context of the diff rather than just the new file. Just by going to a line and being told that xxx is new, the context of the change isn't obvious. So maybe there's also a way to find, given a line L in the new file, which line in the diff shows L being added. I'd expect it to be more complicated than a grep --fixed-strings because the same line could have been added multiple times.

It basically ingests a diff, processes the added/removed chunks with an AST parser to normalise any identifiers, and then if the 'identifier normalised' code is equal, then it removes that chunk from the diff that is output at the end. That way the identifiers are only ever changed within the minimiser check itself, but you still end up with the original names in the output.

How does your AST parse the chunks of those diffs? For a diff like this:

diff --git a/1.js b/2.js
index ad8dee1..923ec19 100644
--- a/1.js
+++ b/2.js
@@ -1,2 +1,2 @@
-console.log("bar \
+console.log("foo \
 bar")

, I don't see an obvious way to dodge the string quoting, besides mapping the changed lines back to the files they came from, which itself sounds difficult.

0xdevalias commented 6 months ago

I didn't investigate Babel enough to have an even somewhat-informed opinion, but in my head, the interface that I wanted was to write a recursive function that would DFS through the tree, which is what I ended up being able to make easily with go-tree-sitter. From what I can tell just from seeing extract-js-strings.js, Babel traverse already includes its own traverse function, and takes objects that specify which code to run depending on the type of node found. This is less intuitive to me, but maybe there is a way to implement it in the former style in Babel too.

@michaelskyba As far as I am aware, babel-traverse implements an AST visitor pattern, which does a DFS through the AST tree. So I believe you basically just re-invented the wheel there.

https://en.wikipedia.org/wiki/Visitor_pattern

It doesn't inherently have to specify which type of node to run against, it could run for all of them (you can see the most basic example using enter in the links above), but given we only wanted to match certain node types, it made the most sense to specify that filter up front.

You may find that last section on AST traversal a useful read even if you're not using Babel specifically, as it's a pretty common pattern from what I have seen.

Even with go-tree-sitter, you may be able to get closer to that functionality without having to manually brute force your own recursion through things, eg:

Something I've realized is that diffs of JavaScript files are further from valid JavaScript than I thought, such that it's impossible for tree-sitter to parse them out of the box.

@michaelskyba Different in what sense? Because the parts of the diff included aren't necessarily fully formed valid JavaScript?

I think the way it would have to work is to extract the strings/identifiers of the new file and grep --invert-match against all previous findings.

@michaelskyba Not sure I am following here. Why do you feel this would need to be done; and/or can you demonstrate your thoughts with an example to help explain it?

So maybe there's also a way to find, given a line L in the new file, which line in the diff shows L being added. I'd expect it to be more complicated than a grep --fixed-strings because the same line could have been added multiple times.

@michaelskyba Also not sure I'm fully following your logic here. Maybe the same clarity from explaining the above will help me here too?

How does your AST parse the chunks of those diffs? For a diff like this:
diff --git a/1.js b/2.js
index ad8dee1..923ec19 100644
--- a/1.js
+++ b/2.js
@@ -1,2 +1,2 @@
-console.log("bar \
+console.log("foo \
 bar")
, I don't see an obvious way to dodge the string quoting, besides mapping the changed lines back to the files they came from, which itself sounds difficult.

@michaelskyba What aspect of that are you asking about? The fact that the string is split over 2 lines, where one hasn't changed, and thus wouldn't be seen in the added/removed 'pair'; such that the code to be parsed would end up being just console.log("foo \?

I guess my first question would be whether that is ever a case that would exist after being prettified with prettier/biome/etc; but assuming it still would; for my needs, that is why I went from the standard 'strict' parser, to a more lenient parser + 'hacky code fixes' as vaguely mentioned here:

https://github.com/0xdevalias/chatgpt-source-watch/issues/3#issuecomment-1925508005

https://twitter.com/_devalias/status/1753711267137892684

A few more updates that I posted elsewhere as I was going; from 7100, to 6455, to 4268, and now down to 3913 lines.

Added a few more ‘pre AST parsing’ fixups that get the diff down even further: from the previous 7100 lines to 6455 lines

Trialled parsing the AST with a more lenient parser as a pre-step; filtering out unknown bits, then generating that back into code; that I then apply some of my previous ‘preparation hacks’ to, before parsing it with the more strict parser and doing the identifier normalisation.

That’s currently reduced the 6455 lines down to 4268 lines; and there are still leftover parsing errors that could potentially get that number lower still that I’m chipping away at.

Another alternative could be to completely replace the stricter parser with the more lenient one; but then I would have to rewrite a bunch of other code as well.

I could also potentially try a different lenient parser; or pre-process the code being passed to this one; as I believe it’s failing to parse some statements even; which is potentially leading to more of that noise filtering through..

I added a pre-processing step to the lenient parser which improved things a bit; then I added an actual fix to the AST from the lenient parser to help fix more complex ternaries that were challenging to fix in my simpler string append pre/post fix methods.

With those, we’re now down to 3913 lines.

Being able to properly detect and fix object properties that are missing their wrapping object would probably fix a bunch more of the issues.

Though maybe when I look at this next; I’ll just see whether using the lenient parser on its own gives a quicker/better result.

Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/3#issuecomment-1925508005

After that, I did end up trying the lenient parser on it's own, which still had some issues even after adding the hacky fixes to it; and then I believe I trialled using just the tokenizer from the lenient parser, which from memory ended up working pretty well for my needs, since it was still able to parse the Identifier nodes that I was needing to normalise.

Though for your use case of extracting the strings, I don't think parsing the diffs is the right approach. I would probably just parse the raw JS, and dump the full lot of strings out to a .txt file; then I would diff that whole text file of the latest version, against the whole text file of the previous version.

0xdevalias commented 6 months ago

A lot of this will probably be uninteresting/useless noise, but I just made a repo and pushed all of my hacky WIP/PoC AST exploration scripts to it so they're captured in one place:

https://github.com/0xdevalias/poc-ast-tools
- PoC scripts and tools for working with (primarily JavaScript) ASTs.

What might be of interest though, is that I pushed the current hacky state of my diff minimiser code there too. I haven't checked it for a good few weeks, so can't even remember if it's in a runnable state currently, but if nothing else, there might be some ideas hidden away in there that help explain what I was referring to above:

https://github.com/0xdevalias/poc-ast-tools/commit/9f7071f0bfea198e3bfe3a251b67d3340d76bd6c
- https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser.js
- https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser-poc-acorn.js

When I get a chance to get back to them, I plan to finish my research/refinements on the best method(s), and then clean it all up back to a single useful script that I will commit to this repo. But until then, those may be of some interest to someone.

michaelskyba commented 6 months ago

@michaelskyba As far as I am aware, babel-traverse implements an AST visitor pattern, which does a DFS through the AST tree. So I believe you basically just re-invented the wheel there.

It doesn't inherently have to specify which type of node to run against, it could run for all of them (you can see the most basic example using enter in the links above), but given we only wanted to match certain node types, it made the most sense to specify that filter up front.

Alright, yeah, Babel's system makes more sense now in the context of defining a visitor. I'd still prefer to use a general enter function and then have something like an if statement, but overall I could see it being more convenient than manual recursion.

traverse(ast, {
  enter(path) {
    if (
      path.node.type === "Identifier" &&
      path.node.name === "n"
    ) {
      path.node.name = "x";
    }
  }
});

I initially disliked tree-sitter's predicates when I first saw them but I've looked at some of the go.dev docs and maybe they would be cleaner, since I should still be able to track the source iteratively with

https://pkg.go.dev/github.com/smacker/go-tree-sitter#Node.Content

Now that I think about it, if this is in the context of being added to the scripts/ directory or similar, then it would make sense to follow what matches the existing style as best as possible. I haven't finalized the general design yet but I'll try to be more cognizant of that.

Something I've realized is that diffs of JavaScript files are further from valid JavaScript than I thought, such that it's impossible for tree-sitter to parse them out of the box.

@michaelskyba Different in what sense? Because the parts of the diff included aren't necessarily fully formed valid JavaScript?

Yes, that arbitrary segments taken from a JavaScript file won't be fully formed valid JavaScript by themselves. But also the surrounding diff syntax of e.g.

diff --git a/original.js b/vendor-6a62821ae00843ea.js?dpl=576b52374bc8a5e2b6fbb183818b6924a40171a2
index 05cb4d0..57d05c3 100644
--- a/original.js
+++ b/vendor-6a62821ae00843ea.js?dpl=576b52374bc8a5e2b6fbb183818b6924a40171a2

That could be filtered out depending on the output structure, though.

I think the way it would have to work is to extract the strings/identifiers of the new file and grep --invert-match against all previous findings.

@michaelskyba Not sure I am following here. Why do you feel this would need to be done; and/or can you demonstrate your thoughts with an example to help explain it?

Example: We have three version of a JS file: 0.js, 1.js, 2.js

The idea I'm describing in that quote is to do the following:

We extract the strings/identifiers of 0.js and put them into 0.txt.
Then, we extract strings/identifiers of 1.js, but only keep the ones not in 0.txt, and put those in 1.txt. Now, you could look at 1.txt to see the new strings/identifiers introduced in the 0.js --> 1.js revision.
Then, we extract the strings/identifiers of 2.js, but only keep the ones not in 0.txt, and not in 1.txt, and put those in 2.txt. Now, this will be the new strings/identifiers introduced in the 1.js --> 2.js revision.
etc.

This is the alternative to trying to first look at the diff between 0.js and 1.js or 1.js and 2.js, and then running the script on that diff as input. With the former idea of considering each file separately, the only input being passed to the AST program will be valid JavaScript, so it should always work as intended.

The downside is that if you look into 2.txt and see e.g. "DynamicModelSwitcherEnabled", you'd want to see which part of the 1.js --> 2.js diff shows that being introduced, which 2.txt doesn't tell you directly.

So maybe there's also a way to find, given a line L in the new file, which line in the diff shows L being added. I'd expect it to be more complicated than a grep --fixed-strings because the same line could have been added multiple times.

@michaelskyba Also not sure I'm fully following your logic here. Maybe the same clarity from explaining the above will help me here too?

Here, I'm talking about a potential solution to the downside I mentioned above. If 2.txt can report that "DynamicModelSwitcherEnabled" was found on line L in 2.js, it could be possible to look at the diff between 1.js and 2.js, and find where line L is added.

Hopefully that makes a bit more sense?

@michaelskyba What aspect of that are you asking about? The fact that the string is split over 2 lines, where one hasn't changed, and thus wouldn't be seen in the added/removed 'pair'; such that the code to be parsed would end up being just console.log("foo \?

Yes, which would be completely unparsable out of the box, as the rest of the code in the diff would be incorrectly considered a string. This was meant to be an example of how an AST parser can't pass a raw provided diff.

If you mean to not parse the file itself but to split it up into each diff chunk, and then to parse the removed / added lines separately, then that could potentially work better. I haven't tried that with tree-sitter.

I guess my first question would be whether that is ever a case that would exist after being prettified with prettier/biome/etc

I don't think this exact case of a string being split across multiple lines occurs in the real code, but something similar happened when I was first testing that broke most of the parsing. I might check again later if the exact context would be useful.

but assuming it still would; for my needs, that is why I went from the standard 'strict' parser, to a more lenient parser + 'hacky code fixes' as vaguely mentioned here:

After that, I did end up trying the lenient parser on it's own, which still had some issues even after adding the hacky fixes to it; and then I believe I trialled using just the tokenizer from the lenient parser, which from memory ended up working pretty well for my needs, since it was still able to parse the Identifier nodes that I was needing to normalise.

I see, that should work, but it confirms that there is no simple, universal solution.

Though for your use case of extracting the strings, I don't think parsing the diffs is the right approach. I would probably just parse the raw JS, and dump the full lot of strings out to a .txt file; then I would diff that whole text file of the latest version, against the whole text file of the previous version.

Yeah, that is essentially what I was thinking, except I wanted to only search for completely new entries. I'll see if I can make the string/identifier --> find insert line system work.

0xdevalias commented 6 months ago

Babel's system makes more sense now in the context of defining a visitor. I'd still prefer to use a general enter function and then have something like an if statement

@michaelskyba nods, yeah, that's fair.

Deciding whether it's better to use a raw enter + if pattern, or the basic 'single node type' pattern (like I used in my PoC, eg. StringLiteral(path) { /* ... */ }), or even seemingly 'multiple node types' (which I only learned about after linking the docs above, eg. "StringLiteral|TemplateLiteral"(path) { /* ... */ }) would largely depend on what the rest of the logic you need to implement will be I would suspect.

I initially disliked tree-sitter's predicates when I first saw them but I've looked at some of the go.dev docs and maybe they would be cleaner, since I should still be able to track the source iteratively with

pkg.go.dev/github.com/smacker/go-tree-sitter#Node.Content

@michaelskyba Quickly skimming through the docs on the using predicate S-expressions:

https://pkg.go.dev/github.com/smacker/go-tree-sitter#readme-predicates

For my own reference, it looks like you:

end up calling sitter.NewQuery
then sitter.NewQueryCursor
which returns a QueryCursor
then iterating over the results with qc.NextMatch
which returns QueryMatch
which contains a bunch of QueryCaptures
which refer to a Node

And then yeah, there seem to be heaps of useful things that Node gives you access to, including the node type, source, location within the source, etc.

I think this method is somewhat close to my 'named node type' visitor pattern that I used in my previous Babel example PoC.

Now that I think about it, if this is in the context of being added to the scripts/ directory or similar, then it would make sense to follow what matches the existing style as best as possible. I haven't finalized the general design yet but I'll try to be more cognizant of that.

@michaelskyba Yeah, if your goal is to include what you're working on in this repo, then matching the existing style of JS + Babel is ideal (unless there is a good reason to differ from it).

Though if this just ends up being your own personal exploration for learning, or to create a tool in it's own repo, obviously you can choose whatever language/lib choices you want to use/explore.

Yes, that arbitrary segments taken from a JavaScript file won't be fully formed valid JavaScript by themselves. But also the surrounding diff syntax

@michaelskyba nods, makes sense. So in my WIP PoC diff minimiser that I recently uploaded (Ref) I first parse the diff format using parse-diff, and then I do the AST parsing on the source chunks from within that parsed structure; which as mentioned above, still has the issue of 'arbitrary segments of javascript'.

Yes, which would be completely unparsable out of the box, as the rest of the code in the diff would be incorrectly considered a string. This was meant to be an example of how an AST parser can't pass a raw provided diff.

@michaelskyba Yup, makes sense.

If you mean to not parse the file itself but to split it up into each diff chunk, and then to parse the removed / added lines separately, then that could potentially work better. I haven't tried that with tree-sitter.

@michaelskyba Yeah, that was the approach I took in my WIP PoC diff minimiser (Ref):

A very rough walkthrough:

The main function calls parseArgsAndReadInput to read in the raw diff (from a git diff/similar)
It then calls parseDiff to parse the diff into objects
It then iterates through the files in the diff, then the diff chunks in each of those files
groupChunkChanges is called to group the added/removed lines within a chunk
Those groupedChanges are then iterated through, and compared with compareNormalizedCodeDiffs
Then later on a bunch of stuff is done to basically remove the changes that were considered equal, then output the newly filtered diff in the same format as was originally read in

It's currently quite messy/hacky code, with multiple implementations scattered throughout as I was exploring different ways of doing things (eg. compareNormalizedCodeDiffs, compareNormalizedCodeDiffsV2, etc)

It looks like the main AST parsing/comparison logic is in normalizeIdentifierNamesInCode / normalizeIdentifierNamesInCodeV2 / normalizeIdentifierNamesInCodeV3

You can also see the pre/post-processing steps I was taking to try and hack the 'arbitrary bits of JavaScript' into something that would parse in prepareCodeForASTParsing / prepareCodeForASTParsingV2 / prepareCodeForASTParsingV2PreProcess / prepareCodeForASTParsingV2PostProcess / etc

But after all of that, I think my ultimate conclusion (that I have yet to fully test/polish/turn into my final tool) was that I think I can get pretty good results, with minimal pre/post-hacking required, by just using acorn's tokenizer (Ref), rather than fully parsing the code into an AST.

https://github.com/acornjs/acorn/tree/master/acorn
- tokenizer(input, options) returns an object with a getToken method that can be called repeatedly to get the next token, a {start, end, type, value} object (with added loc property when the locations option is enabled and range property when the ranges option is enabled). When the token's type is tokTypes.eof, you should stop calling the method, since it will keep returning that same token forever.
  
  Note that tokenizing JavaScript without parsing it is, in modern versions of the language, not really possible due to the way syntax is overloaded in ways that can only be disambiguated by the parse context. This package applies a bunch of heuristics to try and do a reasonable job, but you are advised to use parse with the onToken option instead of this.

I need to play around with that more, and may end up going back to another library (eg. if I can use Babel's tokenizer or similar directly). I originally was looking at acorn as it has acorn-loose (Babel didn't seem to have an equivalent loose parser), but it's a somewhat outdated parser I believe:

https://github.com/acornjs/acorn/tree/master/acorn-loose
- This parser will parse any text into an ESTree syntax tree that is a reasonable approximation of what it might mean as a JavaScript program.
  
  It will, to recover from missing brackets, treat whitespace as significant, which has the downside that it might mis-parse a valid but weirdly indented file. It is recommended to always try a parse with the regular acorn parser first, and only fall back to this parser when that one finds syntax errors.

I don't think this exact case of a string being split across multiple lines occurs in the real code, but something similar happened when I was first testing that broke most of the parsing. I might check again later if the exact context would be useful.

@michaelskyba nods, fair, makes sense.

I see, that should work, but it confirms that there is no simple, universal solution.

@michaelskyba Well, I guess in the end, depending on what you mean by 'universal solution', I think the tokenizer approach is likely pretty close to a 'universal solution'.

Yeah, that is essentially what I was thinking, except I wanted to only search for completely new entries.

@michaelskyba Curious why you want to 'only search for completely new entries' at this stage of things within the AST parsing tool? I believe you'll get basically the same outcome (with simpler code) by embracing the unix philosophy (Ref) of combining tools. That way you can have the AST parsing script output all of the strings, and then just do a standard git diff / similar between those 'full string extracted files' (as I described earlier); which would then (I believe) produce the view of the 'only completely new entries' that you're after.

I'll see if I can make the string/identifier --> find insert line system work.

@michaelskyba Curious to see your approach here if you do get it to work.

(As just because the method I described above is how I believe would be simplest to implement it, doesn't mean that there isn't something i'm missing/unaware of/etc that would end up being an even better way)

Example: We have three version of a JS file: 0.js, 1.js, 2.js

The idea I'm describing in that quote is to do the following:

We extract the strings/identifiers of 0.js and put them into 0.txt.

Then, we extract strings/identifiers of 1.js, but only keep the ones not in 0.txt, and put those in 1.txt. Now, you could look at 1.txt to see the new strings/identifiers introduced in the 0.js --> 1.js revision.

Then, we extract the strings/identifiers of 2.js, but only keep the ones not in 0.txt, and not in 1.txt, and put those in 2.txt. Now, this will be the new strings/identifiers introduced in the 1.js --> 2.js revision.

etc.

This is the alternative to trying to first look at the diff between 0.js and 1.js or 1.js and 2.js, and then running the script on that diff as input. With the former idea of considering each file separately, the only input being passed to the AST program will be valid JavaScript, so it should always work as intended.

The downside is that if you look into 2.txt and see e.g. "DynamicModelSwitcherEnabled", you'd want to see which part of the 1.js --> 2.js diff shows that being introduced, which 2.txt doesn't tell you directly.

@michaelskyba nods, right, that makes sense. Though I'm curious, how do you see the method I described above/previously comparing with a pattern like this (better/worse/etc?)? Where you basically:

extract all of the strings from build-0.js and put them in 0.txt
extract all of the strings from build-1.js and put them in 1.txt
extract all of the strings from build-2.js and put them in 2.txt
now to see the new strings between build-0 and build-1 we can just diff 0.txt 1.txt
or to see the new strings between build-1 and build-2 we can just diff 1.txt 2.txt
though this would also give us the flexibility to easily see the new strings between build-0 and build-2 with just diff 0.txt 2.txt

If this '[filename]-strings.txt' file was committed alongside each *.js file in each build in the git repo, then you would be able to diff any arbitrary build version against any other version without having to re-run the string extraction script.

Or if we didn't want to store those files in the git repo, we could still diff any arbitrary build version against another, but would just need to re-run the extraction script each time. I haven't tested this, so it may need slight tweaks, but this is the basic command ChatGPT drafted for doing this when I asked:

diff \
  <(git show commit1:path/to/example.js | node extract-strings.js) \
  <(git show commit2:path/to/example.js | node extract-strings.js)

Here, I'm talking about a potential solution to the downside I mentioned above. If 2.txt can report that "DynamicModelSwitcherEnabled" was found on line L in 2.js, it could be possible to look at the diff between 1.js and 2.js, and find where line L is added.

Hopefully that makes a bit more sense?

@michaelskyba Right, yup, I think I'm following now.

Curious if the method I described above is a satisfactory solution to that, or if you would still be wanting to try and manually keep track of this sort of data between the 'pre-diffed' *.txt files in the method you described previously?

0xdevalias commented 6 months ago

Quickly skimming through the docs on the using predicate S-expressions:

pkg.go.dev/github.com/smacker/go-tree-sitter#readme-predicates

@michaelskyba I haven't tested this at all, so it might not be perfectly correct, but this is what ChatGPT drafted up based on the method I was talking about with using the predicate function of go-tree-sitter:

package main

import (
  "fmt"
  "io/ioutil"
  "log"

  "github.com/smacker/go-tree-sitter"
  "github.com/smacker/go-tree-sitter/javascript"
)

func main() {
  // Load your JavaScript code from a file
  content, err := ioutil.ReadFile("your_file.js")
  if err != nil {
    log.Fatalf("Failed to read file: %v", err)
  }

  // Parse the content using tree-sitter
  parser := sitter.NewParser()
  parser.SetLanguage(javascript.GetLanguage())

  // Create a tree from the content
  tree := parser.Parse(nil, content)
  defer tree.Delete()

  // Prepare the query
  queryString := `(string) @string`

  // Compile the query
  query, err := sitter.NewQuery([]byte(queryString), javascript.GetLanguage())
  if err != nil {
    log.Fatalf("Failed to create query: %v", err)
  }
  defer query.Delete()

  // Create a query cursor
  cursor := sitter.NewQueryCursor()
  defer cursor.Delete()

  // Execute the query
  cursor.Exec(query, tree.RootNode())

  // Iterate over the matches
  for {
    match, ok := cursor.NextMatch()
    if !ok {
      break
    }

    for _, capture := range match.Captures {
      // Get the node for the current capture
      node := capture.Node

      // Print the content of the string literal
      fmt.Println(node.Content(content))
    }
  }
}

michaelskyba commented 5 months ago

This isn't specifically related, but I described it poorly initially, so to clarify, I think identifiers could be useful indicators too, and not just strings, so I'm trying to extract both types.

@michaelskyba Yeah, if your goal is to include what you're working on in this repo, then matching the existing style of JS + Babel is ideal (unless there is a good reason to differ from it).

Though if this just ends up being your own personal exploration for learning, or to create a tool in it's own repo, obviously you can choose whatever language/lib choices you want to use/explore.

Hmm, that's a good point. Using Go or tree-sitter at all is probably already a more significant departure than the actual style of manual recursion vs visitor, or if statement vs node type pattern. I might just continue storing it in mine for now, and then perhaps make a rewrite in the future if it would make sense.

@michaelskyba nods, makes sense. So in my WIP PoC diff minimiser that I recently uploaded (Ref) I first parse the diff format using parse-diff, and then I do the AST parsing on the source chunks from within that parsed structure; which as mentioned above, still has the issue of 'arbitrary segments of javascript'.

The main function calls parseArgsAndReadInput to read in the raw diff (from a git diff/similar)

It then calls parseDiff to parse the diff into objects

It then iterates through the files in the diff, then the diff chunks in each of those files

groupChunkChanges is called to group the added/removed lines within a chunk

Those groupedChanges are then iterated through, and compared with compareNormalizedCodeDiffs

Then later on a bunch of stuff is done to basically remove the changes that were considered equal, then output the newly filtered diff in the same format as was originally read in

It's currently quite messy/hacky code, with multiple implementations scattered throughout as I was exploring different ways of doing things (eg. compareNormalizedCodeDiffs, compareNormalizedCodeDiffsV2, etc)

That makes more sense now, and is pretty sophisticated. Despite the chunks being arbitrary segments, I'd expect it to perform significantly better out of the box than running the AST on the raw diff. The string split across multiple lines scenario I mentioned, for example, shouldn't ever happen, because it is only caused by the second file's chunk

console.log("foo \
bar")

being parsed as if it was part of the first:

console.log("bar \
console.log("foo \
bar")

But after all of that, I think my ultimate conclusion (that I have yet to fully test/polish/turn into my final tool) was that I think I can get pretty good results, with minimal pre/post-hacking required, by just using acorn's tokenizer (Ref), rather than fully parsing the code into an AST.

I need to play around with that more, and may end up going back to another library (eg. if I can use Babel's tokenizer or similar directly). I originally was looking at acorn as it has acorn-loose (Babel didn't seem to have an equivalent loose parser), but it's a somewhat outdated parser I believe:

@michaelskyba Well, I guess in the end, depending on what you mean by 'universal solution', I think the tokenizer approach is likely pretty close to a 'universal solution'.

Oh, I see, I previously misunderstood what the context and role of the tokenizer was. Then yeah, I retract that; it should be general enough. In the case of only doing the extraction, it still looks like it'd be easier to run an AST before the diff, though.

@michaelskyba Curious why you want to 'only search for completely new entries' at this stage of things within the AST parsing tool? I believe you'll get basically the same outcome (with simpler code) by embracing the unix philosophy (Ref) of combining tools. That way you can have the AST parsing script output all of the strings, and then just do a standard git diff / similar between those 'full string extracted files' (as I described earlier); which would then (I believe) produce the view of the 'only completely new entries' that you're after.

There I (ambiguously) meant the "search" in the context of the outer diff, not inside the same AST parsing program. I'm generally a fan of the Unix philosophy, and try to follow it when, for example, writing personal shell scripts.

I was thinking of directly using a standard git diff between files, but I see a few issues with it.

I want to only look at the new entries and not the removed entries.
If an entry is removed in a prior version of the code and then added back in the latest version, it would show up in the diff, but I would want it to be filtered out.
The line numbers couldn't be directly included in the file, or else the diff would count the same entry on a different line as a different entry.

All of those are pretty easy to solve, though.

Maybe it'd make the most sense to only store the line-numbered versions of the entries of each file. Then, diff could still be used by removing the line numbers on the fly with something like cut -d ":" -f 2. For the completely new entries, we could have something similar (this literal example wouldn't work) to cat 0.txt 1.txt | cut -d ":" -f 2 | sort | uniq > $tmp and then grep -v -f $tmp 2.txt.

I'll see if I can make the string/identifier --> find insert line system work.

@michaelskyba Curious to see your approach here if you do get it to work.

(As just because the method I described above is how I believe would be simplest to implement it, doesn't mean that there isn't something i'm missing/unaware of/etc that would end up being an even better way)

I think this part should be compatible with your method. That was just an idea for finding the context of an entry inside the js --> js diff.

@michaelskyba nods, right, that makes sense. Though I'm curious, how do you see the method I described above/previously comparing with a pattern like this (better/worse/etc?)? Where you basically:

extract all of the strings from build-0.js and put them in 0.txt

extract all of the strings from build-1.js and put them in 1.txt

extract all of the strings from build-2.js and put them in 2.txt

now to see the new strings between build-0 and build-1 we can just diff 0.txt 1.txt

or to see the new strings between build-1 and build-2 we can just diff 1.txt 2.txt

though this would also give us the flexibility to easily see the new strings between build-0 and build-2 with just diff 0.txt 2.txt

If this '[filename]-strings.txt' file was committed alongside each *.js file in each build in the git repo, then you would be able to diff any arbitrary build version against any other version without having to re-run the string extraction script.

Curious if the method I described above is a satisfactory solution to that, or if you would still be wanting to try and manually keep track of this sort of data between the 'pre-diffed' *.txt files in the method you described previously?

Originally I was thinking of only storing new entries in each txt file, but I think you're right that having 2.txt contain all the entries from build-2.js would provide much more flexibility. However, as I mentioned above, I don't think diff by itself would be sufficient for analyzing the resulting txt files.

I think the line numbers would have to be stored in the pre-diffed txt files, or there'd be a lot of additional complexity for locating lines in the js --> js diff.

Or if we didn't want to store those files in the git repo, we could still diff any arbitrary build version against another, but would just need to re-run the extraction script each time. I haven't tested this, so it may need slight tweaks, but this is the basic command ChatGPT drafted for doing this when I asked:
diff \
  <(git show commit1:path/to/example.js | node extract-strings.js) \
  <(git show commit2:path/to/example.js | node extract-strings.js)

Yeah, something like this could also work. I guess the convenience will depend on how long the extractor will take to run.

0xdevalias commented 5 months ago

I think identifiers could be useful indicators too, and not just strings, so I'm trying to extract both types.

@michaelskyba Identifiers churning between builds are sort of specifically the thing that creates the noise we're trying to eliminate though..?

I might just continue storing it in mine for now, and then perhaps make a rewrite in the future if it would make sense.

@michaelskyba That's fair :)

That makes more sense now, and is pretty sophisticated. Despite the chunks being arbitrary segments, I'd expect it to perform significantly better out of the box than running the AST on the raw diff.

@michaelskyba nods it seems to work pretty well. Another alternative is to not do the text-diff first, and just do an 'AST diff' or similar (while ignoring identifier names changing), but I believe that would run a lot slower (or at least, the existing tools I tried definitely seemed to be slower)

In the case of only doing the extraction, it still looks like it'd be easier to run an AST before the diff, though.

@michaelskyba See comment above RE: this. Speed issues + laziness for not wanting to implement my own AST diff algorithm aside, that would probably be the 'ideal' 'pure' way to achieve this. I was trying to leverage a text based diff as a starting point for speed reasons.

I was thinking of directly using a standard git diff between files, but I see a few issues with it.

I want to only look at the new entries and not the removed entries.

If an entry is removed in a prior version of the code and then added back in the latest version, it would show up in the diff, but I would want it to be filtered out.

The line numbers couldn't be directly included in the file, or else the diff would count the same entry on a different line as a different entry.

All of those are pretty easy to solve, though.

@michaelskyba Of those, 1 would be almost trivial to handle as is; 2 would be slightly more annoying (though I also wonder how often that case would come up in reality, and if it needs to be 'solved for'); 3 it seems you already have some decent ideas on how to work around.

That was just an idea for finding the context of an entry inside the js --> js diff.

@michaelskyba Makes sense :)

Originally I was thinking of only storing new entries in each txt file, but I think you're right that having 2.txt contain all the entries from build-2.js would provide much more flexibility. However, as I mentioned above, I don't think diff by itself would be sufficient for analyzing the resulting txt files.

I think the line numbers would have to be stored in the pre-diffed txt files, or there'd be a lot of additional complexity for locating lines in the js --> js diff.

@michaelskyba nods yeah, makes sense.

Yeah, something like this could also work. I guess the convenience will depend on how long the extractor will take to run.

@michaelskyba nods yeah, exactly. I would assume it would run too slow and so be kind of annoying; but that could be improved by implementing a local cache (outside of the git repo) type thing such that for a given build, it only ever needs to run the actual parsing code once, and then would lookup from the cache after that (unless cleared).

Another reason I would think about committing the intermediary files is then you can sort of 'interactively explore' them through GitHub's diff interface; which was also one of the reasons I started committing the builds themselves (though the noise of the identifier churn between them sort of makes it a bit unusable for that purpose currently unfortunately)

michaelskyba commented 5 months ago

@michaelskyba Identifiers churning between builds are sort of specifically the thing that creates the noise we're trying to eliminate though..?

I haven't looked at the diffs that much but from what I've seen, the churn only affects a limited number of the identifiers, such as the ones comprised of 1-2 characters like

   e.d(t, {
        HV: function () {
          return N;
        },
        N8: function () {
          return S;
        },
        aP: function () {
          return m;
        },
        do: function () {
          return R;
        },

This is why I wanted to be able to look at introduced identifiers that have never been seen in any previous JavaScript file version. That way, these common churn identifiers would be found and eliminated, and then we'd only start seeing the more interesting ones like activityRegularizer or checkGate.

@michaelskyba nods it seems to work pretty well. Another alternative is to not do the text-diff first, and just do an 'AST diff' or similar (while ignoring identifier names changing), but I believe that would run a lot slower (or at least, the existing tools I tried definitely seemed to be slower)

@michaelskyba See comment above RE: this. Speed issues + laziness for not wanting to implement my own AST diff algorithm aside, that would probably be the 'ideal' 'pure' way to achieve this. I was trying to leverage a text based diff as a starting point for speed reasons.

Interesting, I didn't think about a diff on the AST level. But yeah, I guess it isn't a common enough use case for there to have been a lot of work already out there on matching the efficiency of text diffs.

@michaelskyba Of those, 1 would be almost trivial to handle as is; 2 would be slightly more annoying (though I also wonder how often that case would come up in reality, and if it needs to be 'solved for'); 3 it seems you already have some decent ideas on how to work around.

Number 2 is relevant to the identifier churn mentioned above. I would assume that something like aP could be churned out in one revision and churned back in at a different revision.

@michaelskyba nods yeah, exactly. I would assume it would run too slow and so be kind of annoying; but that could be improved by implementing a local cache (outside of the git repo) type thing such that for a given build, it only ever needs to run the actual parsing code once, and then would lookup from the cache after that (unless cleared).

Yeah, if there's a way to store some piece of cache smaller than the full output that still significantly speeds up the execution time.

Another reason I would think about committing the intermediary files is then you can sort of 'interactively explore' them through GitHub's diff interface; which was also one of the reasons I started committing the builds themselves (though the noise of the identifier churn between them sort of makes it a bit unusable for that purpose currently unfortunately)

Yeah, somebody interested in some ~casual browsing would probably not want to clone the code and run the extractor themselves. Diffs between extraction output would probably be similarly difficult to look at directly, though.

michaelskyba commented 5 months ago

I rewrote it (no line tracking yet) to use the following query pattern:

(string) @string
(template_string) @template
(identifier) @identifier
(property_identifier) @property
(shorthand_property_identifier) @shorthand
(comment) @comment

Here is the diff between the extractions (| sort | uniq) for two vendor- .js files I tested: https://gist.github.com/michaelskyba/fb93861fe6c4880138a586716e4beb18

If I only look for added entries rather than the diff itself, it's only 8 lines, not including the source map line that prettier added and the find_this_test_123 I added as a test:

$ rg -x -F -v -f 1713546501-cf14668063190b4c-uniq.txt 1713620903-c41f7bc7c59d56a3-uniq.txt
311:find_this_test_123
1396:"16 16 12 12 8 16"
9002:"&dpl=9133c19ee9ae73caaad3f40e3ac2f20dd10e0fdf"
9003:"?dpl=9133c19ee9ae73caaad3f40e3ac2f20dd10e0fdf"
13739:`headlessui-menu-button-${(0, p.M)()}`
13740:`headlessui-menu-item-${(0, p.M)()}`
13741:`headlessui-menu-items-${(0, p.M)()}`
18089:"M20.39 18.39A5 5 0 0 0 18 9h-1.26A8 8 0 1 0 3 16.3"
24471:Re4
27786://# sourceMappingURL=vendor-c41f7bc7c59d56a3.js.map

(Edit: I realize now that sourceMappingURL is part of the original file and was not added by Prettier.)

The diff between the actual JS files was 13622 lines and just looked like identifier churn to me. Snippet:

@@ -140080,7 +139375,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
         return nS(e).getPropertyValue(t);
       }
       let nA = ["top", "right", "bottom", "left"];
-      function nx(e, t, n) {
+      function nT(e, t, n) {
         let r = {};
         n = n ? "-" + n : "";
         for (let i = 0; i < 4; i++) {
@@ -140089,14 +139384,14 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
         }
         return (r.width = r.left + r.right), (r.height = r.top + r.bottom), r;
       }
-      let nT = (e, t, n) => (e > 0 || t > 0) && (!n || !n.shadowRoot);
+      let nx = (e, t, n) => (e > 0 || t > 0) && (!n || !n.shadowRoot);
       function nC(e, t) {
         let n, r;
         let i = e.touches,
           o = i && i.length ? i[0] : e,
           { offsetX: a, offsetY: s } = o,
           l = !1;
-        if (nT(a, s, e.target)) (n = a), (r = s);
+        if (nx(a, s, e.target)) (n = a), (r = s);

0xdevalias commented 5 months ago

I haven't looked at the diffs that much but from what I've seen, the churn only affects a limited number of the identifiers, such as the ones comprised of 1-2 characters like

This is why I wanted to be able to look at introduced identifiers that have never been seen in any previous JavaScript file version. That way, these common churn identifiers would be found and eliminated, and then we'd only start seeing the more interesting ones like activityRegularizer or checkGate.

@michaelskyba Mm, fair enough. There are definitely some that seem to not get minimised, but it's more like the majority of identifiers that get minimised (and thus churn) between builds.

Make sense RE: the 'only new' aspect. Though you could probably also eliminate a lot of that noise (with minimal false positives) by filtering out short identifiers that match the patterns used in minification.

I guess it isn't a common enough use case for there to have been a lot of work already out there on matching the efficiency of text diffs.

@michaelskyba nods true true. You can see some of the tools I explored in this issue:

https://github.com/0xdevalias/chatgpt-source-watch/issues/3

Such as:

https://github.com/Wilfred/difftastic
- a structural diff that understands syntax
- https://github.com/Wilfred/difftastic/issues/631
- https://github.com/Wilfred/difftastic/issues/508
https://github.com/afnanenayet/diffsitter
- A tree-sitter based AST difftool to get meaningful semantic diffs
- https://github.com/afnanenayet/diffsitter/issues/819
- https://github.com/afnanenayet/diffsitter/issues/149
- https://github.com/afnanenayet/diffsitter/issues/744

I would assume that something like aP could be churned out in one revision and churned back in at a different revision.

@michaelskyba Yeah, absolutely would be.

Yeah, if there's a way to store some piece of cache smaller than the full output that still significantly speeds up the execution time.

@michaelskyba It wouldn't even necessarily need to be smaller than the full output. The key difference as I see it is whether it would be a file that you would track within the repo/etc; or more of a 'hidden away in the cache' thing.

Obviously depending on your workflow/etc, you could just manually store the pre-calculated versions somewhere outside of a git repo anyway, and it would have a similar effect.

Yeah, somebody interested in some ~casual browsing would probably not want to clone the code and run the extractor themselves. Diffs between extraction output would probably be similarly difficult to look at directly, though.

@michaelskyba Mm, to some degree probably yeah. Though I guess it depends how much of a 'casual browser' we're talking about, and how much we reduce the noise in whatever output format it is.

My ideal goal would be to be able to re-write the identifiers to something stable (yet meaningful) between builds + be able to account for modules that 'move around', such that the diff shown on github (for the unpacked/ directory) would only show the actual meaningful code changes between each build.

Here is the diff between the extractions (| sort | uniq) for two vendor- .js files I tested

@michaelskyba It might be more interesting (and yet way larger) to look at the _app.js file; as that is often where the bulk of interesting changes seem to be.

If I only look for added entries rather than the diff itself, it's only 8 lines, not including the source map line that prettier added and the find_this_test_1234 I added as a test

@michaelskyba Neat!

The diff between the actual JS files was 13622 lines and just looked like identifier churn to me.

@michaelskyba Would be curious to see what that same diff would look like running through my diff minimiser (even in it's current non-polished form):

https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser.js

Seemingly 104 lines of diff normally:

⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | wc -l

     104

And 66 after running through the minimiser:

⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | wc -l

      66

But a lot of those extra lines are context/diff boilerplate:

Minimised Diff output

```diff diff --git a/unpacked/_next/static/chunks/vendor.js b/unpacked/_next/static/chunks/vendor.js index dcd0d0a..14aa329 100644 --- a/unpacked/_next/static/chunks/vendor.js +++ b/unpacked/_next/static/chunks/vendor.js @@ -69965,7 +69965,7 @@ r + "&q=" + (i || 75) + - "&dpl=fbb61498fbf3416255234adfde4320b0577c7e49" + "&dpl=46b7f37021d2244b5b4454b927e0735229de5626" ); } Object.defineProperty(t, "__esModule", { value: !0 }), @@ -73156,7 +73156,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE 64302: function (e, t) { "use strict"; function n() { - return "?dpl=fbb61498fbf3416255234adfde4320b0577c7e49"; + return "?dpl=46b7f37021d2244b5b4454b927e0735229de5626"; } Object.defineProperty(t, "__esModule", { value: !0 }), Object.defineProperty(t, "getDeploymentIdQueryOrEmptyString", { @@ -135660,13 +135656,11 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE if ( ((this.options = this.#O.defaultQueryOptions(e)), (0, r.VS)(n, this.options) || - this.#O - .getQueryCache() - .notify({ - type: "observerOptionsUpdated", - query: this.#A, - observer: this, - }), + this.#O.getQueryCache().notify({ + type: "observerOptionsUpdated", + query: this.#A, + observer: this, + }), void 0 !== this.options.enabled && "boolean" != typeof this.options.enabled) ) @@ -136602,13 +136596,11 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE let t = this.options; (this.options = this.#O.defaultMutationOptions(e)), (0, s.VS)(t, this.options) || - this.#O - .getMutationCache() - .notify({ - type: "observerOptionsUpdated", - mutation: this.#Y, - observer: this, - }), + this.#O.getMutationCache().notify({ + type: "observerOptionsUpdated", + mutation: this.#Y, + observer: this, + }), this.#Y?.setOptions(this.options); } onUnsubscribe() { @@ -196309,4 +196301,4 @@ ${t.comment}` }, }, ]); -//# sourceMappingURL=vendor-9a1d7ae664409493.js.map +//# sourceMappingURL=vendor-4cb082c626a8af6f.js.map ```

If we wanted to then filter it down to just added the lines, then we get 13:

⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | grep '^+' | grep -v '^+++' | wc -l
      13

⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | grep '^+' | grep -v '^+++'
+          "&dpl=46b7f37021d2244b5b4454b927e0735229de5626"
+        return "?dpl=46b7f37021d2244b5b4454b927e0735229de5626";
+                this.#O.getQueryCache().notify({
+                  type: "observerOptionsUpdated",
+                  query: this.#A,
+                  observer: this,
+                }),
+                this.#O.getMutationCache().notify({
+                  type: "observerOptionsUpdated",
+                  mutation: this.#Y,
+                  observer: this,
+                }),
+//# sourceMappingURL=vendor-4cb082c626a8af6f.js.map

Doing that same example for _app.js:

⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | wc -l

   14051

⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | wc -l

    2901

⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | grep '^+' | grep -v '^+++' | wc -l

    1003

Personally, I find the 'only newly added' removes too much context and makes it harder to review; so that 2901 view is the type of one I would usually look at. It still has noise in it due to needing to refine the minimiser further; and there are some parts that are technically not noise (as I define it), that someone might find value in also tuning out, which could make it smaller again. But (aside from where huge blocks of code have moved, and currently lose their --color-moved aspect), I find it quite usable.

The diff I chose there was just the latest one though, and I have often seen _app.js diffs start off more like 60k+ lines; so obviously with those, the minimised version is still going to be larger than this example.

0xdevalias commented 5 months ago

Here's a fun little hack I just thought up (while the diff minimiser script still breaks --color-moved):

Diff 2 files, run them through the minimiser, save that diff to a file
Apply the output of the minimised diff to the original file, save that to a file
Re-run diff with --color-moved between the original file, and the original+minimisedDiffPatch file

Currently this doesn't work fully, as I think I'm not properly updating the line metadata in the diff hunks when I've removed sections from them in the diff minimiser.. but something like this (in theory):

⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | \
sed 's|a/unpacked/_next/static/chunks/pages/_app.js|a/_app.original.js|g' | \
sed 's|b/unpacked/_next/static/chunks/pages/_app.js|b/_app.patched.js|g' \
> _app.minimised.diff

⇒ git show 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js > _app.original.js

⇒ git apply _app.minimised.diff

⇒ git -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js

Edit: I wrote a new scripts/fix-diff-headers.js (https://github.com/0xdevalias/chatgpt-source-watch/commit/f7deec74cca9127bc2f18279d83f7a088f3f35f1) script that seems to get it closer to working..

⇒ ./scripts/fix-diff-headers.js _app.minimised.diff _app.fixed.diff
Reading diff from _app.minimised.diff and writing corrected diff to _app.fixed.diff
Diff successfully corrected and written to _app.fixed.diff

⇒ patch _app.original.js _app.fixed.diff -o _app.patched.js
patching file _app.original.js
6 out of 90 hunks failed--saving rejects to _app.patched.js.rej

⇒ git -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js

That way, instead of just seeing a giant chunk of 'removed' and a giant chunk of 'added' when this module moved; we can see the dimmed lines which were moved (but otherwise unchanged), and spend less time/effort having to consider them:

An even better version of this would be a diff --color-moved implementation that was able to ignore identifier names changing entirely (in which case I suspect it would mark this entire block as 'moved but not changed'); but as far as I am currently aware, the only way to do that would be to basically re-implement the whole diff --color-moved logic ourselves; which would basically end up with us writing our own AST differ I believe.

Though the way I was originally intending on fixing/improving the diff minimiser script was to try just maintaining the ANSI colouring of the input lines, and outputting the lines with the same colouring at the end.