michaelskyba opened 6 months ago
According to a random post I read on Hacker News, difftastic uses tree-sitter too, so I suppose it would be general enough for this use case: https://tree-sitter.github.io/tree-sitter/
Since common generic ones like `e`, `t`, `n`, `""`, and `"."` would show up frequently in any context, they would have already been seen in the past, and therefore filtered out. You'd focus more on strings like "Use Up and Down to choose options"
@michaelskyba For the variable names like `e`, `t`, `n`, etc; if you're using an AST parser they'll show as a different node type to proper `""` strings, so they'll be pretty easy to avoid.
with some kind of convenient way to jump back to see it in-context in the code.
@michaelskyba An easy way to connect from those strings back to the context in the code would definitely be interesting. Almost feels like something you could do with monaco editor (the core of vscode), and symbol links/similar:
I know @j4k0xb has done some cool stuff with using monaco as the web IDE for webcrack:
And I think @pionxzh's wakaru also uses monaco (or plans to for the newer IDE; I can't remember exactly off the top of my head):
As a first exploration/PoC, I would probably consider the contextual linking back to be out of scope though, as it has the ability to quickly become a complicated rabbithole/distraction from the core features.
For large diffs I think it'd be a lot, but strings and names are a subset of the raw diff, so it should still be less work than a full manual analysis. The idea is to just visually filter through them until you see a name/string that looks interesting on its own, which could lead to something good in-context.
@michaelskyba Mm, makes sense.
Are there other AST parsers too? Would something like TreeSitter work? I'd generally prefer to avoid node.js if it's not required
@michaelskyba Yeah, there are definitely lots of AST parsers out there. Technically you could probably use TreeSitter or similar for this.
I'm curious what your reasoning for wanting to avoid using node is though? Pretty much all of the scripts and automation in this repo are currently written in node/JS/TS; for many use cases it's a pretty useful tool.
If you want to go deep on this subject, there's heaps of stuff on my gists, eg.:
According to a random post I read on Hacker News, difftastic uses tree-sitter too, so I suppose it would be general enough for this use case
@michaelskyba Yup, tree-sitter is pretty generic.
Yeah, I want to make a prototype and see if it will kind of work. I'm still not sure on the implementation, though; the most efficient system might be to integrate with a text editor, which makes it harder to be replicable
@michaelskyba If you're happy to use node and babel parser/etc, then I can definitely help you get started with it. There are already some examples within this repo (albeit not using traverse I don't think):
There are a few 'AST Explorer' tools that are super useful when working with ASTs as well:
I would personally start by writing it as a standalone script that just writes to the console or a file, that way you can quickly 'prove the concept' and see if it will be beneficial. Then later on, if you wanted to enhance the UX (user experience) of using it by integrating it into a text editor / IDE / GUI / etc, that could be a path for further exploration.
Here's a quick/dirty example that I hacked together with ChatGPT + some patterns from existing code. I haven't deeply tested that it works properly, but from a quick skim through + basic test, it looks like it should probably work, or be close to doing so:
The main parts to pay attention to are `parseCodeToAST` and `extractStrings`; the logic within them might need some tweaking/etc though; and it could no doubt be extended to better capture the context they're being extracted from/etc to make it easier to jump back to them later, etc.
While the above example shows that the concept works, running it on a larger file ends up with a LOT more output content (eg. 57k lines in this case):
```shell
⇒ ./scripts/extract-js-strings.js unpacked/_next/static/chunks/pages/_app.js | wc -l
57025
```
Obviously most of that wouldn't change between builds though, so would end up being a lot less all in all.
Edit: Committed the above basic script in 210a87098dfca4d69dfcff9642fbefdb74f8b514 for future reference/usage.
Then you would just diff that output file of strings between one build and the next. If code moves around between builds it might introduce its own form of noise (but maybe `git diff --color-moved` would handle that still anyway). I haven't seen enough diffs to anticipate exactly what these would look like, but there might be different solutions like `--color-moved` that could work depending on how it goes
Something like this could also be useful for reducing the 'noise' of code that moves around:
As an alternative, you can use VS Code to review while highlighting moved code. It's available behind an experimental flag. It also works in GitHub's online version. So, simply press `.` in an open PR and review there.
This iteration we polished the moved code detection feature. It can be enabled with `"diffEditor.experimental.showMoves": true` or in the diff editor context menu. When enabled, code blocks that are moved from one location to a different location in the same file are detected and arrows are drawn to indicate where the code blocks moved to. Code moves are also detected when they are slightly modified. The Compare button can be used to compare the block before and after the move.
An easy way to connect from those strings back to the context in the code would definitely be interesting. Almost feels like something you could do with monaco editor (the core of vscode), and symbol links/similar:
Yeah, possibly; that's the kind of thing I was thinking could be useful as a text editor integration. But I think that would make sense as a wrapper around the main script. The output itself could maybe have the line numbers to the left of each finding, which would be parsed by the wrapper
I'm curious what your reasoning for wanting to avoid using node is though? Pretty much all of the scripts and automation in this repo are currently written in node/JS/TS; for many use cases it's a pretty useful tool.
I guess I just personally find the JS/npm ecosystem pretty annoying compared to e.g. Python or Go. It's not a dealbreaker though
If you want to go deep on this subject, there's heaps of stuff on my gists
Yeah, looks useful, I linked to it as part of my notes for the ChatGPT frontend
Here's a quick/dirty example that I hacked together with ChatGPT + some patterns from existing code
Cool, that's pretty good
I still want to look at Tree-sitter in case I like it more, but this kind of API from `@babel/traverse` looks fine enough
```js
const traverse = require("@babel/traverse").default;

function extractStrings(ast) {
  const strings = [];
  traverse(ast, {
    StringLiteral: ({ node }) => {
      strings.push(node.value);
    },
    TemplateLiteral: ({ node }) => {
      node.quasis.forEach((element) => {
        strings.push(element.value.cooked);
      });
    },
  });
  return strings;
}
```
Looks like building .so language files from source is deprecated in py-tree-sitter, but there are no official python bindings for the javascript grammar. Welp, that's annoying...
The third-party Go bindings in https://github.com/smacker/go-tree-sitter look more promising maybe
Yeah, go-tree-sitter seems to work. Counting lines should be possible by looking at `sourceCode`.
From the testing I've done to this point, I think I would prefer go-tree-sitter to Babel, so I'll continue developing the formatting here. It'd still be useful to have your extract-js-strings.js for reference, though.
Another approach I used in the past is renaming every variable and its references etc to `a` so the diff can detect actual changes pretty reliably:
and ignore all name changes caused by minifying:
I wonder if it's possible to get all the diff positions and then render it using the original names
I wonder if it's possible to get all the diff positions and then render it using the original names
That sounds like it could indeed save time on skipping some changes, but I'd expect some minification diffs to be structurally different too, such that the same values are written differently, even if the actual name is the same. But I expect @0xdevalias would have a more informed perspective
In other news, I got a basic byte tracker that checks for newlines, to map each name properly:
```
1:e
1:t
1:n
2:"use strict"
3:a
4:r
4:n
5:s
6:s
6:Object
6:assign
7:Object
7:assign
7:bind
8:e
9:t
9:t
9:arguments
9:length
9:t
```
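A minimal sketch of that kind of newline-counting byte tracker (my own illustration of the idea; `makeLineTracker` and the sample offsets are hypothetical, not the actual implementation):

```javascript
// Sketch of a byte-offset -> line-number tracker: advance through the
// source once, counting newlines, as names arrive in source order.
function makeLineTracker(source) {
  let line = 1; // 1-based line counter
  let pos = 0;  // current position in the source

  // Must be called with non-decreasing offsets (names in source order).
  return function lineAt(offset) {
    while (pos < offset) {
      if (source[pos] === "\n") line++;
      pos++;
    }
    return line;
  };
}

const src = 'var e, t, n;\n"use strict";\nvar a;\n';
const lineAt = makeLineTracker(src);
console.log(`${lineAt(4)}:e`);              // "e" is at offset 4, line 1
console.log(`${lineAt(13)}:"use strict"`);  // string starts line 2
console.log(`${lineAt(31)}:a`);             // "a" is on line 3
```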
afaik minifiers should produce the exact same structure for unmodified regions of the source
afaik minifiers should produce the exact same structure for unmodified regions of the source
Oh, that's more convenient, then. Does the formatter (biome/prettier) have the same property?
depends, they sometimes introduce additional line breaks, parentheses, " " around all properties if one changes, etc
RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015460661
But I think that would make sense as a wrapper around the main script. The output itself could maybe have the line numbers to the left of each finding, which would be parsed by the wrapper
@michaelskyba Yeah definitely. Having a script that is able to be run standalone, but that can also be wrapped into an IDE/other tooling for a nicer GUI is how I would look at approaching it myself.
--
I guess I just personally find the JS/npm ecosystem pretty annoying compared to e.g. Python or Go. It's not a dealbreaker though
@michaelskyba Mm, fair enough. I find they all have their own issues and strengths, but don't tend to find one is objectively better than the other in most cases. Each to their own though.
--
I still want to look at Tree-sitter in case I like it more, but this kind of API from @babel/traverse looks fine enough
@michaelskyba Mm, makes sense.
RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015650726
Yeah, go-tree-sitter seems to work. Counting lines should be possible by looking at `sourceCode`.
@michaelskyba Nice!
RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015666180
From the testing I've done to this point, I think I would prefer go-tree-sitter to Babel, so I'll continue developing the formatting here. It'd still be useful to have your `extract-js-strings.js` for reference, though.
@michaelskyba That's fair enough. `extract-js-strings.js` is committed in this repo, and no plans to remove it, so it will be there for reference.
Curious, aside from not liking node/etc, what aspects of `go-tree-sitter` do you find preferable/more useful than Babel? Or is it just that it's not node?
RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015789399
Another approach I used in the past is renaming every variable and its references etc to `a` so the diff can detect actual changes pretty reliably
@j4k0xb Yeah, that is more or less the same concept I'm following with my diff minimiser; though in the current PoC version of it, I'm 'stabilising' the identifier names after running an initial diff on it, rather than doing so pre-diff.
I still haven't cleaned it up + pushed the code here yet, but you can see some snippets of updates from when I was last working on it here:
--
I wonder if it's possible to get all the diff positions and then render it using the original names
@j4k0xb If by 'get the diff positions' you mean the sections that are still diff after stabilising the names, then that is more or less how my diff minimiser script is doing things currently.
It basically ingests a diff, processes the added/removed chunks with an AST parser to normalise any identifiers, and then if the 'identifier normalised' code is equal, then it removes that chunk from the diff that is output at the end. That way the identifiers are only ever changed within the minimiser check itself, but you still end up with the original names in the output.
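A heavily simplified sketch of that chunk comparison, using a naive regex 'tokenizer' in place of a real AST parser (`normalizeIdentifiers` and `isRenameOnlyChange` are hypothetical names; an actual implementation would parse properly rather than regex over strings/comments):

```javascript
// Simplified sketch of 'identifier normalisation' chunk comparison.
// A real implementation would use an AST parser; this regex version is
// only illustrative and will mis-handle strings and comments.
function normalizeIdentifiers(code) {
  const seen = new Map(); // original name -> stable placeholder
  return code.replace(/\b[A-Za-z_$][\w$]*\b/g, (name) => {
    // Leave obvious keywords alone so the structure still matters.
    if (["var", "let", "const", "function", "return"].includes(name)) {
      return name;
    }
    if (!seen.has(name)) seen.set(name, `_id${seen.size}`);
    return seen.get(name);
  });
}

// If the removed/added chunks are equal after normalisation, the only
// change was minifier renaming, and the chunk can be dropped from the
// output diff (while the output itself keeps the original names).
function isRenameOnlyChange(removedCode, addedCode) {
  return normalizeIdentifiers(removedCode) === normalizeIdentifiers(addedCode);
}

console.log(isRenameOnlyChange("var e = t(n);", "var a = r(s);"));    // true
console.log(isRenameOnlyChange("var e = t(n);", "var a = r(s, 1);")); // false
```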
RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015799139
I'd expect some minification diffs to be structurally different too, such that the same values are written differently, even if the actual name is the same.
@michaelskyba A lot of the time the majority of the code remains stable.
Occasionally I'll see webpack chunks move around a bit, and/or change the chunk ID that is referring to them/etc, but I think that usually only happens when related code has actually changed in the build, and then that causes them to be reworked when they are minimised/etc.
I pre-process the code with `prettier`/similar (switched from prettier to `biome` for speed, but it's essentially doing the same job) to stabilise the layout/formatting/etc; and I run the formatter twice because sometimes it seems to not fully work when only run once (at least that seemed to be the case with prettier)
--
In other news, I got a basic byte tracker that checks for newlines, to map each name properly
@michaelskyba Nice!
RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015828448
afaik minifiers should produce the exact same structure for unmodified regions of the source
@j4k0xb That aligns with my thoughts/experience with them too.
RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015843139
Does the formatter (biome/prettier) have the same property?
@michaelskyba Generally speaking, yeah, they tend to be pretty 'stable' in that regard (with the slight caveat that I mentioned above RE: needing to sometimes run them twice)
RE: https://github.com/0xdevalias/chatgpt-source-watch/issues/10#issuecomment-2015847277
depends, they sometimes introduce additional line breaks, parentheses, " " around all properties if one changes, etc
@j4k0xb I don't think I've seen this happen, or if I have, it's been so rare that I don't remember it/notice it.
The main times I have noticed an issue is when I need to run it twice because it failed to properly format everything the first time; and when I accidentally change the settings I am passing to it (like from using `"` to using `'`), in which case it definitely produces a lot of excess noise.
Curious, aside from not liking node/etc, what aspects of go-tree-sitter do you find preferable/more useful than Babel? Or is it just that it's not node?
I didn't investigate Babel enough to have an even somewhat-informed opinion, but in my head, the interface that I wanted was to write a recursive function that would DFS through the tree, which is what I ended up being able to make easily with go-tree-sitter. From what I can tell just from seeing `extract-js-strings.js`, Babel traverse already includes its own traverse function, and takes objects that specify which code to run depending on the type of node found. This is less intuitive to me, but maybe there is a way to implement it in the former style in Babel too. All else equal, I'd prefer Go to node, though, yeah.
--
Something I've realized is that diffs of JavaScript files are further from valid JavaScript than I thought, such that it's impossible for tree-sitter to parse them out of the box. I think the way it would have to work is to extract the strings/identifiers of the new file and `grep --invert-match` against all previous findings.
But I'd like to view them in the context of the diff rather than just the new file. Just by going to a line and being told that xxx is new, the context of the change isn't obvious. So maybe there's also a way to find, given a line L in the new file, which line in the diff shows L being added. I'd expect it to be more complicated than a `grep --fixed-strings` because the same line could have been added multiple times.
It basically ingests a diff, processes the added/removed chunks with an AST parser to normalise any identifiers, and then if the 'identifier normalised' code is equal, then it removes that chunk from the diff that is output at the end. That way the identifiers are only ever changed within the minimiser check itself, but you still end up with the original names in the output.
How does your AST parser handle the chunks of those diffs? For a diff like this:

```diff
diff --git a/1.js b/2.js
index ad8dee1..923ec19 100644
--- a/1.js
+++ b/2.js
@@ -1,2 +1,2 @@
-console.log("bar \
+console.log("foo \
 bar")
```

I don't see an obvious way to dodge the string quoting, besides mapping the changed lines back to the files they came from, which itself sounds difficult.
I didn't investigate Babel enough to have an even somewhat-informed opinion, but in my head, the interface that I wanted was to write a recursive function that would DFS through the tree, which is what I ended up being able to make easily with `go-tree-sitter`. From what I can tell just from seeing `extract-js-strings.js`, Babel traverse already includes its own traverse function, and takes objects that specify which code to run depending on the type of node found. This is less intuitive to me, but maybe there is a way to implement it in the former style in Babel too.
@michaelskyba As far as I am aware, `babel-traverse` implements an AST visitor pattern, which does a DFS through the AST tree. So I believe you basically just re-invented the wheel there.
It doesn't inherently have to specify which type of node to run against, it could run for all of them (you can see the most basic example using `enter` in the links above), but given we only wanted to match certain node types, it made the most sense to specify that filter up front.
You may find that last section on AST traversal a useful read even if you're not using Babel specifically, as it's a pretty common pattern from what I have seen.
Even with `go-tree-sitter`, you may be able to get closer to that functionality without having to manually brute force your own recursion through things, eg:
Something I've realized is that diffs of JavaScript files are further from valid JavaScript than I thought, such that it's impossible for tree-sitter to parse them out of the box.
@michaelskyba Different in what sense? Because the parts of the diff included aren't necessarily fully formed valid JavaScript?
I think the way it would have to work is to extract the strings/identifiers of the new file and `grep --invert-match` against all previous findings.
@michaelskyba Not sure I am following here. Why do you feel this would need to be done; and/or can you demonstrate your thoughts with an example to help explain it?
So maybe there's also a way to find, given a line L in the new file, which line in the diff shows L being added. I'd expect it to be more complicated than a `grep --fixed-strings` because the same line could have been added multiple times.
@michaelskyba Also not sure I'm fully following your logic here. Maybe the same clarity from explaining the above will help me here too?
How does your AST parser handle the chunks of those diffs? For a diff like this:

```diff
diff --git a/1.js b/2.js
index ad8dee1..923ec19 100644
--- a/1.js
+++ b/2.js
@@ -1,2 +1,2 @@
-console.log("bar \
+console.log("foo \
 bar")
```

I don't see an obvious way to dodge the string quoting, besides mapping the changed lines back to the files they came from, which itself sounds difficult.
@michaelskyba What aspect of that are you asking about? The fact that the string is split over 2 lines, where one hasn't changed, and thus wouldn't be seen in the added/removed 'pair'; such that the code to be parsed would end up being just `console.log("foo \`?
I guess my first question would be whether that is ever a case that would exist after being prettified with `prettier`/`biome`/etc; but assuming it still would; for my needs, that is why I went from the standard 'strict' parser, to a more lenient parser + 'hacky code fixes' as vaguely mentioned here:
- https://twitter.com/_devalias/status/1753711267137892684
A few more updates that I posted elsewhere as I was going; from 7100, to 6455, to 4268, and now down to 3913 lines.

Added a few more 'pre AST parsing' fixups that get the diff down even further: from the previous 7100 lines to 6455 lines.

Trialled parsing the AST with a more lenient parser as a pre-step; filtering out unknown bits, then generating that back into code; that I then apply some of my previous 'preparation hacks' to, before parsing it with the more strict parser and doing the identifier normalisation. That's currently reduced the 6455 lines down to 4268 lines; and there are still leftover parsing errors that could potentially get that number lower still that I'm chipping away at.

Another alternative could be to completely replace the stricter parser with the more lenient one; but then I would have to rewrite a bunch of other code as well. I could also potentially try a different lenient parser; or pre-process the code being passed to this one; as I believe it's failing to parse some statements even; which is potentially leading to more of that noise filtering through..

I added a pre-processing step to the lenient parser which improved things a bit; then I added an actual fix to the AST from the lenient parser to help fix more complex ternaries that were challenging to fix in my simpler string append pre/post fix methods. With those, we're now down to 3913 lines.

Being able to properly detect and fix object properties that are missing their wrapping object would probably fix a bunch more of the issues. Though maybe when I look at this next; I'll just see whether using the lenient parser on its own gives a quicker/better result.
Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/3#issuecomment-1925508005
After that, I did end up trying the lenient parser on its own, which still had some issues even after adding the hacky fixes to it; and then I believe I trialled using just the tokenizer from the lenient parser, which from memory ended up working pretty well for my needs, since it was still able to parse the `Identifier` nodes that I was needing to normalise.
Though for your use case of extracting the strings, I don't think parsing the diffs is the right approach. I would probably just parse the raw JS, and dump the full lot of strings out to a .txt file; then I would diff that whole text file of the latest version, against the whole text file of the previous version.
A lot of this will probably be uninteresting/useless noise, but I just made a repo and pushed all of my hacky WIP/PoC AST exploration scripts to it so they're captured in one place:
PoC scripts and tools for working with (primarily JavaScript) ASTs.
What might be of interest though, is that I pushed the current hacky state of my diff minimiser code there too. I haven't checked it for a good few weeks, so can't even remember if it's in a runnable state currently, but if nothing else, there might be some ideas hidden away in there that help explain what I was referring to above:
When I get a chance to get back to them, I plan to finish my research/refinements on the best method(s), and then clean it all up back to a single useful script that I will commit to this repo. But until then, those may be of some interest to someone.
@michaelskyba As far as I am aware, babel-traverse implements an AST visitor pattern, which does a DFS through the AST tree. So I believe you basically just re-invented the wheel there.
It doesn't inherently have to specify which type of node to run against, it could run for all of them (you can see the most basic example using enter in the links above), but given we only wanted to match certain node types, it made the most sense to specify that filter up front.
Alright, yeah, Babel's system makes more sense now in the context of defining a visitor. I'd still prefer to use a general enter function and then have something like an if statement, but overall I could see it being more convenient than manual recursion.
```js
traverse(ast, {
  enter(path) {
    if (
      path.node.type === "Identifier" &&
      path.node.name === "n"
    ) {
      path.node.name = "x";
    }
  }
});
```
I initially disliked tree-sitter's predicates when I first saw them but I've looked at some of the go.dev docs and maybe they would be cleaner, since I should still be able to track the source iteratively with
Now that I think about it, if this is in the context of being added to the `scripts/` directory or similar, then it would make sense to follow what matches the existing style as best as possible. I haven't finalized the general design yet but I'll try to be more cognizant of that.
Something I've realized is that diffs of JavaScript files are further from valid JavaScript than I thought, such that it's impossible for `tree-sitter` to parse them out of the box.
@michaelskyba Different in what sense? Because the parts of the diff included aren't necessarily fully formed valid JavaScript?
Yes, arbitrary segments taken from a JavaScript file won't be fully formed valid JavaScript by themselves. But there's also the surrounding diff syntax of e.g.

```
diff --git a/original.js b/vendor-6a62821ae00843ea.js?dpl=576b52374bc8a5e2b6fbb183818b6924a40171a2
index 05cb4d0..57d05c3 100644
--- a/original.js
+++ b/vendor-6a62821ae00843ea.js?dpl=576b52374bc8a5e2b6fbb183818b6924a40171a2
```

That could be filtered out depending on the output structure, though.
I think the way it would have to work is to extract the strings/identifiers of the new file and `grep --invert-match` against all previous findings.
@michaelskyba Not sure I am following here. Why do you feel this would need to be done; and/or can you demonstrate your thoughts with an example to help explain it?
Example: We have three versions of a JS file: 0.js, 1.js, 2.js
The idea I'm describing in that quote is to do the following: run the string/identifier extraction over each full file to produce 0.txt, 1.txt, and 2.txt; then, when reviewing 2.js, filter out any entries in 2.txt that already appear in 0.txt or 1.txt, leaving only the completely new findings.
This is the alternative to trying to first look at the diff between 0.js and 1.js or 1.js and 2.js, and then running the script on that diff as input. With the former idea of considering each file separately, the only input being passed to the AST program will be valid JavaScript, so it should always work as intended.
The downside is that if you look into 2.txt and see e.g. "DynamicModelSwitcherEnabled", you'd want to see which part of the 1.js --> 2.js diff shows that being introduced, which 2.txt doesn't tell you directly.
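That 'only completely new entries' step could be sketched roughly like this (pure illustration; `newFindings` is a hypothetical helper, with the .txt dumps assumed to already be read in as arrays of strings):

```javascript
// Sketch: given string dumps from previous builds and the latest build,
// keep only entries never seen before (the grep --invert-match idea,
// expressed as a set difference).
function newFindings(previousDumps, latestDump) {
  const seen = new Set(previousDumps.flat());
  return latestDump.filter((s) => !seen.has(s));
}

const dump0 = ["e", "t", "Use Up and Down to choose options"];
const dump1 = ["e", "t", "n", "use strict"];
const dump2 = ["e", "n", "use strict", "DynamicModelSwitcherEnabled"];

console.log(newFindings([dump0, dump1], dump2));
// -> ["DynamicModelSwitcherEnabled"]
```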
So maybe there's also a way to find, given a line L in the new file, which line in the diff shows L being added. I'd expect it to be more complicated than a `grep --fixed-strings` because the same line could have been added multiple times.
@michaelskyba Also not sure I'm fully following your logic here. Maybe the same clarity from explaining the above will help me here too?
Here, I'm talking about a potential solution to the downside I mentioned above. If 2.txt can report that "DynamicModelSwitcherEnabled" was found on line L in 2.js, it could be possible to look at the diff between 1.js and 2.js, and find where line L is added.
Hopefully that makes a bit more sense?
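That lookup could be sketched by walking the unified diff while tracking the new-file line counter (illustrative only; `addedLinePositions` is a hypothetical helper, and standard `@@ -a,b +c,d @@` hunk headers are assumed):

```javascript
// Sketch: given a unified diff (as an array of lines), return a map from
// new-file line number -> index of the diff line that added it.
function addedLinePositions(diffLines) {
  const map = new Map();
  let newLine = 0; // current line number in the new file
  for (let i = 0; i < diffLines.length; i++) {
    const line = diffLines[i];
    const hunk = line.match(/^@@ -\d+(?:,\d+)? \+(\d+)/);
    if (hunk) {
      newLine = parseInt(hunk[1], 10); // hunk header resets the counter
    } else if (line.startsWith("+") && !line.startsWith("+++")) {
      map.set(newLine, i); // this diff line added new-file line `newLine`
      newLine++;
    } else if (!line.startsWith("-")) {
      newLine++; // context lines advance the new-file counter too
    }
  }
  return map;
}

const diff = [
  "--- a/1.js",
  "+++ b/2.js",
  "@@ -1,2 +1,2 @@",
  '-console.log("bar");',
  '+console.log("foo");',
  ' console.log("baz");',
];
console.log(addedLinePositions(diff).get(1)); // -> 4 (the "+console..." line)
```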
@michaelskyba What aspect of that are you asking about? The fact that the string is split over 2 lines, where one hasn't changed, and thus wouldn't be seen in the added/removed 'pair'; such that the code to be parsed would end up being just `console.log("foo \`?
Yes, which would be completely unparsable out of the box, as the rest of the code in the diff would be incorrectly considered a string. This was meant to be an example of how an AST parser can't parse a raw provided diff.
If you mean to not parse the file itself but to split it up into each diff chunk, and then to parse the removed / added lines separately, then that could potentially work better. I haven't tried that with tree-sitter.
I guess my first question would be whether that is ever a case that would exist after being prettified with `prettier`/`biome`/etc
I don't think this exact case of a string being split across multiple lines occurs in the real code, but something similar happened when I was first testing that broke most of the parsing. I might check again later if the exact context would be useful.
but assuming it still would; for my needs, that is why I went from the standard 'strict' parser, to a more lenient parser + 'hacky code fixes' as vaguely mentioned here:
After that, I did end up trying the lenient parser on its own, which still had some issues even after adding the hacky fixes to it; and then I believe I trialled using just the tokenizer from the lenient parser, which from memory ended up working pretty well for my needs, since it was still able to parse the `Identifier` nodes that I was needing to normalise.
I see, that should work, but it confirms that there is no simple, universal solution.
Though for your use case of extracting the strings, I don't think parsing the diffs is the right approach. I would probably just parse the raw JS, and dump the full lot of strings out to a .txt file; then I would diff that whole text file of the latest version, against the whole text file of the previous version.
Yeah, that is essentially what I was thinking, except I wanted to only search for completely new entries. I'll see if I can make the string/identifier --> find insert line system work.
Babel's system makes more sense now in the context of defining a visitor. I'd still prefer to use a general enter function and then have something like an if statement
@michaelskyba nods, yeah, that's fair.
Deciding whether it's better to use a raw `enter` + `if` pattern, or the basic 'single node type' pattern (like I used in my PoC, eg. `StringLiteral(path) { /* ... */ }`), or even seemingly 'multiple node types' (which I only learned about after linking the docs above, eg. `"StringLiteral|TemplateLiteral"(path) { /* ... */ }`) would largely depend on what the rest of the logic you need to implement will be, I would suspect.
I initially disliked `tree-sitter`'s predicates when I first saw them but I've looked at some of the go.dev docs and maybe they would be cleaner, since I should still be able to track the source iteratively with
@michaelskyba Quickly skimming through the docs on using predicate S-expressions:
For my own reference, it looks like you:
- create a query with `sitter.NewQuery`
- create a query cursor with `sitter.NewQueryCursor`
- execute the query via the `QueryCursor`
- call `qc.NextMatch` repeatedly to get each `QueryMatch`
- read each `QueryMatch`'s `QueryCapture`s, which give access to the captured `Node`
And then yeah, there seem to be heaps of useful things that `Node` gives you access to, including the node type, source, location within the source, etc.
I think this method is somewhat close to my 'named node type' visitor pattern that I used in my previous Babel example PoC.
Now that I think about it, if this is in the context of being added to the `scripts/` directory or similar, then it would make sense to follow what matches the existing style as best as possible. I haven't finalized the general design yet but I'll try to be more cognizant of that.
@michaelskyba Yeah, if your goal is to include what you're working on in this repo, then matching the existing style of JS + Babel is ideal (unless there is a good reason to differ from it).
Though if this just ends up being your own personal exploration for learning, or to create a tool in its own repo, obviously you can choose whatever language/lib choices you want to use/explore.
Yes, arbitrary segments taken from a JavaScript file won't be fully formed valid JavaScript by themselves. But also the surrounding diff syntax
@michaelskyba nods, makes sense. So in my WIP PoC diff minimiser that I recently uploaded (Ref) I first parse the diff format using `parse-diff`, and then I do the AST parsing on the source chunks from within that parsed structure; which as mentioned above, still has the issue of 'arbitrary segments of javascript'.
Yes, which would be completely unparsable out of the box, as the rest of the code in the diff would be incorrectly considered a string. This was meant to be an example of how an AST parser can't parse a raw provided diff.
@michaelskyba Yup, makes sense.
If you mean to not parse the file itself but to split it up into each diff chunk, and then to parse the removed / added lines separately, then that could potentially work better. I haven't tried that with tree-sitter.
@michaelskyba Yeah, that was the approach I took in my WIP PoC diff minimiser (Ref):
A very rough walkthrough:

- The `main` function calls `parseArgsAndReadInput` to read in the raw diff (from a `git diff`/similar)
- It then calls `parseDiff` to parse the diff into objects
- It then iterates through the files in the diff, then the diff chunks in each of those files
- `groupChunkChanges` is called to group the added/removed lines within a chunk
- Those `groupedChanges` are then iterated through, and compared with `compareNormalizedCodeDiffs`
- Then later on a bunch of stuff is done to basically remove the changes that were considered equal, then output the newly filtered diff in the same format as was originally read in

It's currently quite messy/hacky code, with multiple implementations scattered throughout as I was exploring different ways of doing things (eg. `compareNormalizedCodeDiffs`, `compareNormalizedCodeDiffsV2`, etc)
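The grouping step could be sketched roughly like this (my own illustration of the idea, not the actual `groupChunkChanges` implementation):

```javascript
// Sketch: walk a diff chunk's body lines and collect consecutive runs of
// removed/added lines into groups, splitting whenever a context line appears.
function groupChunkChangesSketch(chunkLines) {
  const groups = [];
  let current = null;
  for (const line of chunkLines) {
    if (line.startsWith("-") || line.startsWith("+")) {
      if (!current) {
        current = { removed: [], added: [] };
        groups.push(current);
      }
      if (line.startsWith("-")) current.removed.push(line.slice(1));
      else current.added.push(line.slice(1));
    } else {
      current = null; // an unchanged context line ends the current group
    }
  }
  return groups;
}
```

Each group then pairs up the removed lines with the added lines that replaced them, ready for normalised comparison.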
It looks like the main AST parsing/comparison logic is in `normalizeIdentifierNamesInCode` / `normalizeIdentifierNamesInCodeV2` / `normalizeIdentifierNamesInCodeV3`.

You can also see the pre/post-processing steps I was taking to try and hack the 'arbitrary bits of JavaScript' into something that would parse in `prepareCodeForASTParsing` / `prepareCodeForASTParsingV2` / `prepareCodeForASTParsingV2PreProcess` / `prepareCodeForASTParsingV2PostProcess` / etc.
But after all of that, I think my ultimate conclusion (that I have yet to fully test/polish/turn into my final tool) was that I think I can get pretty good results, with minimal pre/post-hacking required, by just using acorn
's tokenizer (Ref), rather than fully parsing the code into an AST.
`tokenizer(input, options)` returns an object with a `getToken` method that can be called repeatedly to get the next token, a `{start, end, type, value}` object (with added `loc` property when the `locations` option is enabled and `range` property when the `ranges` option is enabled). When the token's type is `tokTypes.eof`, you should stop calling the method, since it will keep returning that same token forever.

Note that tokenizing JavaScript without parsing it is, in modern versions of the language, not really possible due to the way syntax is overloaded in ways that can only be disambiguated by the parse context. This package applies a bunch of heuristics to try and do a reasonable job, but you are advised to use `parse` with the `onToken` option instead of this.
I need to play around with that more, and may end up going back to another library (eg. if I can use Babel's tokenizer or similar directly). I originally was looking at acorn
as it has acorn-loose
(Babel didn't seem to have an equivalent loose parser), but it's a somewhat outdated parser I believe:
This parser will parse any text into an ESTree syntax tree that is a reasonable approximation of what it might mean as a JavaScript program.
It will, to recover from missing brackets, treat whitespace as significant, which has the downside that it might mis-parse a valid but weirdly indented file. It is recommended to always try a parse with the regular acorn parser first, and only fall back to this parser when that one finds syntax errors.
I don't think this exact case of a string being split across multiple lines occurs in the real code, but something similar happened when I was first testing that broke most of the parsing. I might check again later if the exact context would be useful.
@michaelskyba nods, fair, makes sense.
I see, that should work, but it confirms that there is no simple, universal solution.
@michaelskyba Well, I guess in the end, depending on what you mean by 'universal solution', I think the tokenizer approach is likely pretty close to a 'universal solution'.
Yeah, that is essentially what I was thinking, except I wanted to only search for completely new entries.
@michaelskyba Curious why you want to 'only search for completely new entries' at this stage of things within the AST parsing tool? I believe you'll get basically the same outcome (with simpler code) by embracing the unix philosophy (Ref) of combining tools. That way you can have the AST parsing script output all of the strings, and then just do a standard git diff
/ similar between those 'full string extracted files' (as I described earlier); which would then (I believe) produce the view of the 'only completely new entries' that you're after.
I'll see if I can make the string/identifier --> find insert line system work.
@michaelskyba Curious to see your approach here if you do get it to work.
(As just because the method I described above is how I believe it would be simplest to implement, doesn't mean that there isn't something I'm missing/unaware of/etc that would end up being an even better way)
Example: We have three versions of a JS file: 0.js, 1.js, 2.js
The idea I'm describing in that quote is to do the following:
- We extract the strings/identifiers of 0.js and put them into 0.txt.
- Then, we extract strings/identifiers of 1.js, but only keep the ones not in 0.txt, and put those in 1.txt. Now, you could look at 1.txt to see the new strings/identifiers introduced in the 0.js --> 1.js revision.
- Then, we extract the strings/identifiers of 2.js, but only keep the ones not in 0.txt, and not in 1.txt, and put those in 2.txt. Now, this will be the new strings/identifiers introduced in the 1.js --> 2.js revision.
- etc.
This is the alternative to trying to first look at the diff between 0.js and 1.js or 1.js and 2.js, and then running the script on that diff as input. With the former idea of considering each file separately, the only input being passed to the AST program will be valid JavaScript, so it should always work as intended.
The downside is that if you look into 2.txt and see e.g. "DynamicModelSwitcherEnabled", you'd want to see which part of the 1.js --> 2.js diff shows that being introduced, which 2.txt doesn't tell you directly.
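The per-build filtering described above boils down to a simple set difference (a sketch with hypothetical helper names, not from any actual tool):

```javascript
// Sketch: given the entries extracted from the current build, keep only the
// ones never seen in any earlier build's extraction output.
function newEntries(currentEntries, previousExtractions) {
  const seen = new Set(previousExtractions.flat());
  return currentEntries.filter((entry) => !seen.has(entry));
}
```

So `2.txt` would be `newEntries(entriesFrom2Js, [entriesFrom0Js, entriesFrom1Js])`, and since each input is a whole, valid JavaScript file's extraction output, the diff-parsing problem never arises.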
@michaelskyba nods, right, that makes sense. Though I'm curious, how do you see the method I described above/previously comparing with a pattern like this (better/worse/etc?)? Where you basically:
- extract all of the strings from `build-0.js` and put them in `0.txt`
- extract all of the strings from `build-1.js` and put them in `1.txt`
- extract all of the strings from `build-2.js` and put them in `2.txt`
- now to see the new strings between `build-0` and `build-1` we can just `diff 0.txt 1.txt`
- or to see the new strings between `build-1` and `build-2` we can just `diff 1.txt 2.txt`
- though this would also give us the flexibility to easily see the new strings between `build-0` and `build-2` with just `diff 0.txt 2.txt`
If this '[filename]-strings.txt' file was committed alongside each `*.js` file in each build in the git repo, then you would be able to diff any arbitrary build version against any other version without having to re-run the string extraction script.
Or if we didn't want to store those files in the git repo, we could still diff any arbitrary build version against another, but would just need to re-run the extraction script each time. I haven't tested this, so it may need slight tweaks, but this is the basic command ChatGPT drafted for doing this when I asked:
```shell
diff \
  <(git show commit1:path/to/example.js | node extract-strings.js) \
  <(git show commit2:path/to/example.js | node extract-strings.js)
```
Here, I'm talking about a potential solution to the downside I mentioned above. If 2.txt can report that "DynamicModelSwitcherEnabled" was found on line L in 2.js, it could be possible to look at the diff between 1.js and 2.js, and find where line L is added.
Hopefully that makes a bit more sense?
@michaelskyba Right, yup, I think I'm following now.
Curious if the method I described above is a satisfactory solution to that, or if you would still be wanting to try and manually keep track of this sort of data between the 'pre-diffed' `*.txt` files in the method you described previously?
Quickly skimming through the docs on using predicate S-expressions:
@michaelskyba I haven't tested this at all, so it might not be perfectly correct, but this is what ChatGPT drafted up based on the method I was talking about with using the predicate function of `go-tree-sitter`:
```go
package main

import (
	"fmt"
	"log"
	"os"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/javascript"
)

func main() {
	// Load your JavaScript code from a file
	content, err := os.ReadFile("your_file.js")
	if err != nil {
		log.Fatalf("Failed to read file: %v", err)
	}

	// Parse the content using tree-sitter
	parser := sitter.NewParser()
	parser.SetLanguage(javascript.GetLanguage())

	// Create a tree from the content
	tree := parser.Parse(nil, content)

	// Prepare the query
	queryString := `(string) @string`

	// Compile the query
	query, err := sitter.NewQuery([]byte(queryString), javascript.GetLanguage())
	if err != nil {
		log.Fatalf("Failed to create query: %v", err)
	}
	defer query.Close()

	// Create a query cursor
	cursor := sitter.NewQueryCursor()
	defer cursor.Close()

	// Execute the query
	cursor.Exec(query, tree.RootNode())

	// Iterate over the matches
	for {
		match, ok := cursor.NextMatch()
		if !ok {
			break
		}
		for _, capture := range match.Captures {
			// Get the node for the current capture
			node := capture.Node
			// Print the content of the string literal
			fmt.Println(node.Content(content))
		}
	}
}
```
This isn't specifically related, but I described it poorly initially, so to clarify, I think identifiers could be useful indicators too, and not just strings, so I'm trying to extract both types.
@michaelskyba Yeah, if your goal is to include what you're working on in this repo, then matching the existing style of JS + Babel is ideal (unless there is a good reason to differ from it).
Though if this just ends up being your own personal exploration for learning, or to create a tool in its own repo, obviously you can choose whatever language/lib choices you want to use/explore.
Hmm, that's a good point. Using Go or tree-sitter at all is probably already a more significant departure than the actual style of manual recursion vs visitor, or if statement vs node type pattern. I might just continue storing it in mine for now, and then perhaps make a rewrite in the future if it would make sense.
@michaelskyba nods, makes sense. So in my WIP PoC diff minimiser that I recently uploaded (Ref) I first parse the diff format using `parse-diff`, and then I do the AST parsing on the source chunks from within that parsed structure; which as mentioned above, still has the issue of 'arbitrary segments of javascript'.
- The `main` function calls `parseArgsAndReadInput` to read in the raw diff (from a `git diff`/similar)
- It then calls `parseDiff` to parse the diff into objects
- It then iterates through the files in the diff, then the diff chunks in each of those files
- `groupChunkChanges` is called to group the added/removed lines within a chunk
- Those `groupedChanges` are then iterated through, and compared with `compareNormalizedCodeDiffs`
- Then later on a bunch of stuff is done to basically remove the changes that were considered equal, then output the newly filtered diff in the same format as was originally read in

It's currently quite messy/hacky code, with multiple implementations scattered throughout as I was exploring different ways of doing things (eg. `compareNormalizedCodeDiffs`, `compareNormalizedCodeDiffsV2`, etc)
That makes more sense now, and is pretty sophisticated. Despite the chunks being arbitrary segments, I'd expect it to perform significantly better out of the box than running the AST on the raw diff. The string split across multiple lines scenario I mentioned, for example, shouldn't ever happen, because it is only caused by the second file's chunk

```js
console.log("foo \
bar")
```

being parsed as if it was part of the first:

```js
console.log("bar \
console.log("foo \
bar")
```
But after all of that, I think my ultimate conclusion (that I have yet to fully test/polish/turn into my final tool) was that I think I can get pretty good results, with minimal pre/post-hacking required, by just using `acorn`'s tokenizer (Ref), rather than fully parsing the code into an AST.

I need to play around with that more, and may end up going back to another library (eg. if I can use Babel's tokenizer or similar directly). I originally was looking at `acorn` as it has `acorn-loose` (Babel didn't seem to have an equivalent loose parser), but it's a somewhat outdated parser I believe:

@michaelskyba Well, I guess in the end, depending on what you mean by 'universal solution', I think the tokenizer approach is likely pretty close to a 'universal solution'.
Oh, I see, I previously misunderstood what the context and role of the tokenizer was. Then yeah, I retract that; it should be general enough. In the case of only doing the extraction, it still looks like it'd be easier to run an AST before the diff, though.
@michaelskyba Curious why you want to 'only search for completely new entries' at this stage of things within the AST parsing tool? I believe you'll get basically the same outcome (with simpler code) by embracing the unix philosophy (Ref) of combining tools. That way you can have the AST parsing script output all of the strings, and then just do a standard
git diff
/ similar between those 'full string extracted files' (as I described earlier); which would then (I believe) produce the view of the 'only completely new entries' that you're after.
There I (ambiguously) meant the "search" in the context of the outer diff, not inside the same AST parsing program. I'm generally a fan of the Unix philosophy, and try to follow it when, for example, writing personal shell scripts.
I was thinking of directly using a standard `git diff` between files, but I see a few issues with it.

- I want to only look at the new entries and not the removed entries.
- If an entry is removed in a prior version of the code and then added back in the latest version, it would show up in the diff, but I would want it to be filtered out.
- The line numbers couldn't be directly included in the file, or else the diff would count the same entry on a different line as a different entry.

All of those are pretty easy to solve, though.
Maybe it'd make the most sense to only store the line-numbered versions of the entries of each file. Then, diff could still be used by removing the line numbers on the fly with something like `cut -d ":" -f 2`. For the completely new entries, we could have something similar (this literal example wouldn't work) to `cat 0.txt 1.txt | cut -d ":" -f 2 | sort | uniq > $tmp` and then `grep -v -f $tmp 2.txt`.
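That line-number handling could also be sketched in a few lines of dependency-free JS (hypothetical helper names; splitting only on the first `:` so entries that themselves contain colons survive):

```javascript
// Sketch: entries are stored as "lineNumber:entry"; strip the numbers for
// comparison, and keep only entries not seen in any earlier file's entries.
function stripLineNumber(line) {
  return line.slice(line.indexOf(":") + 1);
}

function onlyNewNumberedEntries(latestLines, earlierFiles) {
  const seen = new Set(earlierFiles.flat().map(stripLineNumber));
  return latestLines.filter((line) => !seen.has(stripLineNumber(line)));
}
```

The surviving lines keep their line numbers, which is what later makes it possible to jump from an interesting entry back to its location in the source.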
I'll see if I can make the string/identifier --> find insert line system work.
@michaelskyba Curious to see your approach here if you do get it to work.
(As just because the method I described above is how I believe it would be simplest to implement, doesn't mean that there isn't something I'm missing/unaware of/etc that would end up being an even better way)
I think this part should be compatible with your method. That was just an idea for finding the context of an entry inside the js --> js diff.
@michaelskyba nods, right, that makes sense. Though I'm curious, how do you see the method I described above/previously comparing with a pattern like this (better/worse/etc?)? Where you basically:
- extract all of the strings from `build-0.js` and put them in `0.txt`
- extract all of the strings from `build-1.js` and put them in `1.txt`
- extract all of the strings from `build-2.js` and put them in `2.txt`
- now to see the new strings between `build-0` and `build-1` we can just `diff 0.txt 1.txt`
- or to see the new strings between `build-1` and `build-2` we can just `diff 1.txt 2.txt`
- though this would also give us the flexibility to easily see the new strings between `build-0` and `build-2` with just `diff 0.txt 2.txt`
If this '[filename]-strings.txt' file was committed alongside each `*.js` file in each build in the git repo, then you would be able to diff any arbitrary build version against any other version without having to re-run the string extraction script.

Curious if the method I described above is a satisfactory solution to that, or if you would still be wanting to try and manually keep track of this sort of data between the 'pre-diffed' `*.txt` files in the method you described previously?
Originally I was thinking of only storing new entries in each txt file, but I think you're right that having `2.txt` contain all the entries from `build-2.js` would provide much more flexibility. However, as I mentioned above, I don't think `diff` by itself would be sufficient for analyzing the resulting txt files.
I think the line numbers would have to be stored in the pre-diffed txt files, or there'd be a lot of additional complexity for locating lines in the js --> js diff.
Or if we didn't want to store those files in the git repo, we could still diff any arbitrary build version against another, but would just need to re-run the extraction script each time. I haven't tested this, so it may need slight tweaks, but this is the basic command ChatGPT drafted for doing this when I asked:
```shell
diff \
  <(git show commit1:path/to/example.js | node extract-strings.js) \
  <(git show commit2:path/to/example.js | node extract-strings.js)
```
Yeah, something like this could also work. I guess the convenience will depend on how long the extractor will take to run.
I think identifiers could be useful indicators too, and not just strings, so I'm trying to extract both types.
@michaelskyba Identifiers churning between builds are sort of specifically the thing that creates the noise we're trying to eliminate though..?
I might just continue storing it in mine for now, and then perhaps make a rewrite in the future if it would make sense.
@michaelskyba That's fair :)
That makes more sense now, and is pretty sophisticated. Despite the chunks being arbitrary segments, I'd expect it to perform significantly better out of the box than running the AST on the raw diff.
@michaelskyba nods it seems to work pretty well. Another alternative is to not do the text-diff first, and just do an 'AST diff' or similar (while ignoring identifier names changing), but I believe that would run a lot slower (or at least, the existing tools I tried definitely seemed to be slower)
In the case of only doing the extraction, it still looks like it'd be easier to run an AST before the diff, though.
@michaelskyba See comment above RE: this. Speed issues + laziness for not wanting to implement my own AST diff algorithm aside, that would probably be the 'ideal' 'pure' way to achieve this. I was trying to leverage a text based diff as a starting point for speed reasons.
I was thinking of directly using a standard `git diff` between files, but I see a few issues with it.
- I want to only look at the new entries and not the removed entries.
- If an entry is removed in a prior version of the code and then added back in the latest version, it would show up in the diff, but I would want it to be filtered out.
- The line numbers couldn't be directly included in the file, or else the diff would count the same entry on a different line as a different entry.
All of those are pretty easy to solve, though.
@michaelskyba Of those, 1 would be almost trivial to handle as is; 2 would be slightly more annoying (though I also wonder how often that case would come up in reality, and if it needs to be 'solved for'); 3 it seems you already have some decent ideas on how to work around.
That was just an idea for finding the context of an entry inside the js --> js diff.
@michaelskyba Makes sense :)
Originally I was thinking of only storing new entries in each txt file, but I think you're right that having `2.txt` contain all the entries from `build-2.js` would provide much more flexibility. However, as I mentioned above, I don't think `diff` by itself would be sufficient for analyzing the resulting txt files.

I think the line numbers would have to be stored in the pre-diffed txt files, or there'd be a lot of additional complexity for locating lines in the js --> js diff.
@michaelskyba nods yeah, makes sense.
Yeah, something like this could also work. I guess the convenience will depend on how long the extractor will take to run.
@michaelskyba nods yeah, exactly. I would assume it would run too slow and so be kind of annoying; but that could be improved by implementing a local cache (outside of the git repo) type thing such that for a given build, it only ever needs to run the actual parsing code once, and then would lookup from the cache after that (unless cleared).
Another reason I would think about committing the intermediary files is then you can sort of 'interactively explore' them through GitHub's diff interface; which was also one of the reasons I started committing the builds themselves (though the noise of the identifier churn between them sort of makes it a bit unusable for that purpose currently unfortunately)
@michaelskyba Identifiers churning between builds are sort of specifically the thing that creates the noise we're trying to eliminate though..?
I haven't looked at the diffs that much but from what I've seen, the churn only affects a limited number of the identifiers, such as the ones comprised of 1-2 characters like

```js
e.d(t, {
  HV: function () {
    return N;
  },
  N8: function () {
    return S;
  },
  aP: function () {
    return m;
  },
  do: function () {
    return R;
  },
```
This is why I wanted to be able to look at introduced identifiers that have never been seen in any previous JavaScript file version. That way, these common churn identifiers would be found and eliminated, and then we'd only start seeing the more interesting ones like `activityRegularizer` or `checkGate`.
@michaelskyba nods it seems to work pretty well. Another alternative is to not do the text-diff first, and just do an 'AST diff' or similar (while ignoring identifier names changing), but I believe that would run a lot slower (or at least, the existing tools I tried definitely seemed to be slower)
@michaelskyba See comment above RE: this. Speed issues + laziness for not wanting to implement my own AST diff algorithm aside, that would probably be the 'ideal' 'pure' way to achieve this. I was trying to leverage a text based diff as a starting point for speed reasons.
Interesting, I didn't think about a diff on the AST level. But yeah, I guess it isn't a common enough use case for there to have been a lot of work already out there on matching the efficiency of text diffs.
@michaelskyba Of those, 1 would be almost trivial to handle as is; 2 would be slightly more annoying (though I also wonder how often that case would come up in reality, and if it needs to be 'solved for'); 3 it seems you already have some decent ideas on how to work around.
Number 2 is relevant to the identifier churn mentioned above. I would assume that something like `aP` could be churned out in one revision and churned back in at a different revision.
@michaelskyba nods yeah, exactly. I would assume it would run too slow and so be kind of annoying; but that could be improved by implementing a local cache (outside of the git repo) type thing such that for a given build, it only ever needs to run the actual parsing code once, and then would lookup from the cache after that (unless cleared).
Yeah, if there's a way to store some piece of cache smaller than the full output that still significantly speeds up the execution time.
Another reason I would think about committing the intermediary files is then you can sort of 'interactively explore' them through GitHub's diff interface; which was also one of the reasons I started committing the builds themselves (though the noise of the identifier churn between them sort of makes it a bit unusable for that purpose currently unfortunately)
Yeah, somebody interested in some ~casual browsing would probably not want to clone the code and run the extractor themselves. Diffs between extraction output would probably be similarly difficult to look at directly, though.
I rewrote it (no line tracking yet) to use the following query pattern:

```
(string) @string
(template_string) @template
(identifier) @identifier
(property_identifier) @property
(shorthand_property_identifier) @shorthand
(comment) @comment
```
Here is the diff between the extractions (| sort | uniq) for two vendor- .js files I tested: https://gist.github.com/michaelskyba/fb93861fe6c4880138a586716e4beb18
If I only look for added entries rather than the diff itself, it's only 8 lines, not including the source map line that prettier added and the `find_this_test_123` I added as a test:
```shell
$ rg -x -F -v -f 1713546501-cf14668063190b4c-uniq.txt 1713620903-c41f7bc7c59d56a3-uniq.txt
311:find_this_test_123
1396:"16 16 12 12 8 16"
9002:"&dpl=9133c19ee9ae73caaad3f40e3ac2f20dd10e0fdf"
9003:"?dpl=9133c19ee9ae73caaad3f40e3ac2f20dd10e0fdf"
13739:`headlessui-menu-button-${(0, p.M)()}`
13740:`headlessui-menu-item-${(0, p.M)()}`
13741:`headlessui-menu-items-${(0, p.M)()}`
18089:"M20.39 18.39A5 5 0 0 0 18 9h-1.26A8 8 0 1 0 3 16.3"
24471:Re4
27786://# sourceMappingURL=vendor-c41f7bc7c59d56a3.js.map
```
(Edit: I realize now that sourceMappingURL is part of the original file and was not added by Prettier.)
The diff between the actual JS files was 13622 lines and just looked like identifier churn to me. Snippet:

```diff
@@ -140080,7 +139375,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
 return nS(e).getPropertyValue(t);
 }
 let nA = ["top", "right", "bottom", "left"];
-function nx(e, t, n) {
+function nT(e, t, n) {
 let r = {};
 n = n ? "-" + n : "";
 for (let i = 0; i < 4; i++) {
@@ -140089,14 +139384,14 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
 }
 return (r.width = r.left + r.right), (r.height = r.top + r.bottom), r;
 }
-let nT = (e, t, n) => (e > 0 || t > 0) && (!n || !n.shadowRoot);
+let nx = (e, t, n) => (e > 0 || t > 0) && (!n || !n.shadowRoot);
 function nC(e, t) {
 let n, r;
 let i = e.touches,
 o = i && i.length ? i[0] : e,
 { offsetX: a, offsetY: s } = o,
 l = !1;
-if (nT(a, s, e.target)) (n = a), (r = s);
+if (nx(a, s, e.target)) (n = a), (r = s);
```
I haven't looked at the diffs that much but from what I've seen, the churn only affects a limited number of the identifiers, such as the ones comprised of 1-2 characters like
This is why I wanted to be able to look at introduced identifiers that have never been seen in any previous JavaScript file version. That way, these common churn identifiers would be found and eliminated, and then we'd only start seeing the more interesting ones like `activityRegularizer` or `checkGate`.
@michaelskyba Mm, fair enough. There are definitely some that seem to not get minimised, but it's more like the majority of identifiers that get minimised (and thus churn) between builds.

Makes sense RE: the 'only new' aspect. Though you could probably also eliminate a lot of that noise (with minimal false positives) by filtering out short identifiers that match the patterns used in minification.
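That filtering could be sketched with a simple heuristic (a hypothetical pattern; the real minifier output in the actual builds would need checking):

```javascript
// Sketch: treat very short identifiers (1-2 chars, like e, t, n, aP) as
// probable minifier output, and keep only the 'meaningful' ones.
function filterMinifiedIdentifiers(identifiers) {
  const looksMinified = (name) => /^[A-Za-z_$][A-Za-z0-9_$]?$/.test(name);
  return identifiers.filter((name) => !looksMinified(name));
}
```

The trade-off is that any genuinely meaningful 1-2 character identifier would also be dropped, which seems like an acceptable false-positive rate for this use case.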
I guess it isn't a common enough use case for there to have been a lot of work already out there on matching the efficiency of text diffs.
@michaelskyba nods true true. You can see some of the tools I explored in this issue:
Such as:

- a structural diff that understands syntax
- A tree-sitter based AST difftool to get meaningful semantic diffs
I would assume that something like aP could be churned out in one revision and churned back in at a different revision.
@michaelskyba Yeah, absolutely would be.
Yeah, if there's a way to store some piece of cache smaller than the full output that still significantly speeds up the execution time.
@michaelskyba It wouldn't even necessarily need to be smaller than the full output. The key difference as I see it is whether it would be a file that you would track within the repo/etc; or more of a 'hidden away in the cache' thing.
Obviously depending on your workflow/etc, you could just manually store the pre-calculated versions somewhere outside of a git repo anyway, and it would have a similar effect.
Yeah, somebody interested in some ~casual browsing would probably not want to clone the code and run the extractor themselves. Diffs between extraction output would probably be similarly difficult to look at directly, though.
@michaelskyba Mm, to some degree probably yeah. Though I guess it depends how much of a 'casual browser' we're talking about, and how much we reduce the noise in whatever output format it is.
My ideal goal would be to be able to re-write the identifiers to something stable (yet meaningful) between builds + be able to account for modules that 'move around', such that the diff shown on github (for the `unpacked/` directory) would only show the actual meaningful code changes between each build.
Here is the diff between the extractions (| sort | uniq) for two vendor- .js files I tested

@michaelskyba It might be more interesting (and yet way larger) to look at the `_app.js` file; as that is often where the bulk of interesting changes seem to be.
If I only look for added entries rather than the diff itself, it's only 8 lines, not including the source map line that prettier added and the `find_this_test_123` I added as a test
@michaelskyba Neat!
The diff between the actual JS files was 13622 lines and just looked like identifier churn to me.
@michaelskyba Would be curious to see what that same diff would look like running through my diff minimiser (even in its current non-polished form):
Seemingly 104 lines of diff normally:
```shell
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | wc -l
104
```

And 66 after running through the minimiser:

```shell
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | wc -l
66
```
But a lot of those extra lines are context/diff boilerplate:
If we wanted to then filter it down to just the added lines, then we get 13:

```shell
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | grep '^+' | grep -v '^+++' | wc -l
13
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/vendor.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/vendor.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | grep '^+' | grep -v '^+++'
+ "&dpl=46b7f37021d2244b5b4454b927e0735229de5626"
+ return "?dpl=46b7f37021d2244b5b4454b927e0735229de5626";
+ this.#O.getQueryCache().notify({
+ type: "observerOptionsUpdated",
+ query: this.#A,
+ observer: this,
+ }),
+ this.#O.getMutationCache().notify({
+ type: "observerOptionsUpdated",
+ mutation: this.#Y,
+ observer: this,
+ }),
+//# sourceMappingURL=vendor-4cb082c626a8af6f.js.map
```
Doing that same example for `_app.js`:

```shell
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | wc -l
14051
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | wc -l
2901
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | grep '^+' | grep -v '^+++' | wc -l
1003
```
Personally, I find the 'only newly added' removes too much context and makes it harder to review; so that 2901 view is the type of one I would usually look at. It still has noise in it due to needing to refine the minimiser further; and there are some parts that are technically not noise (as I define it), that someone might find value in also tuning out, which could make it smaller again. But (aside from where huge blocks of code have moved, and currently lose their `--color-moved` aspect), I find it quite usable.
The diff I chose there was just the latest one though, and I have often seen `_app.js` diffs start off more like 60k+ lines; so obviously with those, the minimised version is still going to be larger than this example.
Here's a fun little hack I just thought up (while the diff minimiser script still breaks `--color-moved`):

- use the diff minimiser to produce a minimised diff patch
- apply that patch to a copy of the original file
- then run a `git diff` with `--color-moved` between the original file, and the original+minimisedDiffPatch file

Currently this doesn't work fully, as I think I'm not properly updating the line metadata in the diff hunks when I've removed sections from them in the diff minimiser.. but something like this (in theory):
```shell
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | \
  sed 's|a/unpacked/_next/static/chunks/pages/_app.js|a/_app.original.js|g' | \
  sed 's|b/unpacked/_next/static/chunks/pages/_app.js|b/_app.patched.js|g' \
  > _app.minimised.diff

⇒ git show 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js > _app.original.js

⇒ git apply _app.minimised.diff

⇒ git -c color.diff.oldMoved='white dim' \
  -c color.diff.oldMovedAlternative='white dim' \
  -c color.diff.newMoved='white dim' \
  -c color.diff.newMovedAlternative='white dim' \
  -c color.diff.newMovedDimmed='white dim' \
  -c color.diff.newMovedAlternativeDimmed='white dim' \
  diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js
```
Edit: I wrote a new `scripts/fix-diff-headers.js` script (https://github.com/0xdevalias/chatgpt-source-watch/commit/f7deec74cca9127bc2f18279d83f7a088f3f35f1) that seems to get it closer to working.
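The general idea behind fixing the headers can be sketched roughly like this (an illustrative approximation, not the actual `fix-diff-headers.js` code; function/variable names here are hypothetical): recompute the line counts in each `@@ -a,b +c,d @@` hunk header from the context/removed/added lines that actually remain after the minimiser has dropped some.

```javascript
// Sketch: recompute unified-diff hunk header counts after lines
// have been removed from the hunk bodies. The old count is the
// number of context (' ') + removed ('-') lines; the new count is
// context (' ') + added ('+') lines.
function fixHunkHeaders(diffText) {
  const lines = diffText.split('\n');
  const out = [];
  let i = 0;
  while (i < lines.length) {
    const m = lines[i].match(/^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@(.*)$/);
    if (!m) { out.push(lines[i++]); continue; }
    // Collect this hunk's body lines (stopping at file headers).
    const body = [];
    let j = i + 1;
    while (
      j < lines.length &&
      /^[ +-]/.test(lines[j]) &&
      !/^(\+\+\+|---) /.test(lines[j])
    ) {
      body.push(lines[j++]);
    }
    const oldCount = body.filter(l => l[0] === ' ' || l[0] === '-').length;
    const newCount = body.filter(l => l[0] === ' ' || l[0] === '+').length;
    out.push(`@@ -${m[1]},${oldCount} +${m[2]},${newCount} @@${m[3]}`);
    out.push(...body);
    i = j;
  }
  return out.join('\n');
}
```

Note that this only fixes the counts; the start offsets also drift once earlier hunks shrink, which may be why some hunks still fail to apply below.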
```shell
⇒ ./scripts/fix-diff-headers.js _app.minimised.diff _app.fixed.diff
Reading diff from _app.minimised.diff and writing corrected diff to _app.fixed.diff
Diff successfully corrected and written to _app.fixed.diff

⇒ patch _app.original.js _app.fixed.diff -o _app.patched.js
patching file _app.original.js
6 out of 90 hunks failed--saving rejects to _app.patched.js.rej

⇒ git -c color.diff.oldMoved='white dim' \
  -c color.diff.oldMovedAlternative='white dim' \
  -c color.diff.newMoved='white dim' \
  -c color.diff.newMovedAlternative='white dim' \
  -c color.diff.newMovedDimmed='white dim' \
  -c color.diff.newMovedAlternativeDimmed='white dim' \
  diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js
```
That way, instead of just seeing a giant chunk of 'removed' and a giant chunk of 'added' when this module moved, we can see the dimmed lines which were moved (but otherwise unchanged), and spend less time/effort having to consider them.
An even better version of this would be a diff `--color-moved` implementation that was able to ignore identifier names changing entirely (in which case I suspect it would mark this entire block as 'moved but not changed'); but as far as I am currently aware, the only way to do that would be to re-implement the whole `diff --color-moved` logic ourselves, which I believe would basically end up with us writing our own AST differ.
Though the way I was originally intending to fix/improve the diff minimiser script was to try just maintaining the ANSI colouring of the input lines, and outputting the lines with the same colouring at the end.
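As a rough sketch of that approach (hypothetical helper names, not actual script code): strip the SGR escape sequences only for the purpose of deciding whether a line is noise, and then emit the original, still-coloured line untouched.

```javascript
// Matches ANSI SGR colour sequences like "\x1b[32m" or "\x1b[0m".
const ANSI_RE = /\x1b\[[0-9;]*m/g;

// Filter coloured diff lines using a noise predicate that only ever
// sees the plain (colour-stripped) text; kept lines pass through
// with their original ANSI colouring intact.
function minimiseColouredDiff(lines, isNoise) {
  return lines.filter(line => !isNoise(line.replace(ANSI_RE, '')));
}
```

For example, with a toy predicate that treats very short lines as noise, `minimiseColouredDiff(['\x1b[32m+foo\x1b[0m', '\x1b[31m-e\x1b[0m'], plain => plain.length <= 2)` keeps only the first line, colour codes and all.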
This is an idea for a type of analysis of code diffs. This issue is just for tracking notes and ideas.
Example input 1
Here the extraction might be
Of course, this gives you far less information than the original, but I think it could be a good trade-off in cases where you want to look at the diff a little bit but don't have time to see everything.
Example input 2
Since common generic ones like e, t, n, "", and "." would show up frequently in any context, they would have already been seen in the past, and therefore filtered out. You'd more focus on the e.g. "Use Up and Down to choose options", with some kind of convenient way to jump back to see it in-context in the code.
For input like the following:
, none of the names or strings would probably be new, and so you wouldn't see it at all. This is intended, because I can't glean any conclusions from looking at it, and thus would prefer not to see it.
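The extraction idea above could be sketched very roughly like this (a regex-based approximation for illustration only; a real version would use an AST parser such as Babel or tree-sitter, where string literals and identifiers are distinct node types, as noted elsewhere in this thread):

```javascript
// Sketch: pull string literals and identifier-like names out of the
// added lines of a diff, dropping anything already seen in previous
// diffs (e.g. the generic minified names like e, t, n).
function extractNewTokens(addedLines, seenBefore) {
  const tokens = new Set();
  for (const line of addedLines) {
    // Pull string literals out first, so words inside them aren't
    // also matched as identifiers.
    const strings = line.match(/"[^"]*"|'[^']*'/g) || [];
    const rest = line.replace(/"[^"]*"|'[^']*'/g, '');
    const names = rest.match(/[A-Za-z_$][\w$]*/g) || [];
    for (const t of [...strings, ...names]) {
      if (!seenBefore.has(t)) tokens.add(t);
    }
  }
  return tokens;
}
```

Against a line like `e.push("Use Up and Down to choose options")`, with `e`, `t`, `n` already in the seen set, this surfaces the interesting string literal (and `push`) while filtering out the generic `e`.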
Glenn's comments
https://twitter.com/_devalias/status/1770284997385277554
For large diffs I think it'd be a lot, but strings and names are a subset of the raw diff, so it should still be less work than a full manual analysis. The idea is to just visually filter through them until you see a name/string that looks interesting on their own, which could lead to something good in-context.
Haven't worked with Babel but some relevant docs seem to be
Are there other AST parsers too? Would something like TreeSitter work? I'd generally prefer to avoid node.js if it's not required
I haven't seen enough diffs to anticipate exactly what these would look like, but there might be different solutions, like `--color-moved`, that could work depending on how it goes.
Related:
Yeah, I want to make a prototype and see if it will kind of work. I'm still not sure on the implementation, though; the most efficient system might be to integrate with a text editor, which would make it harder for others to replicate.