utterances-bot commented 2 years ago

Difftastic, the Fantastic Diff – Wilfred Hughes::Blog

programming, language design, and human factors

http://www.wilfred.me.uk/blog/2022/09/06/difftastic-the-fantastic-diff/

trishume commented 2 years ago

This is really cool! I noticed that this was the approach I used when I did tree diffing, and sure enough looks like this was inspired by autochrome which was inspired by my post.

I'm curious exactly why A failed here. It worked great for me, when used with a good heuristic. Lazy node construction worked really well with A in that in practice only near the diagonal of the massive square graph was constructed. I'm guessing it might have been complicated to design a good heuristic with an expanded move set? Although "the remaining diff must must be at least as long as the longest remaining side" seems pretty robustly applicable. I see autochrome had to abandon A* and has an explanation of why, but that explanation shouldn't apply to difftastic I think, or does it also try to handle reordering of top-level pieces?

tomarrell commented 2 years ago

This is brilliant, makes diffing Kube manifests significantly easier. Very impressed at how much easier it makes reviewing Go code to be as well.

It definitely feels like a diff which is much more similar to how you would approach refactoring the same code. Nicely done.

plops commented 2 years ago

Nice! How can this be used with git. I always get scared when I have to perform a git merge of a common lisp file because it is very hard to not loose any parens.

Wilfred commented 2 years ago

@trishume wow, great to hear from you! I read your blog post several times during the development of difftastic :).

I agree A* ought to be better than Dijkstra, but every time I've prototyped it, I saw performance get worse. My most recent attempt is here, where you can see the heuristic I used.

My theories as to why this is:

I'm using a radix heap in the Dijkstra implementation, whereas the A implementation uses the binary heap. The radix heap is faster, but it requires monotonically increasing items which I can't do in A.
My cost model for edges means that Dijkstra mostly explores matching nodes. A single list could have a cost of 40 for UnchangedNode, whereas if it's completely novel it requires NovelDelimiter (cost of 300) and several NovelAtom (cost of 300) edges.

I also explored some other graphing algorithms with even less success.

Iterative deepening (significantly slower)
Bidirectional Dijkstra (painful to implement correctly because the first node in the middle isn't necessary on the shortest path)
Chunking, where I stopped the graph search on the best node found according to the A* heuristic, and restarted the search (produced significantly longer paths so worse diffs)

Wilfred commented 2 years ago

@plops you can use difftastic with git: https://difftastic.wilfred.me.uk/git.html

jmcq commented 2 years ago

Nice. You wrote: "In this example, (foo (novel) (bar)) would be a totally valid, minimal diff." with novel and the 4th and 5th parentheses in green to signify insertion. But if novel were inserted after the second parenthesis, the result would be (foo (novelbar)), so the right paren that it added must follow "novel". Also you did not account that there is one space before but two after.

Would inserting "(novel) " be a lesser edit?

Perhaps a different example would be useful here?

agzam commented 2 years ago

Can you use it in Emacs, for things like diff-current-buffer-with-file? Can it be integrated to work with Magit?

benjaminy commented 2 years ago

Cool work.

Here's a related idea I've been working on that I'll just toss out into the æthernets to see if someone else takes it in an interestingly different direction.

I'm trying to get good merging in normal team branch-n-merge workflows for documents with complex structure. "Git databases" like Dolt and Liquidbase get you in the right ballpark. (But the documents I'm working with are more like really complicated diagrams with hierarchy and references and all manner of nastiness.)

The approach that I've taken is to make a repo-global unique ID for every "thing" (hard part: doing this efficiently with zero coordination between distributed users). This makes diffs and merges for committed versions pretty darn easy; almost trivial. But aligning the next save/commit with the repo IDs is tricky.

I'm not working in the context of programming language text, but I think similar ideas could be applied there. An interesting challenge that isn't relevant in my problem's context is the difference between abstract and concrete syntax. For good long-range diffing and merging I think you really want to focus on abstract syntax. But of course one works with code in concrete syntax. So I think you need to store both and merge them semi-independently.

Hefaistos68 commented 2 years ago

One thing that bugs me in pretty much all diff tools is that they are not able to interpret moved code parts, especially useful when refactoring. Seems that difftastic doesnt either.

benjaminy commented 2 years ago

@Hefaistos68 Detecting order changes is hard. The fundamental compromise in pretty much all practical diff algorithms is only considering the possibility of insertions, deletions and unit modifications. That reduces the space of possible diffs from factorial to quadratic. But it comes at the cost of making it impossible to consider order changes.

I don't know if this is a trick anyone has tried, but it might be cool... Hash all the subtrees and allow for the possibility of order changes only when there are exact hash matches. I think this would be far too brittle with concrete syntax. (One whitespace change would break hash equality.) But it might be reasonably efficient and robust with abstract syntax, especially if you engineered your abstract syntax encoding to be as "stable" as possible.

Hefaistos68 commented 2 years ago

My guess it really works only when considering subtrees, and even then it must be context sensitive. Most diffs that are somewhat capable of showing moved items dont consider the context and then its just a text block comparism. Mostly all between the original location and the new location of the first found block (subtree) is considered a change, totally wrong since the all inbetween might be a subtree on its own which didnt change location in its context. Complicated topic.

QuarticCat commented 2 years ago

Hi, I would like to try some optimizations. You mentioned that you tried A* and it was slower. So could you upload the benchmarks you used so that other contributors don't have to write their own one?

Wilfred commented 2 years ago

@agzam I know of some users using difftastic with magit, check out https://tsdh.org/posts/2022-08-01-difftastic-diffing-with-magit.html

@quarticcat There are some example benchmarks in the manual: https://difftastic.wilfred.me.uk/contributing.html#profiling

Wilfred / wilfred.github.com

Difftastic, the Fantastic Diff – Wilfred Hughes::Blog #12

Difftastic, the Fantastic Diff – Wilfred Hughes::Blog