eclipse-platform / eclipse.platform

https://eclipse.dev/eclipse/
Eclipse Public License 2.0
81 stars 112 forks source link

Make Text Compare more smart #66

Open laeubi opened 2 years ago

laeubi commented 2 years ago

I recently noticed that text-compare seems to present a undesirable changeset if the following happen:

  1. You have some repeated code block with common start (e.g. <dependency> in a pom.xml or @Test in a Unit test) and ending (e.g. </dependency> in a pom.xml or } in a Unit test
  2. You insert a new block in between
  3. Now the change set spans into the next block

grafik

What would be expected is that the whole block is marked as an insertion (here line 74 - 78)

Github for example displays this more nice

grafik

vogella commented 2 years ago

That would be an awesome enhancement, I moved the issue to team repo, as this is there compare is located.

jukzi commented 1 year ago

@laeubi would you please test the proposed fix if it fits your request?

jukzi commented 1 year ago

won't fix

laeubi commented 1 year ago

@jukzi why won't fix?

jukzi commented 1 year ago

well, no one seems to take the time to implement or even test.

laeubi commented 1 year ago

I'm a bit short of time atm, but even if not two days are really not an indication of no interest, I think its still worth and your code should not be lost ...

Bananeweizen commented 1 year ago

The diff algorithm is based on finding longest common subsequences (LCS) between both files. By design, that will always make the common (unchanged) parts as large as possible. To make the result more human readable, the changes created by the LCS would have to be post-processed to create more balanced (instead of longest) common subsequences between the changes. A simple heuristic would be to modify all changes where there are identical lines at the end of the change and immediately before the change, and where the first line of a change has different indentation than the last line, and where that does not overlap with another change. I'll check if I can get that working...

sratz commented 1 year ago

The new diff algorithm introduced in #323 does not really solve the problem.

It just got lucky for that one scenario.

E.g. compare the following scenarios:

1) 6 lines added Before: image After: image

2) 5 lines added + additional whitespace line Before: image After: image

3) 5 lines added without additional whitespace line Before: image After: image

So it's really just very specific situations that benefit from the balancing algorithm. In (most) others, it doesn't bring the desired result.

I propose to revert #323 for now seeing as this has some side effects (#458) and also the algorithm is quite expensive for very large diffs with repeating lines.

The whole diff calculator probably needs to be re-designed somehow.

jukzi commented 2 months ago

Seems like line based diff like LCS is unsuitable for tree structure documents. There are more advanced algorithms for xml. See for example: X-Diff: An Effective Change Detection Algorithm for XML