bhousel / node-diff3

A node.js library for text diffing and three-way-merge
Other
91 stars 13 forks source link

Incorrect merge boundary placement #9

Closed jakesmo closed 4 years ago

jakesmo commented 6 years ago

I have not had much time to investigate this in further detail, and I believe it may be an edge case that does not come up often in traditional diff3 applications, however, I am providing my findings in case this is unintended behavior.

When running diff3.merge on both fs files and strings containing HTML markup (mostly english sentences), with and without preprocessing to place each tag/word on its own line (thought that may be relevant), the following scenario results in a merge boundary that semantically adds additional text to the merged output instead by placing the text outside of the merge boundaries instead of inside:

Original (ancestor) file contains a sequence of spaced words. File A adds a word to the sequence. File B adds a word in the same place in the sequence, such that the last x letters match the first x letters of the following word.

For example: x=2

O: 'I was touring'
A: 'I was here touring'
B: 'I was into touring'

Result:   'I was <<<<<<<<<here =========in>>>>>>>>>to touring'

In this case, the desired result is 'I was <<<<<<<<<here =========into>>>>>>>>> touring'.

I was able to replicate this for x=1 and x=3.

Could this be because the algorithm greedily examines the next token(s) and thus assumes any minimal match means it is done choosing text to be manually merged? Or perhaps I need to do additional processing on the data to achieve the result I want? Thank you.

bhousel commented 6 years ago

Whoa that is unexpected, I agree it does look like a bug.. Thanks for reporting!

jakesmo commented 6 years ago

I must correct my previous post and say that the value of x is irrelevant. It appears the bug occurs whenever the parser stumbles upon any single letter that matches the first letter of the common word immediately following the merge boundary:

This is a [[[[[[super cool =========func]]]]]]tional test

instead of

This is a [[[[[[super cool =========functional ]]]]]]test

bebbi commented 6 years ago

Hey @bhousel, any progress on this? Or have you got an intuition on where to look?

bhousel commented 4 years ago

Ok so I've been working on this for a few days now.. I needed to rework the code a bunch and add some console logging to finally understand what is going on. Here's a brain dump..

diffIndices result contains the instructions needed to go from buffer1 to buffer2: For a three way merge, we perform this for both O-A and O-B.

diffIndices(O, A):
buffer1: "was touring"
buffer2: "was here touring"
result:
{"buffer1":[4,0],"buffer1Content":"","buffer2":[4,5],"buffer2Content":"here "}

This result means: o: "was touring" o1: insert "here" at position 4: "was here touring"

diffIndices(O, B):
buffer1: "was touring"
buffer2: "was into touring"
result:
{"buffer1":[4,0],"buffer1Content":"","buffer2":[4,2],"buffer2Content":"in"}
{"buffer1":[6,0],"buffer1Content":"","buffer2":[8,3],"buffer2Content":" to"}

This result means: o: "was touring" o1: insert "in" at position 4: "was intouring" o2: insert " to" at position 8: "was into touring"

At this point, neither of these are technically wrong.

diff3MergeRegions figures out the 3-way merge. (It was called "diff3MergeIndices" before, but I just couldn't wrap my head around the inscrutable arrays, and refactored it a bunch to be more javascripty). It processes both of these diffs and determines where the conflicts are.

diff3MergeRegions hunks (length 3):
{"ab":"a","oStart":4,"oLength":0,"abStart":4,"abLength":5,"abContent":"here "}
{"ab":"b","oStart":4,"oLength":0,"abStart":4,"abLength":2,"abContent":"in"}
{"ab":"b","oStart":6,"oLength":0,"abStart":8,"abLength":3,"abContent":" to"}

diff3MergeRegions results: [
  { stable: true, buffer: 'o', bufferStart: 0, bufferLength: 4, bufferContent: 'was ' },
  { stable: false, 
    aStart: 4, aLength: 5, aContent: 'here ', 
    oStart: 4, oLength: 0, oContent: '', 
    bStart: 4, bLength: 2, bContent: 'in' },
  { stable: true, buffer: 'o', bufferStart: 4, bufferLength: 2, bufferContent: 'to' },
  { stable: true, buffer: 'b', bufferStart: 8, bufferLength: 3, bufferContent: ' to' },   // 😭
  { stable: true, buffer: 'o', bufferStart: 6, bufferLength: 5, bufferContent: 'uring' } 
]

so this unfortunately produces: ["was ", "here"/"in", "to", " to", "uring"] The extra " to" from B is treated as a safe insert, as it occurs outside of an unstable (conflict) region.

SO.. It's not technically wrong, but it is definitely doing the wrong thing, and I have to think of how to catch these situations.

bhousel commented 4 years ago

Dropping in here that there are other diffing algorithms than computing the LCS between buffers. Paul Heckel's diff algorithm looks cool: http://documents.scribd.com/docs/10ro9oowpo1h81pgh1as.pdf https://stackoverflow.com/questions/42755035/difficulty-understanding-paul-heckels-diff-algorithm

I don't really want to implement a new algorithm though, so what I'll probably do is check if the input buffers are strings, and if so turn them into arrays by splitting them on whitespace. This Diff3 code is really designed to work on arrays, and it only happens to work on strings because they are iterable and array-like in Javascript.

I feel like that's a reasonable "principle of least astonishment" approach.