Sub-optimal output from diff

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Ran the following Python code:  

  from diff_match_patch import diff_match_patch
  dmp = diff_match_patch()

  text1="A.A.F.z.z"
  text2="x.x.F.A.A.F"

  diff = dmp.diff_main( text1, text2, False )
  print diff

What is the expected output? What do you see instead?

Expected:
  [(1, 'x.x.F.'), (0, 'A.A.F'), (-1, '.z.z')]

Actual:
  [(-1, 'A'), (1, 'x'), (0, '.'), (-1, 'A'), (1, 'x'), (0, '.F.'), (-1, 'z'), (1, 'A.A'), (0, '.'), (-1, 'z'), (1, 'F')]

What version of the product are you using? On what operating system?
Latest diff-match-patch checked out from SVN (rev 97). Mac OS X.

Please provide any additional information below.
The expected output seems easier to read. The actual output has the same length 
D-path (5) as the expected output and Myers' algorithm seems to be finding that 
one first. So I'm not sure if this is a bug in diff-match-patch or a 
performance trade-off?

Original issue reported on code.google.com by m...@dixon.se on 3 Nov 2011 at 1:19

GoogleCodeExporter commented 8 years ago

Behaviour confirmed.  As you pointed out, the two options are tied as the 
shortest possible solution.  In this case it happens it finds the ugly one 
first.

If you want an output that is easy to read, the intent is to pass the raw diff 
to diff_cleanupSemantic() which has some reasonably effective heuristics to 
clean up the results.  However in this case it fails to detect the commonality. 
 It is worth noting that if one reverses text1 and text2 it behaves perfectly.

The issue appears to be an oversight in diff_cleanupSemantic().  Currently it 
checks for this case:
# Find any overlaps between deletions and insertions.
# e.g: <del>abcxxx</del><ins>xxxdef</ins>
#   -> <del>abc</del>xxx<ins>def</ins>

The case of
<del>xxxabc</del><ins>xxxdef</ins>
and
<del>abcxxx</del><ins>defxxx</ins>
is already handled by the diff algorithm itself.

What's missing is this heuristic:
<del>xxxabc</del><ins>defxxx</ins>
 -> <ins>def</ins>xxx<del>abc</del>

Working on it...
[Nice bug report BTW.  You know your stuff.]

Original comment by neil.fra...@gmail.com on 3 Nov 2011 at 6:13

Changed state: Started

GoogleCodeExporter commented 8 years ago

Code complete.  Out for review...

Original comment by neil.fra...@gmail.com on 4 Nov 2011 at 2:06

GoogleCodeExporter commented 8 years ago

...complete.  Thanks for pointing out this sub-optimal output!

Original comment by neil.fra...@gmail.com on 9 Nov 2011 at 9:33

Changed state: Fixed

gbrault / google-diff-match-patch

Sub-optimal output from diff #57