Sol110 / google-diff-match-patch

Automatically exported from code.google.com/p/google-diff-match-patch
Apache License 2.0

Potential improvement to semantic cleanup? #60

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Go to: http://neil.fraser.name/software/diff_match_patch/svn/trunk/demos/demo_diff.html
2. Enter the following text for version 1:

The centrifugal moult is modified in the tail feathers. The general pattern 
seen in passerines is that the primaries are replaced outward, secondaries 
inward, and the tail from center outward. 

3. Enter the following text for version 2:

The centrifugal moult is modified in the tail feathers. The greater primary 
coverts are moulted in synchrony with the primary that they overlap. The 
general pattern seen in passerines is that the primaries are replaced outward, 
secondaries inward, and the tail from center outward. 

What is the expected output? What do you see instead?

The output is correct. However, from a human / semantic point of view the 
algorithm picks the wrong "The" to be the inserted one. Looking at the output I 
would have expected that the second "The" in version 2 would be the inserted 
one, not the third "The".

Could a semantic cleanup step compare the last character of the preceding 
equal run with the last character of the inserted run and, while they are the 
same, shift the insertion one character to the left? 
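
To make the proposal concrete, here is a minimal Python sketch of that swap loop. It does not use the library; the function name `shift_left` and the three-string representation (equal run, inserted run, equal run) are assumptions made for illustration, and the input is a shortened version of the diff described above.

```python
def shift_left(prev_equal, edit, next_equal):
    """Apply the proposed rule: while the last character of the preceding
    equal run matches the last character of the edit, move the edit one
    character to the left. Both texts are preserved by each step."""
    while prev_equal and edit and prev_equal[-1] == edit[-1]:
        ch = prev_equal[-1]
        prev_equal = prev_equal[:-1]    # equal run gives up its last character
        edit = ch + edit[:-1]           # edit now starts one character earlier
        next_equal = ch + next_equal    # following equal run takes the character
    return prev_equal, edit, next_equal


# Shortened form of the reported diff: the third "The" is inside the insertion.
prev_equal = "modified in the tail feathers. The "
edit = ("greater primary coverts are moulted in synchrony with the primary "
        "that they overlap. The ")
next_equal = "general pattern"

shifted = shift_left(prev_equal, edit, next_equal)
print(shifted)
```

On this input the loop passes through the split where the insertion starts at the second "The", but it does not stop there: the space and the full stop before it also match, so the insertion ends up starting at ". The greater". The rule as stated would therefore need some criterion for where to stop.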

What version of the product are you using? On what operating system?
This version:
http://neil.fraser.name/software/diff_match_patch/svn/trunk/demos/demo_diff.html

Original issue reported on code.google.com by m...@dixon.se on 11 Nov 2011 at 2:07

GoogleCodeExporter commented 9 years ago
That's odd, diff_cleanupSemanticScore already has a weighting for hugging 
punctuation.  It should already be doing exactly what you suggest.  I'll look 
into it.  Thanks!
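
For readers unfamiliar with the idea of weighting boundaries, the following Python sketch slides an insertion across its matching neighbours and keeps the position whose two boundaries score best. It is not the library's diff_cleanupSemanticScore: the names `boundary_score` and `best_shift` and the score values are hypothetical, chosen only to illustrate a preference for splits that hug sentence-ending punctuation.

```python
def boundary_score(left, right):
    """Hypothetical score for a split between `left` and `right`.
    Higher means a more natural place to cut."""
    if not left or not right:
        return 4                               # edge of the text
    a, b = left[-1], right[0]
    if not a.isalnum() and not a.isspace() and b.isspace():
        return 3                               # punctuation followed by whitespace
    if a.isspace() or b.isspace():
        return 2                               # word boundary
    if not a.isalnum() or not b.isalnum():
        return 1                               # other punctuation
    return 0                                   # middle of a word


def best_shift(prev_equal, edit, next_equal):
    """Try every position the edit can slide to and return the best one."""
    # Slide the edit as far left as the matching characters allow.
    while prev_equal and edit and prev_equal[-1] == edit[-1]:
        ch = prev_equal[-1]
        prev_equal, edit, next_equal = prev_equal[:-1], ch + edit[:-1], ch + next_equal
    best = (prev_equal, edit, next_equal)
    best_score = boundary_score(prev_equal, edit) + boundary_score(edit, next_equal)
    # Step right one character at a time, remembering the best-scoring split.
    while edit and next_equal and edit[0] == next_equal[0]:
        ch = edit[0]
        prev_equal, edit, next_equal = prev_equal + ch, edit[1:] + ch, next_equal[1:]
        score = boundary_score(prev_equal, edit) + boundary_score(edit, next_equal)
        if score > best_score:
            best, best_score = (prev_equal, edit, next_equal), score
    return best


# Shortened form of the reported diff.
result = best_shift(
    "modified in the tail feathers. The ",
    "greater primary coverts are moulted in synchrony with the primary "
    "that they overlap. The ",
    "general pattern",
)
print(result)
```

With these made-up weights the chosen insertion starts at the second "The" (with its leading space) and ends at "overlap.", so both boundaries sit directly after a full stop. Different weights would pick a different position.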

Original comment by neil.fra...@gmail.com on 11 Nov 2011 at 5:47

GoogleCodeExporter commented 9 years ago
Revision 100 should resolve this issue.

Cool bug, thank you again!

Original comment by neil.fra...@gmail.com on 18 Nov 2011 at 12:15