ValorLin / google-diff-match-patch

Automatically exported from code.google.com/p/google-diff-match-patch
Apache License 2.0

Is it possible to do word by word comparison? #21

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This is a great piece of code. I have one concern about the comparison. Is it
possible to change the logic so that it does a word-by-word comparison?

I have written the following test code in Java:
========================================================================
String text1 = "Hello how are you. <br/> My name is Prathyusha. "
    + "Shravanthi is my friend.";
String text2 = "Hello. <br/>My name is Shravanthi. "
    + "Prathyusha was my friend.";
diff_match_patch diff = new diff_match_patch(); 
LinkedList<diff_match_patch.Diff> diffs = diff.diff_main(text1, text2);
diff.diff_cleanupSemantic(diffs);
String result = diff.diff_prettyHtml(diffs);
System.out.println(result);
=====================================================================

I see the following output
--------------------------------------------------------
Hello<DEL> how are you</DEL>. <br/><DEL> </DEL>My name is <DEL>Prathyusha. 
Shravanthi i</DEL><INS>Shravanthi. Prathyusha wa</INS>s my friend.
--------------------------------------------------------

As the output shows, the logic doesn't break the differences into logical
words. Rather, it compares chunks of the string. A word-by-word comparison
would help in getting a precise count of the newly added words and the
deleted words. Additionally, if we check the deleted
<DEL>Prathyusha. Shravanthi i</DEL> text and the inserted <INS>Shravanthi.
Prathyusha wa</INS> text, the middle chars ' ' and '.' are common. They
should not be considered as deleted and inserted. This problem would not
have arisen with a word-by-word comparison. The overall word count is
increased by 4, counting ' ' and '.' as 2 words.
Is it possible to do a word-by-word comparison?

Regards,
Pratap

Original issue reported on code.google.com by pratapma...@yahoo.com on 4 Jul 2009 at 1:51

GoogleCodeExporter commented 9 years ago
Word-by-word diffing is not a native function of this library.  However, it is
easy to do.  You need to break your text into words (how you define a word is a
more interesting problem than you might think), create a lookup table of Unicode
characters to words, build two strings made up of the Unicode characters
associated with each word, diff those two strings, then convert the diff back
into the text.

Sounds complicated, but it's not -- because the code has already been written
for you.  Just look at the diff_linesToChars and diff_charsToLines functions.
Copy them and make them split on words instead of lines.  Then your code will
just be:

  Object b[] = diff_wordsToChars(text1, text2);
  String wordText1 = (String) b[0];
  String wordText2 = (String) b[1];
  wordarray = (ArrayList<String>) b[2];
  LinkedList<Diff> diffs = diff_main(wordText1, wordText2, false);
  diff_charsToWords(diffs, wordarray);

Have fun defining what a "word" is.  Been there, done that on another project.  :)

Original comment by neil.fra...@gmail.com on 4 Jul 2009 at 2:24
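
The steps described in the comment above (split into words, map each distinct
word to one character, build the encoded strings, decode afterwards) can be
sketched roughly as follows. This is a minimal, standalone illustration and not
code from the library: the class name WordCodec and its methods are
hypothetical, and it assumes a "word" ends at the next space or newline, with
the delimiter kept attached to the word.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of diff_match_patch. */
final class WordCodec {
    // Index 0 is a placeholder so that the char '\0' is never used as a code.
    private final List<String> wordArray = new ArrayList<>();
    private final Map<String, Integer> wordHash = new HashMap<>();

    WordCodec() {
        wordArray.add("");
    }

    /** Encodes text as one char per word; use the same codec for both texts. */
    String wordsToChars(String text) {
        StringBuilder chars = new StringBuilder();
        int wordStart = 0;
        while (wordStart < text.length()) {
            // Assumption: a word runs up to and including the next space or newline.
            int wordEnd = wordStart;
            while (wordEnd < text.length() - 1
                    && text.charAt(wordEnd) != ' ' && text.charAt(wordEnd) != '\n') {
                wordEnd++;
            }
            String word = text.substring(wordStart, wordEnd + 1);
            wordStart = wordEnd + 1;

            Integer code = wordHash.get(word);
            if (code == null) {
                // A Java char can only address 65536 distinct entries.
                if (wordArray.size() > Character.MAX_VALUE) {
                    throw new IllegalStateException("more than 65535 distinct words");
                }
                code = wordArray.size();
                wordArray.add(word);
                wordHash.put(word, code);
            }
            chars.append((char) code.intValue());
        }
        return chars.toString();
    }

    /** Expands an encoded string back into the words it stands for. */
    String charsToWords(String chars) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < chars.length(); i++) {
            text.append(wordArray.get(chars.charAt(i)));
        }
        return text.toString();
    }
}
```

Both texts would be encoded with the same WordCodec instance, the two encoded
strings diffed with diff_main as in the snippet above, and the text of each
resulting diff passed through charsToWords. Note that with this word
definition "Prathyusha." and "Prathyusha" are different words, because the
period stays attached.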

GoogleCodeExporter commented 9 years ago
Thanks for your input Neil. I will try that.

Original comment by pratapma...@yahoo.com on 6 Jul 2009 at 5:24

GoogleCodeExporter commented 9 years ago
I need this enhancement too:)

Original comment by chunhaic...@gmail.com on 17 Apr 2010 at 12:31

GoogleCodeExporter commented 9 years ago
I implemented the word-by-word diff (yes, it was easy), and it does a pretty
good job just tokenizing on spaces and newlines.
Something like:
  int wordEndSpace = text.indexOf(' ', wordStart);
  int wordEndNewline = text.indexOf('\n', wordStart);
  // indexOf returns -1 on a miss, so treat a miss as "end of text".
  if (wordEndSpace == -1) wordEndSpace = text.length();
  if (wordEndNewline == -1) wordEndNewline = text.length();
  int wordEnd = Math.min(wordEndSpace, wordEndNewline);
That does the trick effectively. You could of course do a nicer array
version if you have more delimiters to match (punctuation etc.), or perhaps
use a regexp as well.
I guess the reason the simple version works well for me is that the text is
preprocessed (from HTML) in a rather cool way, so whitespace in the text
matches the HTML rendering quite closely.

Thanks for a great little piece of code, Neil.

Regards,
Mads Buus Westmark

Original comment by madsbuus...@gmail.com on 20 Apr 2010 at 11:45
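
The regexp variant mentioned above could look something like the sketch
below. It is only one possible definition of a "word": runs of word
characters, runs of whitespace, and single punctuation characters each become
their own token, so that ' ' and '.' can match on their own instead of being
swallowed into a neighbouring word. The class name RegexTokenizer and the
pattern are assumptions made for illustration; \w in Java covers only ASCII
letters, digits and underscore unless Unicode flags are enabled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical tokenizer for illustration; not part of diff_match_patch. */
final class RegexTokenizer {
    // A token is a run of word chars, a run of whitespace,
    // or exactly one other character (punctuation, markup symbols, ...).
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\s+|[^\\w\\s]");

    /** Splits text into tokens; concatenating the tokens restores the text. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

Because every character falls into exactly one of the three alternatives, no
text is lost, and the token list can be fed into a words-to-chars mapping in
place of the space/newline split.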

GoogleCodeExporter commented 9 years ago
I need this feature also.

Original comment by g33.ad...@gmail.com on 25 Jun 2011 at 11:26

GoogleCodeExporter commented 9 years ago
Hi,
The diff_linesToChars function has LinesToCharsResult as its return type.
Are any changes required for diff_wordsToChars()?

  Object b[] = diff_wordsToChars(text1, text2);
  String wordText1 = (String) b[0];
  String wordText2 = (String) b[1];
  wordarray = (ArrayList<String>) b[2];
  LinkedList<Diff> diffs = diff_main(wordText1, wordText2, false);
  diff_charsToWords(diffs, wordarray);

Using the diff_match_patch class, we get a character comparison, not a word
comparison. What changes are required for a word comparison?

Thanks in advance

Original comment by monigov...@gmail.com on 17 Nov 2011 at 7:06
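
On the return-type question: in versions where diff_linesToChars returns a
LinesToCharsResult instead of an Object[], a word-based copy would presumably
return an analogous holder rather than an array that needs casting. The class
below is a hypothetical sketch of such a holder. Its name and fields are
assumptions modelled on the two encoded strings plus the lookup list described
earlier in this thread, not something the library provides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical result holder for a word-based diff_wordsToChars. */
final class WordsToCharsResult {
    final String chars1;          // text1 with each word replaced by one char
    final String chars2;          // text2 encoded with the same lookup table
    final List<String> wordArray; // char code -> original word

    WordsToCharsResult(String chars1, String chars2, List<String> wordArray) {
        this.chars1 = chars1;
        this.chars2 = chars2;
        // Defensive copy so later changes to the caller's list do not leak in.
        this.wordArray = Collections.unmodifiableList(new ArrayList<>(wordArray));
    }
}
```

With a holder like this, the calling code reads result.chars1, result.chars2
and result.wordArray directly, and the (String) and (ArrayList<String>) casts
from the snippet quoted above are no longer needed.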