Improved treatment of the visual document structure (Submitted by Carsten Pfeiffer)

GoogleCodeExporter commented 9 years ago

(From the mailing list)

Hi,

when comparing text, daisydiff currently disregards the text's structure.
Whether some text is inside a single paragraph or in different tables, a
headline and a block -- daisydiff doesn't care.

This leads to changes being flagged out of their original context, i.e. a
change done to a table cell will displayed in the headline above the table.

The reasons for that is, that DomTreeBuilder flattens all text nodes into a
simple list. RangeDifferencer (with the help of TextNodeComparator) then finds
the longest possible continuous list of changes, even though they should not
be continuous, because they are in a different parent block-element. When
visualizing the differences, all of them would then be inserted into the first
"common-parent", even though they all come from different parent blocks
elements.

I've solved this by letting DomTreeBuilder insert a dummy "separating" node
whenever a block starts or ends. This causes multiple RangeDifferences to be
created (i.e. one for all changes inside one block element) instead of one big
RangeDifference for all of them.

Additionally I improved the detection of "same" nodes in different DOM trees.
I also consider the node's position in the parent node, so that e.g. the
correct table cell (td) is found instead of the very first with the same
attributes.

Cheers,
Carsten

Original issue reported on code.google.com by kkape...@gmail.com on 16 Aug 2010 at 1:16

Attachments:

structure.patch

GoogleCodeExporter commented 9 years ago

Hi Carsten,

I've applied the patch but cannot see any improvements over the pre-patching 
output. Perhaps I had the wrong expectations about the outcome can you please 
advice? I've attached the two sources i'm using for comparison and the 
resulting output.

Original comment by hle.mich...@gmail.com on 19 Nov 2010 at 10:02

Attachments:

GoogleCodeExporter commented 9 years ago

Just to clarify the problem with the output on the previous comment. The diff 
for the Manufacturer and Description sections are scrambled together.

Original comment by hle.mich...@gmail.com on 19 Nov 2010 at 10:14

GoogleCodeExporter commented 9 years ago

Hi Michael,

sorry for coming back to you so late. I just checked your example and you are 
right, this scenario is not covered by my patch. Daisydiff doesn't know or care 
about sections or chapters like e.g. "Manufacturers" from your snippets. 
Daisydiff only sees two long list of text elements and tries to find and report 
as few differences in them as possible.

Daisydiff would need to learn about those chapters (h1..h6 for a start) and 
perform the differencing separately for the text nodes of those chapters.

I.e. DomTreeBuilder should not return a flat list of elements but some kind of 
tree of chapters, each with their own list of text nodes.

Ideally you would then create one TextNodeComparator per chapter and perform 
the diffing chapter-wise. But this needs some work, because this would lead to 
multiple outputs (you want to have just one single result of course).

Cheers,
Carsten

Original comment by pfeif...@kde.org on 14 Mar 2011 at 4:14

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Carsten,

Can you post an example or two, where one can see clearly what has changed 
before and after applying this patch? This should help people understand the 
changes and decided if the patch should be accepted or not.

Original comment by kkape...@gmail.com on 23 May 2011 at 2:40

GoogleCodeExporter commented 9 years ago

I applied the patch (which was tricky because it clashed with the performance 
fixes in 1.2) and I note that it fixes 
http://code.google.com/p/daisydiff/issues/detail?id=26 for me, without breaking 
anything else I could think of.  Awesome work Carsten.

Original comment by don.jp.w...@gmail.com on 15 Jun 2011 at 5:29

GoogleCodeExporter commented 9 years ago

Can you create then a new patch (that is based on 1.2) and attach it here?

Then I will apply this to trunk if you think that it should be part of 
DaisyDiff.

Original comment by kkape...@gmail.com on 15 Jun 2011 at 6:41

GoogleCodeExporter commented 9 years ago

Yes.  Will do.  I have a little more testing to do first.

Original comment by don.jp.w...@gmail.com on 15 Jun 2011 at 6:54

GoogleCodeExporter commented 9 years ago

Here's a simple testcase that my patch fixes. Note that the second table cell 
in the old.html file ("1XXX") must start with the same character with which the 
previous (first) cell ends ("1"). 

In two-way diff mode, daisydiff creates a new bogus column in the table.

In three-way diff mode it even puts the red removal marker into the first 
column, AFAIR.

Original comment by pfeif...@kde.org on 16 Jun 2011 at 10:49

Attachments:

GoogleCodeExporter commented 9 years ago

I will have no time for this in the coming weeks, so I took the time today, 
rebased the patch to trunk and committed it. I'm using it internally for almost 
a year and had no problems with it.

(There are more problems left, but unrelated to this issue).

Original comment by pfeif...@kde.org on 17 Jun 2011 at 3:57

GoogleCodeExporter commented 9 years ago

The SeparatingNode is missing a hashCode method.  Patch provided.

Original comment by don.jp.w...@gmail.com on 24 Jun 2011 at 7:22

Attachments:

hashCode_for_SeparatingNode.patch

GoogleCodeExporter commented 9 years ago

While the SeparatingNode seems to really improve the handling of tables, I'm 
not convinced that the check for paretn position in TagNode.isEquals is a good 
change.  It leads to a lot of cells claiming to be moved when really they've 
just had a row inserted before them.

Eg I find that removing that section improves things.  I'm attaching a patch 
that comments out the part I'm talking about so that people can discuss whether 
they like it or not.

Original comment by don.jp.w...@gmail.com on 24 Jun 2011 at 7:34

Attachments:

comment_out_position_check.patch

GoogleCodeExporter commented 9 years ago

Thanks Don. I committed your patch until I find a better solution. The problem 
(i.e. the reason for the position-check) manifests itself mostly in three-way 
diff, as I found out.

Original comment by pfeif...@kde.org on 4 Jul 2011 at 12:52

GoogleCodeExporter commented 9 years ago

Don,

after commenting out the "parent index" thing in TagNode, the testcases 
"broke", as expected. When I had a look at fixing them, it became obvious that 
it is currently too time consuming to validate them.

So here's my proposal to fix that:
- rename the file expected.xml to expected.html
- make them html markup instead of xml
- add an html header including a reference to the diff.css

That way, one can easily see the expected output (and also validate if the 
expected output make sense at all).

I will then regenerate all expected.html files with the current version. If 
you're fine with that, I will commit that tomorrow.

Original comment by pfeif...@kde.org on 4 Jul 2011 at 4:13

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Hi Carsten,

switching to html files sounds great to me.

Original comment by donny...@gmail.com on 5 Jul 2011 at 9:35

GoogleCodeExporter commented 9 years ago

Committed in r155. Please re-open if you encounter any issues.

Original comment by kkape...@gmail.com on 13 Sep 2011 at 9:05

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Hi,

I've may have found a fix for Carsten's position checking (which brings too 
many drawbacks with it). I simply added more separators by commenting out a 
couple of lines of code - a simple patch and I'm attaching it to this issue. I 
ran the new code against the file-based unit tests and while the code does 
perform worse in a few cases it is a huge improvement overall. I believe it 
also solves issue 44.

I went over all unit tests which have 'failed' after adding the new code and my 
tally goes like this:

Big improvements in Lists 22, 23, 35; General 5, 6, 35
Small improvements in General 9

Big regression in General 38
Slight regressions in Lists 6, 19 (bad HTML!), 20 (bad HTML!), 31; Paragraphs 
43; General 39 (though Lists 19 and 20 contain bad HTML, UL elements cannot be 
nested in other UL elements unless they are wrapped in a LI first - if I fix 
this the result with new code isn't bad)

Couldn't decide: Lists 4, 5, 17, 18; General 42

I would appreciate if somebody could take the time to review these tests. To do 
this, apply the patch, run 'ant test' and go to the 'testdata' directory to 
review the test results: failed tests will have the '_actual_result.html' file 
present in their directory (not that failure here simply means a different 
output, not a failure per se).

Original comment by gre...@gmail.com on 3 Sep 2012 at 8:33

Attachments:

more-separators.diff

GoogleCodeExporter commented 9 years ago

Hi Gregor,

thanks a bunch for your efforts. The improvements of your patch do indeed look 
good to me. The regressions in lists however are a bad. It would be awesome if 
we could find a way to get the improvements without the regressions :-)

Can you tell me how you fixed the badly nested ULs so that the results are OK? 
Adding a <li> in front of the nested <ul> changes the behavior (you get another 
bullet point). Even though it's not correct html to leave out the <li>, we have 
to support that, because it is quite common. For example the Mozilla HTML 
Commposer (the rich text editor in Thunderbird for example) creates such HTML.

Regarding issue 44 -- I couldn't reproduce that problem at all (i.e. plain 
1.2). The result was the same (perfect indentation) with and without your patch.

Thanks,
Carsten

Original comment by pfeif...@kde.org on 19 Sep 2012 at 8:58

PiRSquared17 / daisydiff

Improved treatment of the visual document structure (Submitted by Carsten Pfeiffer) #28