andrewdvsmith / daisydiff

Automatically exported from code.google.com/p/daisydiff

Daisy Diff Report output is different in table tags #8

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Compare the two attached HTML files (test1.html, test2.htm).
2. Each contains a table with 3 rows; the only difference is in the 3rd row.
3. The diff report nevertheless shows a deletion in the second row.

What is the expected output? What do you see instead?
Expected output in the diff report:
"hai1" should be in the 1st row, and it is.
"hai2" should be in the 2nd row, but it is shown in the 3rd row.
"hai3" should be in the 3rd row with strikeout, but it is shown in the 2nd row.
"hai4" is the new addition, and it is shown correctly.

What version of the product are you using? On what operating system?
The latest build, taken on 12 April 2009, on Windows Vista.

Please provide any additional information below.
The test HTML files and the DaisyDiff report are attached.

Original issue reported on code.google.com by arrama...@gmail.com on 12 Apr 2009 at 11:44

Attachments:

GoogleCodeExporter commented 8 years ago
Table comparison is almost unusable for any real table work.
The problem above is only the tip of the iceberg.
It is caused by a semantic comparison of the parents of the text nodes rather
than a real check of whether the parent is actually common.
In a big table many cell elements will be alike HTML-tag-wise. Probably a whole
column (or maybe many columns) will have cells with similar formatting. This
makes them impossible to tell apart using the current methods. In some cases it
might be impossible by any means.
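
To make the point above concrete, here is a minimal sketch. It is not DaisyDiff code; the class `TagPathDemo` and both methods are hypothetical and only illustrate why cells that share the same chain of parent tags cannot be told apart unless something positional is added.

```java
import java.util.List;

// Hypothetical illustration, not taken from DaisyDiff.
final class TagPathDemo {
    // Identifies a text node only by the tag names of its ancestors,
    // e.g. "table/tr/td". Every cell in a plain column gets the same path.
    static String tagPath(List<String> ancestors) {
        return String.join("/", ancestors);
    }

    // Same path, but with the row index attached to "tr" and the cell index
    // attached to "td", so cells in different rows produce different keys.
    static String positionalPath(List<String> ancestors, int row, int col) {
        StringBuilder sb = new StringBuilder();
        for (String tag : ancestors) {
            if (sb.length() > 0) {
                sb.append('/');
            }
            sb.append(tag);
            if (tag.equals("tr")) {
                sb.append('[').append(row).append(']');
            } else if (tag.equals("td")) {
                sb.append('[').append(col).append(']');
            }
        }
        return sb.toString();
    }
}
```

For the ancestors `table`, `tr`, `td`, `tagPath` gives `table/tr/td` for every cell, while `positionalPath` with row 2 and column 1 gives `table/tr[2]/td[1]`. Indexing by position is only one possible way to disambiguate, and it ignores `colspan` and `rowspan`.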
My company works with a lot of financial tables that are displayed in HTML.
During our research, this tool (DaisyDiff) was evaluated as the best HTML diff
tool currently available (not only among free tools).
However, even DaisyDiff doesn't handle intensive table editing well.
Besides the "wrong row" display originally reported here, we get invalid HTML in
some cases.
One of those cases is filed under ID 11, because it is not specific to tables.
Often the structure of a table becomes broken, because the "splitting" doesn't
take into account that the number of columns in the other rows should be
updated as well.
All in all, with real table changes the result of the comparison is not very
useful (see the attached files; note that the table in this sample is quite
small, and in reality we have to deal with much bigger tables).

For that reason I have an enhancement proposal for displaying table differences,
which I will try to implement if you don't have better suggestions about how to
handle this.
(See the next comment.)

Original comment by anastass...@businesswire.com on 14 Apr 2009 at 8:30

Attachments:

GoogleCodeExporter commented 8 years ago
Enhancement proposal: Table difference.
---------------------------------------
General overview (see some of the details in the attached document)

1. DaisyDiff handles "changed" modifications well, so the processing will only
differ if "removed" TextNodes are found within a table. (Because the result is
built out of the "new" version, the "added" TextNodes aren't actually moved
anywhere, so they are fine, and unless we find "removed" nodes within a table we
do not touch the "added" ones.) The check for being inside a table will happen
during "markAsDeleted" processing.

2. A global check for table differences happens:
    a. Do the tables have common content at all?
      -- If no common content is found, both tables are displayed one after
         the other, with 2 difference notes: "table was removed" and "table was
         added", or something like that.
    b. If the tables have common content, do they have the same number of
       rows/columns?
      -- If the number of rows/columns stayed the same, the difference result
         should decide where to put the changes based on the cell coordinates
         (not as simple as it sounds, due to the "colspan" and "rowspan"
         attributes). No new cells/rows will be added in this case.
    c. If the tables have common content but did not keep the same dimensions,
       the added/removed columns/rows will be displayed separately, which means
       the resulting table might have a different number of rows/columns than
       either of the originals.
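
The global check in step 2 could be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the attached patch: a table is modelled as a list of rows of cell text, "common content" means at least one identical non-empty cell text, and `colspan`/`rowspan` are ignored. The class `TableCheck` and its methods are hypothetical names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the a/b/c decision; not part of DaisyDiff.
final class TableCheck {
    enum Kind { NO_COMMON_CONTENT, SAME_DIMENSIONS, DIMENSIONS_CHANGED }

    // Collects the trimmed, non-empty cell texts of a table.
    static Set<String> texts(List<List<String>> table) {
        Set<String> out = new HashSet<>();
        for (List<String> row : table) {
            for (String cell : row) {
                String text = cell.trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            }
        }
        return out;
    }

    // Column count taken as the widest row (assumption: no spanned cells).
    static int columns(List<List<String>> table) {
        int max = 0;
        for (List<String> row : table) {
            max = Math.max(max, row.size());
        }
        return max;
    }

    static Kind classify(List<List<String>> oldTable, List<List<String>> newTable) {
        Set<String> common = texts(oldTable);
        common.retainAll(texts(newTable));
        if (common.isEmpty()) {
            return Kind.NO_COMMON_CONTENT;      // case a
        }
        boolean sameSize = oldTable.size() == newTable.size()
                && columns(oldTable) == columns(newTable);
        return sameSize
                ? Kind.SAME_DIMENSIONS          // case b
                : Kind.DIMENSIONS_CHANGED;      // case c
    }
}
```

With the tables from the original report (rows "hai1", "hai2", "hai3" against "hai1", "hai2", "hai4") this sketch lands in case b, so the change would be placed by cell coordinates and no row would be added or moved.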

Here is (approximately) how the differences will be listed (for the
"table_suggested_result.htm" file attached to the previous comment):
/*Global changes first*/
diff1: Table dimensions changed: 
  -- number of rows was increased from 6 to 9
  -- number of columns was reduced from 5 to 3
/*Descending to row level, sticking to row addition/removal.
  A row is considered "removed" if none of its text was kept.
  When an entire row is removed, no separate word-removal messages are listed*/
diff2: Row "Candy Sale" with attributes such-such was removed 
diff3: Row "Candy Sale $ZZZ.zz" with attributes such-such was added
diff4: Row "Total $ZZ.zz $ZZZ.zz" with attributes such-such was removed
diff5: Row "Dairy $...." with attributes such-such was added
diff6: Row "Milk $..." with attributes such-such was added
diff7: Row "Fruits/Veggies $..." with attributes such-such was added
diff8: Row "...etc..." with attributes such-such was removed
/*Going by columns if it can be determined that the whole column was removed.
  A column is considered removed if no text from any of its non-spanned cells was kept*/
diff9: Column "Last quarter $XX.xx $YY.yy $ZZ.zz" was removed.
diff10: Column with no text was removed - can this be determined?
/*Moving to individual cell differences*/
diff11: "Year" was added
diff12: "Result" was moved from cell with such-such attr, to cell with such-such attr
diff13: "Candy kind" was removed
diff14: "Category" was added
diff15: "2008" was moved from such HTML context to such HTML context
diff16: "total" was removed
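
The "removed" rules quoted in the comments of this listing could be expressed roughly as below. This is a sketch under assumptions, not the proposed implementation: text is compared per whole cell rather than per word, spanned cells are not modelled, and `RemovalRules` with its two methods is a hypothetical name.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the row/column removal rules; not DaisyDiff code.
final class RemovalRules {
    // A row counts as removed when none of its non-empty cell texts
    // survives anywhere in the new table.
    static boolean rowRemoved(List<String> oldRow, Set<String> newTexts) {
        for (String cell : oldRow) {
            String text = cell.trim();
            if (!text.isEmpty() && newTexts.contains(text)) {
                return false;
            }
        }
        return true;
    }

    // A column counts as removed when no non-empty text from any of its
    // cells survives in the new table. Rows shorter than the column index
    // are skipped.
    static boolean columnRemoved(List<List<String>> oldTable, int col,
                                 Set<String> newTexts) {
        for (List<String> row : oldTable) {
            if (col < row.size()) {
                String text = row.get(col).trim();
                if (!text.isEmpty() && newTexts.contains(text)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Note that a column with no text at all always passes `columnRemoved`, whether or not it was actually deleted, which is exactly the open question raised in diff10: text alone cannot decide that case.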

Original comment by anastass...@businesswire.com on 14 Apr 2009 at 10:02

Attachments:

GoogleCodeExporter commented 8 years ago
A fix for the case mentioned in the original (first) post in this thread is in
the attachments.

Original comment by anastass...@businesswire.com on 15 Apr 2009 at 6:26

Attachments:

GoogleCodeExporter commented 8 years ago
You are 100% correct that the current DaisyDiff is unable to handle complex
table changes such as the ones you describe. A better heuristic specific to
tables is needed to fix this. The heuristic you propose sounds very good.

I suggest I give you commit access so that you can apply these patches.

Original comment by guy...@gmail.com on 17 Apr 2009 at 10:04

GoogleCodeExporter commented 8 years ago

Original comment by anastass...@businesswire.com on 25 Apr 2009 at 12:23

GoogleCodeExporter commented 8 years ago
Having copied the two attached files (Node.java and TagNode.java) on top of the
1.2 trunk code, my output still does not seem quite right.

java -jar daisydiff.jar old.html new.html

Could you please help me solve the issue with this table diff?
I expect to get the same result as the *suggested*htm file, where the old TD
content is displayed struck through and the new content in green.

Original comment by nitin...@gmail.com on 1 May 2012 at 12:43

Attachments: