Open GoogleCodeExporter opened 9 years ago
Table comparison is almost unusable for any real table work.
The problem above is only the tip of the iceberg.
It is caused by a semantic comparison of the text nodes' parents rather than a real check of whether the parent is actually common.
In a big table many cell elements will be alike HTML-tag-wise. Probably a whole column (or maybe many columns) will have cells with similar formatting. This makes them impossible to tell apart using the current methods. In some cases it might be impossible by any means.
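To illustrate the point above, here is a minimal sketch (not DaisyDiff's actual code) of why a tag-only comparison of ancestor chains cannot tell two cells apart. The class `CellPathDemo`, both method names, and the idea of carrying row/column indices alongside the tag path are assumptions made for this example only.

```java
import java.util.List;

final class CellPathDemo {
    /**
     * Compares two ancestor chains by tag name only, e.g. [table, tr, td].
     * Every cell of a plain table produces the same chain, so this returns
     * true even for cells that sit in different rows.
     */
    static boolean sameParentsByTags(List<String> a, List<String> b) {
        return a.equals(b);
    }

    /**
     * Same comparison, but the (hypothetical) row and column index of each
     * cell must also match before the parents are treated as common.
     */
    static boolean sameParentsByPosition(List<String> a, int rowA, int colA,
                                         List<String> b, int rowB, int colB) {
        return a.equals(b) && rowA == rowB && colA == colB;
    }
}
```

With the tag-only check, the cell at row 0 and the cell at row 3 of the same column look identical; the positional variant separates them, although it still cannot help when rows themselves have been inserted or removed.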
My company works with a lot of financial tables that are displayed in HTML.
During our research, this tool (DaisyDiff) was evaluated as the best HTML diff tool among those currently available (not only among free tools).
However, even DaisyDiff doesn't digest intense table editing well.
Besides the "wrong row" display originally stated here, we get invalid HTML in some cases.
One of the cases is filed under ID 11, because it's not only for tables.
Often the structure of a table becomes broken, as the "splitting" doesn't take into account that the number of columns in the other rows should be updated as well.
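As a rough illustration of the kind of bookkeeping that is missing, the sketch below pads every row to the width of the widest row after a split. It models a table as a list of rows of cell text and ignores "colspan"/"rowspan"; the class `RowPadding` and the method `padToWidestRow` are hypothetical names, not part of DaisyDiff.

```java
import java.util.ArrayList;
import java.util.List;

final class RowPadding {
    /**
     * Returns a copy of the table in which every row has as many cells
     * as the widest row. Missing cells are filled with empty strings.
     * Assumption: one list element per cell, no spanned cells.
     */
    static List<List<String>> padToWidestRow(List<List<String>> rows) {
        int width = 0;
        for (List<String> row : rows) {
            width = Math.max(width, row.size());
        }
        List<List<String>> padded = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> copy = new ArrayList<>(row);
            while (copy.size() < width) {
                copy.add("");
            }
            padded.add(copy);
        }
        return padded;
    }
}
```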
All in all, with real table changes the result of the comparison is not very useful (see the attached files; notice that the table in this sample is quite small, and in reality we have to deal with much bigger tables).
For that reason I have an enhancement proposal for displaying table differences, which I will try to implement if you don't have better suggestions about how to handle this.
(see the next comment)
Original comment by anastass...@businesswire.com
on 14 Apr 2009 at 8:30
Attachments:
Enhancement proposal: Table difference.
---------------------------------------
General overview (see some of the details in the attached document)
1. DaisyDiff handles "changed" modifications well, so the processing will differ only if "removed" TextNodes are found within a table. (Because the result is built out of the "new" version, the "added" TextNodes aren't actually moved anywhere, so they are fine; unless we find "removed" nodes within a table, we do not touch the "added" ones.) The check for being inside a table will happen during "markAsDeleted" processing.
2. A global check for table differences happens:
a. Do the tables have common content at all?
-- If no common content is found, both tables are displayed one after another, with 2 difference notes: "table was removed" and "table was added", or something like that.
b. If the tables have common content, do they have the same number of rows/columns?
-- If the number of rows/columns stayed the same, the difference result should decide where to put the changes based on the cell coordinates (not as simple as it sounds, due to the "colspan" and "rowspan" attributes). No new cells/rows will be added in this case.
c. If the tables have common content but did not keep the same dimensions, the added/removed columns/rows will be displayed separately, which means the resulting table might have a different number of rows/columns than either of the originals.
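The three cases in step 2 could be sketched roughly as follows. This is only an illustration of the decision, not an implementation for DaisyDiff: tables are modelled as lists of rows of cell text, "common content" is taken to mean "at least one shared word", the column count is the size of the widest row, and "colspan"/"rowspan" are ignored. The names `TableDiffClassifier`, `Kind` and `classify` are invented for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TableDiffClassifier {
    enum Kind { NO_COMMON_CONTENT, SAME_DIMENSIONS, DIMENSIONS_CHANGED }

    /** Picks case a, b or c of the proposal for a pair of tables. */
    static Kind classify(List<List<String>> oldTable, List<List<String>> newTable) {
        // Case a: no word of the old table survives in the new one.
        Set<String> common = words(oldTable);
        common.retainAll(words(newTable));
        if (common.isEmpty()) {
            return Kind.NO_COMMON_CONTENT;
        }
        // Case b vs. case c: compare row and column counts.
        boolean sameSize = oldTable.size() == newTable.size()
                && columns(oldTable) == columns(newTable);
        return sameSize ? Kind.SAME_DIMENSIONS : Kind.DIMENSIONS_CHANGED;
    }

    /** Collects every whitespace-separated word found in any cell. */
    private static Set<String> words(List<List<String>> table) {
        Set<String> out = new HashSet<>();
        for (List<String> row : table) {
            for (String cell : row) {
                for (String word : cell.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.add(word);
                    }
                }
            }
        }
        return out;
    }

    /** Column count taken as the size of the widest row (spans ignored). */
    private static int columns(List<List<String>> table) {
        int width = 0;
        for (List<String> row : table) {
            width = Math.max(width, row.size());
        }
        return width;
    }
}
```

A real implementation would have to expand spanned cells into grid coordinates before counting columns, which is the part the proposal flags as "not as simple as it sounds".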
Here is (approximately) how the differences will be listed (for the "table_suggested_result.htm" file attached to the previous comment):
/*Global changes first*/
diff1: Table dimensions changed:
-- number of rows was increased from 6 to 9
-- number of columns was reduced from 5 to 3
/*Descending to row level, sticking to row addition/removal.
A row is considered "removed" if no text was kept.
With an entire row removal, no separate word removal messages are issued*/
diff2: Row "Candy Sale" with attributes such-such was removed
diff3: Row "Candy Sale $ZZZ.zz" with attributes such-such was added
diff4: Row "Total $ZZ.zz $ZZZ.zz" with attributes such-such was removed
diff5: Row "Dairy $...." with attributes such-such was added
diff6: Row "Milk $..." with attributes such-such was added
diff7: Row "Fruits/Veggies $..." with attributes such-such was added
diff8: Row "...etc..." with attributes such-such was removed
/*Going by columns if it can be determined that the whole column was removed.
The column is considered removed if no text from any of its non-spanned cells was kept*/
diff9: Column "Last quarter $XX.xx $YY.yy $ZZ.zz" was removed.
diff10: Column with no text was removed - can this be determined?
/*Moving to individual cell differences*/
diff11: "Year" was added
diff12: "Result" was moved from cell with such-such attr to cell with such-such attr
diff13: "Candy kind" was removed
diff14: "Category" was added
diff15: "2008" was moved from such HTML context to such HTML context
diff16: "total" was removed
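The "removed row" and "removed column" rules used in the listing above can be sketched like this. It is a simplified reading of the rules, under several assumptions: cells are plain strings, "kept" text is a set of words that also occur in the new table, spanned cells are not handled, and a row or column with no text at all is left undecided (returns false), since diff10 leaves that question open. `RemovalRules` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class RemovalRules {
    /** A row counts as removed if it had text and none of it was kept. */
    static boolean rowRemoved(List<String> oldRow, Set<String> keptWords) {
        return hasTextButNothingKept(oldRow, keptWords);
    }

    /** A column counts as removed if its cells had text and none of it was kept. */
    static boolean columnRemoved(List<List<String>> oldTable, int col,
                                 Set<String> keptWords) {
        List<String> cells = new ArrayList<>();
        for (List<String> row : oldTable) {
            if (col < row.size()) {
                cells.add(row.get(col));
            }
        }
        return hasTextButNothingKept(cells, keptWords);
    }

    /** True only if at least one word was seen and no word is in keptWords. */
    private static boolean hasTextButNothingKept(List<String> cells,
                                                 Set<String> keptWords) {
        boolean sawText = false;
        for (String cell : cells) {
            for (String word : cell.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                if (keptWords.contains(word)) {
                    return false; // some text survived, so not removed
                }
                sawText = true;
            }
        }
        return sawText;
    }
}
```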
Original comment by anastass...@businesswire.com
on 14 Apr 2009 at 10:02
Attachments:
A fix for the case mentioned in the original (first) post in this thread is in the attachments.
Original comment by anastass...@businesswire.com
on 15 Apr 2009 at 6:26
Attachments:
You are 100% correct that the current DaisyDiff is unable to handle complex table changes such as the ones you describe. A better heuristic specific to tables is needed to fix this. The heuristic you propose sounds very good.
Shall I give you commit access so that you can apply these patches?
Original comment by guy...@gmail.com
on 17 Apr 2009 at 10:04
Original comment by anastass...@businesswire.com
on 25 Apr 2009 at 12:23
Having copied the two attached files (Node.java and TagNode.java) on top of the 1.2 trunk code, my output still does not seem quite right.
java -jar daisydiff.jar old.html new.html
Could you please help me solve the issue with this table diff?
I expected to get the same result as the *suggested*htm file, where the old TD content is displayed struck through and the new content in green.
Original comment by nitin...@gmail.com
on 1 May 2012 at 12:43
Attachments:
Original issue reported on code.google.com by arrama...@gmail.com
on 12 Apr 2009 at 11:44
Attachments: