Rohland / htmldiff.net

Html Diff algorithm for .NET
MIT License

html table comparison issues #39

Open frostless opened 4 years ago

frostless commented 4 years ago

Firstly, this is a great library; it fits our dev requirements natively and it does make my life easier. However, the algorithm it uses does not seem to handle complex HTML structures (in my case, tables) very well. During internal testing I came across quite a few instances where the diff shown on a table does not make sense. Problems include, but are not limited to:

  1. the deleted cell is shown on the wrong row
  2. some existing CSS has been deleted
  3. the header of the table becomes shorter in length
  4. the table layout breaks

I am not sure whether there are any parameters or tuning options I can use to diff tables better (I see quite a few similar issues still open in this repo). Only limited control seems to be exposed to clients, such as OrphanMatchThreshold; while changing it makes a difference for text, it does not seem to change the table output at all.

Here is a simple example. The original table [image: b4]. The HTML:

<table id="OutputIssues" class="table" style="width: 645px; border-collapse: collapse;" border="1" rules="all"
    cellspacing="0" cellpadding="3">
    <thead>
        <tr>
            <td style="width: 355px;" align="left"><span style="text-decoration: underline;">New header name</span></td>
            <td style="width: 50px;" align="center"><span
                    style="font-family: arial, helvetica, sans-serif;">Rating</span></td>
            <td style="width: 90px; white-space: nowrap;" align="left">&nbsp;</td>
            <td style="width: 50px;" align="center"><span style="font-size: 10pt;">Status</span></td>
        </tr>
    </thead>
    <tbody>
        <tr style="border-bottom: none;" valign="top">
            <td>TAC: Evidence against 4 key items <br />
                <div style="margin-top: 1em;"><span style="text-decoration: underline;"><strong>Actions:</strong></span>
                </div>
            </td>
            <td align="center">Med</td>
            <td>&nbsp;</td>
            <td style="white-space: nowrap;" align="center"><span
                    id="ctl05_OutputIssues_ctl06_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Fair price, value for money modifications</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl07_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Increased client satisfaction</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl08_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Reducing interim accommodation costs</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl09_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Reducing the end to end process time</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl10_lblActionStatus">Open</span></td>
        </tr>
        <tr valign="top">
            <td>TAC: Document examples - Project Plan Acceptance form</td>
            <td align="center">Med</td>
            <td>&nbsp;</td>
            <td style="white-space: nowrap;" align="center"><span
                    id="ctl05_OutputIssues_ctl18_lblActionStatus">Open</span></td>
        </tr>
        <tr valign="top">
            <td>TAC: Need some idea of fees to be charged for example project</td>
            <td align="center">Med</td>
            <td>&nbsp;</td>
            <td style="white-space: nowrap;" align="center"><span
                    id="ctl05_OutputIssues_ctl19_lblActionStatus">Open</span></td>
        </tr>
    </tbody>
</table>

After deleting one row [image: fter]. The HTML:

<table id="OutputIssues" class="table" style="width: 645px; border-collapse: collapse;" border="1" rules="all"
    cellspacing="0" cellpadding="3">
    <thead>
        <tr>
            <td style="width: 355px;" align="left"><span style="text-decoration: underline;">New header name</span></td>
            <td style="width: 50px;" align="center"><span
                    style="font-family: arial, helvetica, sans-serif;">Rating</span></td>
            <td style="width: 90px; white-space: nowrap;" align="left">&nbsp;</td>
            <td style="width: 50px;" align="center"><span style="font-size: 10pt;">Status</span></td>
        </tr>
    </thead>
    <tbody>
        <tr style="border-bottom: none;" valign="top">
            <td>TAC: Evidence against 4 key items <br />
                <div style="margin-top: 1em;"><span style="text-decoration: underline;"><strong>Actions:</strong></span>
                </div>
            </td>
            <td align="center">Med</td>
            <td>&nbsp;</td>
            <td style="white-space: nowrap;" align="center"><span
                    id="ctl05_OutputIssues_ctl06_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Fair price, value for money modifications</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl07_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Increased client satisfaction</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl08_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Reducing interim accommodation costs</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl09_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Reducing the end to end process time</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl10_lblActionStatus">Open</span></td>
        </tr>
        <tr valign="top">
            <td>TAC: Document examples - Project Plan Acceptance form</td>
            <td align="center">Med</td>
            <td>&nbsp;</td>
            <td style="white-space: nowrap;" align="center"><span
                    id="ctl05_OutputIssues_ctl18_lblActionStatus">Open</span></td>
        </tr>
        <tr valign="top">
            <td>TAC: Need some idea of fees to be charged for example project</td>
            <td align="center">Med</td>
            <td>&nbsp;</td>
            <td style="white-space: nowrap;" align="center"><span
                    id="ctl05_OutputIssues_ctl19_lblActionStatus">Open</span></td>
        </tr>
    </tbody>
</table>

The diff [image: diff]. The diff HTML:

<table id="OutputIssues" class="table" style="width: 645px; border-collapse: collapse;" border="1" rules="all"
    cellspacing="0" cellpadding="3">
    <thead>
        <tr>
            <td style="width: 355px;" align="left"><span style="text-decoration: underline;">New header
                    name</span></td>
            <td style="width: 50px;" align="center"><span
                    style="font-family: arial, helvetica, sans-serif;">Rating</span></td>
            <td style="width: 90px; white-space: nowrap;" align="left">&nbsp;</td>
            <td style="width: 50px;" align="center"><span style="font-size: 10pt;">Status</span></td>
        </tr>
    </thead>
    <tbody>
        <tr style="border-bottom: none;" valign="top">
            <td>TAC: Evidence against 4 key items <br>
                <div style="margin-top: 1em;"><span style="text-decoration: underline;"><strong>Actions:</strong></span>
                </div>
            </td>
            <td align="center">Med</td>
            <td>&nbsp;</td>
            <td style="white-space: nowrap;" align="center"><span
                    id="ctl05_OutputIssues_ctl06_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Fair price, value for money modifications</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl07_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Increased client satisfaction</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl08_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Reducing interim accommodation costs</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl09_lblActionStatus">Open</span></td>
        </tr>
        <tr style="border-top: none; border-bottom: none;" valign="top">
            <td style="padding-top: 0px; padding-bottom: 3px;">
                <ul style="margin-top: 0px; margin-bottom: 0px;">
                    <li>Reducing the end to end process time</li>
                </ul>
            </td>
            <td style="padding-top: 0px; padding-bottom: 3px;" align="center">&nbsp;</td>
            <td style="padding-top: 0px; padding-bottom: 3px;">&nbsp;</td>
            <td style="white-space: nowrap; padding-top: 0px; padding-bottom: 3px;" align="center"><span
                    id="ctl05_OutputIssues_ctl10_lblActionStatus">Open</span></td>
        </tr>
        <tr valign="top">
            <td>TAC: <del class='diffdel'>Document examples - Project Plan Acceptance form</del></td>
            <td align="center"><del class='diffdel'>Med</del></td>
            <td><del class='diffdel'>&nbsp;</del></td>
            <td style="white-space: nowrap;" align="center"><span id="ctl05_OutputIssues_ctl18_lblActionStatus"><del
                        class='diffdel'>Open</del></span></td>
        </tr>
        <tr valign="top">
            <td><del class='diffdel'>TAC: </del>Need some idea of fees to be charged for example project</td>
            <td align="center">Med</td>
            <td>&nbsp;</td>
            <td style="white-space: nowrap;" align="center"><span
                    id="ctl05_OutputIssues_ctl19_lblActionStatus">Open</span></td>
        </tr>
    </tbody>
</table>

One of the highlighted diff cells is shown on the wrong row (the row next to the one actually deleted).
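
One plausible explanation for the misplaced highlight is that a diff working on a flat stream of words has no notion of row boundaries: the leading "TAC:" of the surviving row gets absorbed into the long run of unchanged text before it, so the deletion is reported as starting one token late and ending one token late. The sketch below reproduces that effect with Python's `difflib` on simplified data (cell texts only, no markup), and contrasts it with a diff that treats each row as one unit. It is only an illustration under those assumptions; it is not htmldiff.net's implementation, and the helper names are made up for the example.

```python
from difflib import SequenceMatcher


def deleted_spans(old, new):
    """Return the slices of `old` that a longest-match diff marks as removed."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [old[i1:i2] for tag, i1, i2, _, _ in matcher.get_opcodes()
            if tag in ("delete", "replace")]


def flatten(rows):
    """Flatten rows into one word stream, as a structure-unaware diff sees them."""
    return [word for row in rows for cell in row for word in cell.split()]


# Simplified rows: each row is a tuple of cell texts (hypothetical sample data).
old_rows = [
    ("Header", "Rating", "Status"),
    ("TAC: Evidence", "Med", "Open"),
    ("TAC: Document examples", "Med", "Open"),
    ("TAC: Need fees", "Med", "Open"),
]
# The "Document examples" row is removed in the new version.
new_rows = old_rows[:2] + old_rows[3:]

# Word-level diff: the deletion swallows the next row's "TAC:".
flat_deleted = deleted_spans(flatten(old_rows), flatten(new_rows))
# Row-level diff: exactly one whole row is reported as deleted.
row_deleted = deleted_spans(old_rows, new_rows)

print(flat_deleted)
print(row_deleted)
```

In the word-level result the deleted span ends with the "TAC:" that belongs to the following row, which matches the shape of the diff HTML above; the row-level result contains only the removed row.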

Is there any way to deal with situations like this? The layout tends to break when rows or columns change (addition, deletion, or a combination). Some suggestions:

  1. If we cannot always do a line-by-line diff correctly, can we do the diff at the whole-table level?
  2. Can we just skip the table? I am actually going to implement the skip logic myself, using an HTML parser to ignore the table.
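
Both suggestions can be approximated outside the diff library by masking tables before diffing. The following is a minimal sketch, assuming tables are not nested and that a placeholder such as `[[TABLE_0]]` passes through the diff step unchanged; the regex, the placeholder format, and the function names are all hypothetical, and a real HTML parser would be more robust than the pattern used here.

```python
import re

# Naive pattern: assumes tables are not nested inside other tables.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def mask_tables(html):
    """Replace each <table>...</table> with a placeholder; return (masked, tables)."""
    tables = []

    def stash(match):
        tables.append(match.group(0))
        return "[[TABLE_{}]]".format(len(tables) - 1)

    return TABLE_RE.sub(stash, html), tables


def changed_tables(old_html, new_html):
    """Compare tables as whole units; return indexes of tables whose markup differs."""
    _, old_tables = mask_tables(old_html)
    _, new_tables = mask_tables(new_html)
    if len(old_tables) != len(new_tables):
        raise ValueError("table count changed; pairing by position is not possible")
    return [i for i, (a, b) in enumerate(zip(old_tables, new_tables)) if a != b]


old_doc = "<p>Intro</p><table><tr><td>A</td></tr><tr><td>B</td></tr></table><p>End</p>"
new_doc = "<p>Intro text</p><table><tr><td>B</td></tr></table><p>End</p>"

masked_old, _ = mask_tables(old_doc)
masked_new, _ = mask_tables(new_doc)
print(masked_old)   # table replaced by a placeholder
print(masked_new)
print(changed_tables(old_doc, new_doc))
```

The masked strings could then be passed to the diff as usual, with each placeholder swapped back afterwards for either the untouched new table or a "table changed" marker, so the table markup itself is never split by inserted or deleted tags.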

mfaizan24 commented 2 years ago

Did you manage to fix this?