fdintino / xydiff

GNU Lesser General Public License v2.1
20 stars 10 forks source link

NodesManager::FullBottomUp #4

Open pedro93 opened 8 years ago

pedro93 commented 8 years ago

Is this function https://github.com/fdintino/xydiff/blob/master/src/Diff_NodesManager.cpp#L626

If a matched has been found https://github.com/fdintino/xydiff/blob/master/src/Diff_NodesManager.cpp#L652 due to unique id attribute, the function will attempt to match to all parent nodes of that element until the root of the documents. This leads to uneven matches, for example:

<!DOCTYPE html>
<html>
    <head>
        <title> Registo de Encomendas </title>
        <meta charset="utf-8" />
        <link rel="stylesheet" href="style.css" />
    </head>
    <body>
        <div class="main_card_block">
            <div class="main_card_block_upper_cut"> </div>
            <div class="main_card_clip">
                <div class="back_clip">
                    <div class="front_clip">
                        <div class="ring_clip"> </div>
                    </div>
                </div>
            </div>
            <div class="main_card_title">
                <h1> Recepção de encomendas </h1>
                <h3> Por favor, autentifique-se. </h3>
            </div><!---->
            <div class="main_card_loginform">
                <form class="main_card_form" id="login_form" >
                    <input type="text" placeholder="username"/>
                    <input type="text" placeholder="email" />
                    <input type="password" placeholder="palavra passe" />
                    <button class="login_bt" id="login">entrar</button>
                </form>
            </div>
        </div>
    </body>
</html>

tested agaisnt

<!DOCTYPE html>
<html manifest="something">
    <head>
        <title> Registo de Encomendas </title>
        <meta charset="utf-8" />
        <link rel="stylesheet" href="style.css" />
    </head>
    <body>
        <div class="main_card_block">
            <div class="main_card_block_upper_cut"> </div>
            <div class="main_card_clip">
                <div class="back_clip">
                    <div class="front_clip">
                        <div class="ring_clip"> </div>
                    </div>
                </div>
            </div>
            <div class="main_card_title">
                <h1> Recepção de encomendas </h1>
                <h3> Por favor, autentifique-se. </h3>
            </div>
            <div class="main_card_loginform">
                <form class="main_card_form" >
                    <input type="text" placeholder="email" />
                    <input type="password" placeholder="palavra passe" />
                    <button class="login_bt" id="login">entrar</button>
                    <input type="text" placeholder="username"/>
                </form>
            </div>
        </div>
    </body>
</html>

The algorithm will correctly match the node but because the trees are of different sizes, doing the parent matching will lead to a match between the body element of the first document to the html element of the second document. Because of #3, the current implementation does not detect the id attribute as unique, hence no match will be done here. Does this not lead to an incorrect diff output?

I know this project is old and hard to understand, im sorry for this inconvenience and thank you very much for taking the time to read this.

fdintino commented 8 years ago

No need to apologize. I uploaded this project to git precisely with the hope that it would be useful to others, so I'm glad that you were able to benefit from some of my work.

I did not write all, or even most, of this library. That credit goes to a few French computer scientists in Inria about 15 years ago. My contribution to their efforts was to add some (relatively crude) text diffing functions and get the project to compile successfully on a modern C++ toolchain, along with patching a few memory leaks and making general bug fixes. At the time I only superficially dove into the underlying algorithm, though I do have a fair handle on it.

There are two issues here. The first is the code that I introduced that you mentioned in #3. But that is not the only reason it does not detect the id attribute as unique and use that to its advantage. The source code claims to use the DTD to extract unique ids, but this functionality was never actually implemented: the original code from Inria had this functionality commented out. It's partially my fault that I did not take the time to see why it was commented out and whether it could be fixed, but in any case it's not relevant to your use case. There is no DTD for HTML5 and there never will be. DTDs are insufficiently expressive to be able to validate even the strictest XHTML5 polyglot markup. Some people have gone through the trouble of writing an XML Schema for XHTML5 though I can't comment on how good it is.

Briefly skimming the Xerces-C++ documentation and code, it looks like it is possible to extract uniqueness constraint information from an XML Schema, and that could theoretically be used to produce better diffs. For that matter, the structure, even neglecting the uniqueness constraints, would vastly improve the diffing process: In a valid XHTML5 document there can be only one <html> <head>, or <body> element, so those are unambiguously the same elements in two documents.

All that said, I don't know that I will ever have the time to implement such extensive changes to improve the algorithm. Are you able to compile the source? If not, I could make a build for you that naively assumes that all attributes named id are unique throughout the document. Obviously it wouldn't be portable to all documents (I'm thinking specifically of atom feeds which can have embedded html markup, and so it's possible for the same id to occur several times in the document), but it would work fine for your purposes. That is something I definitely would have the time to do, if it would be helpful.

pedro93 commented 8 years ago

Thank you very much for the quick reply! I had thought that this project and the underlying algorithm was developed by you, in either case I greatly appreciate your help. I am able to correctly compile the project and have done so to add logs in order to understand how the project works, my only hindrance is my lack of knowledge of the Xerces-C++ library.

I will just have to assume certain cases during my port. If at any time you feel inclined to improve on this library let me know, I still believe this library is a treasure which should be maintained and updated.

You may close this issue, the function itself logical-wise works, my problem seems to be with the underlying XML parser or the registerSubTree function as mentioned in #3.