Shoobx / xmldiff

A library and command line utility for diffing xml
MIT License
204 stars 52 forks source link

Unbelievably slow or ineffecient #119

Closed otheus closed 1 year ago

otheus commented 1 year ago

Greetings. I just installed this tool (v2.6.3) with Python 3.11.4. I tried to compare two xml files around 900 kB in size. It needed nearly 9 minutes of ~99% CPU on core i5 3.4 GHz. This seems a bit absurd.

I tried looking up the algorithm you based your code on. The link to your paper in the documentation is dead.(http://ilpubs.stanford.edu:8090/115/1/1995-46.pdf). I was able to read the paper from here: A DOI based link is here: https://dl.acm.org/doi/abs/10.1145/235968.233366

A more recent version and review of prior algorithms can be found at https://iopscience.iop.org/article/10.1088/1757-899X/1022/1/012054 . Let me know if you need access to the PDF.

Question: does "delete" mean a deletion of a node from file1 or from file2? It would be nice if the command-line clarified that.

Many thanks.

regebro commented 1 year ago

Diffing XML isn't a trivial problem, and available papers are generally about how to generate an accurate diff, not a fast one.

If we guess your file has an average of 100 characters per node, you would have 9000 nodes in it. In a naive implementation a diff of that would require a minimum of 40 million node comparisons, each node comparison needing to take into account multiple aspects of the node. It's not a quick task.

xmldiff employes many different techniques to lower the amount of comparisons done. That's where possible speedups lie, not in changes to the algorithm.

If you think you can significantly speed up xmldiff you are welcome to try.

regebro commented 1 year ago

Delete means that a node in file1 does not exist in file2.