Shoobx / xmldiff

A library and command line utility for diffing xml
MIT License
203 stars 53 forks source link

Why doesn't a best_match() exist? #26

Closed AuckeBos closed 1 year ago

AuckeBos commented 6 years ago

I was wondering, why isn't some sort of best_match() implemented? I've seen some cases where match() doesn't output the desired matches, as its focussed on finding the best matches for lnodes, and not the best matches in general. For example, take these two trees:

<html>
   <div id="el1">
      First element with some text
   </div>
   <div id="el2">
      Second element with some text
   </div>
</html>

and

<html>
   <div id="el3">
      Second element with some text
   </div>
</html>

I've added id attributes to refer to the elements. It could happen that el1 is matched to el3 in the first iteration of the loop in match(). After that, el3 is removed from rnodes, such that el2 can't match to it any more. If you'd implement something like the stable marriage problem, el2 and el3 would be matched, and el1 would eventually be marked as deleted. The matcher would namely see that even if el1 and el3 are good matches, el3 would prefer el2 over el1. When el2 is considered, el1 would be unmatched to el3, and el2 and el3 are matched. You could then make some kind of queue with unmatched nodes which you can alter during the iteration. When el1 is unmatched, you add it to the queue and give it another shot in a later iteration.

regebro commented 6 years ago

Sure, this is possible to implement. It will slow down the matching a lot, though, since you need to match not just each node with all unmatched nodes, but you need to compare every node with every node. But as an option it would be nice.

regebro commented 1 year ago

Version 2.6b1 has been released with an experimental --best-match argument.