Shoobx / xmldiff

A library and command line utility for diffing xml
MIT License

XMLFormatter updating text #112

Closed: Tomp0801 closed this issue 10 months ago

Tomp0801 commented 1 year ago

Is there a way to get the XMLFormatter to mark text as updated instead of deleting old and inserting new text? Here is an example:

from xmldiff import main, formatting

left_text = '<document><p>old text</p></document>'
right_text = '<document><p>new text</p></document>'
main.diff_texts(left_text, right_text, formatter=formatting.XMLFormatter())

For the p node, it outputs:

<p><diff:delete>old</diff:delete><diff:insert>new</diff:insert> text</p>

Instead, I would like to get something like:

<p><diff:update old_text="old">new</diff:update> text</p>

or:

<p diff:update-text-in old_text="old text">new text</p>

I know that the edit script is capable of creating text-update diffs. Is it also possible with the XMLFormatter?

regebro commented 1 year ago

No, currently it will diff the text and show both the old and new text so you can see the difference.

I would accept a pull request to add a setting to change that.

Could you explain the use case a bit more?

Tomp0801 commented 1 year ago

Basically, I am marking changes: deleted text is red, inserted text is green, and updated text should be blue, with a tooltip showing the old text when hovering over it. Currently, both the deleted and the inserted text are shown.

I don't think I can use the edit script, because I don't want to actually delete nodes. If I skip the delete steps, the XPaths in the remaining steps are no longer correct.

I'll take a look at your source code, I haven't done that yet.

regebro commented 1 year ago

Ah, yes, for the tooltip the syntax you suggested would be helpful.

I think that another option for the XMLFormatter, or a new class of Formatter subclassing the XMLFormatter, could do this by changing the tags inserted by _make_diff_tags().
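As a rough illustration of the tooltip-friendly markup, here is a sketch that takes a different route from the subclassing idea above: it post-processes the formatter's output with the standard library's ElementTree rather than changing _make_diff_tags(). The helper name mark_updates is hypothetical, the namespace URI is an assumption that should be checked against the real output, and only a delete immediately followed by an insert is handled.

```python
import xml.etree.ElementTree as ET

# Assumption: the URI bound to the "diff" prefix in the formatter's output.
DIFF_NS = "http://namespaces.shoobx.com/diff"
ET.register_namespace("diff", DIFF_NS)

DELETE_TAG = f"{{{DIFF_NS}}}delete"
INSERT_TAG = f"{{{DIFF_NS}}}insert"
UPDATE_TAG = f"{{{DIFF_NS}}}update"  # hypothetical tag, as proposed in this thread


def mark_updates(root):
    """Turn <diff:delete>X</diff:delete><diff:insert>Y</diff:insert> into
    <diff:update old_text="X">Y</diff:update>, modifying the tree in place."""
    # Snapshot the elements first so removals do not disturb the traversal.
    for parent in list(root.iter()):
        children = list(parent)
        for first, second in zip(children, children[1:]):
            adjacent = not first.tail  # no text between the two tags
            if first.tag == DELETE_TAG and second.tag == INSERT_TAG and adjacent:
                second.tag = UPDATE_TAG
                second.set("old_text", first.text or "")
                parent.remove(first)
    return root


SAMPLE = (
    f'<document xmlns:diff="{DIFF_NS}"><p>'
    "<diff:delete>old</diff:delete><diff:insert>new</diff:insert> text"
    "</p></document>"
)

root = mark_updates(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

Doing this inside the formatter would avoid a second pass over the tree, but the post-processing version does not depend on any private method.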

Tomp0801 commented 1 year ago

I've started looking into it. I saw that in diff_match_patch.py and formatting.py, the changes are passed as a tuple (CHANGE_CODE, text). An update needs two texts, though, so I am not sure how best to handle it. I see three options: a third element in the tuple, (DIFF_UPDATE, text_new, text_old); joining both texts into one string and separating them again later; or a simple class that holds the data.
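To make the "simple class" option concrete, here is a minimal sketch using typing.NamedTuple. The names TextChange and DIFF_UPDATE are hypothetical and not part of xmldiff or diff_match_patch; the other three codes follow the diff_match_patch convention of -1, 0 and 1.

```python
from typing import NamedTuple, Optional

# diff_match_patch-style operation codes; DIFF_UPDATE is a hypothetical addition.
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT, DIFF_UPDATE = -1, 0, 1, 2


class TextChange(NamedTuple):
    """One text change; old_text is only set for updates."""
    op: int
    text: str
    old_text: Optional[str] = None


change = TextChange(DIFF_UPDATE, "new", "old")
plain = TextChange(DIFF_EQUAL, " text")

print(change.op, change.text, change.old_text)
print(plain.op, plain.text, plain.old_text)
```

One caveat of this design: code that unpacks changes as `op, text = change` would fail on the three-field tuple, so callers would need attribute access or `change[:2]`.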

Maybe you have a suggestion.

regebro commented 1 year ago

The text differ compares two texts, removes the bits that should not be there, and inserts the bits that aren't there. An update is just a delete followed by an insert (or possibly an insert followed by a delete), so the formatter could just treat those as updates.

There aren't many places in the diff that specifically create a delete followed by an insert, so if we relied on that, you wouldn't get many updates.
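The idea of treating an adjacent delete and insert as one update can be sketched on a plain list of (op, text) tuples. This is an illustration only, not the code from the pull request; merge_updates and DIFF_UPDATE are hypothetical names, and the input is assumed to use the diff_match_patch codes -1, 0 and 1.

```python
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1
DIFF_UPDATE = 2  # hypothetical code for a merged delete/insert pair


def merge_updates(diffs):
    """Collapse delete+insert (in either order) into (DIFF_UPDATE, new, old)."""
    merged = []
    i = 0
    while i < len(diffs):
        op, text = diffs[i]
        if i + 1 < len(diffs):
            next_op, next_text = diffs[i + 1]
            if (op, next_op) == (DIFF_DELETE, DIFF_INSERT):
                merged.append((DIFF_UPDATE, next_text, text))
                i += 2  # both entries are consumed by the update
                continue
            if (op, next_op) == (DIFF_INSERT, DIFF_DELETE):
                merged.append((DIFF_UPDATE, text, next_text))
                i += 2
                continue
        merged.append((op, text))  # anything else passes through unchanged
        i += 1
    return merged


diffs = [(DIFF_DELETE, "old"), (DIFF_INSERT, "new"), (DIFF_EQUAL, " text")]
print(merge_updates(diffs))
# [(2, 'new', 'old'), (0, ' text')]
```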

Tomp0801 commented 1 year ago

Alright, I made a pull request. It considers insert/delete and delete/insert.