dolthub / dolt

Dolt – Git for Data
Apache License 2.0

Dolt diff does not show diff inside JSON #3468

Closed mitar closed 2 years ago

mitar commented 2 years ago

When I use the JSON support as described here and run a diff, I notice that the diff does not show what changed inside the JSON; it only shows that the whole JSON value changed:

+-----+----+---------------------------------------------+
|     | id | doc                                         |
+-----+----+---------------------------------------------+
|  <  | 1  | [{"a": 1, "b": 5}, {"e": 2.71, "pi": 3.14}] |
|  >  | 1  | [{"a": 1, "b": 6}, {"e": 2.71, "pi": 3.14}] |
+-----+----+---------------------------------------------+

My understanding was that JSON is converted to a noms structure, so deep diffing should be possible? Is this just a limitation of the diff tool, or is there a more fundamental issue that prevents diffing nested structures?
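For illustration, here is a minimal sketch of what a deep diff of the two `doc` values above could report. It is not how Dolt or noms diff JSON; `json_diff` and the `MISSING` sentinel are hypothetical names, and lists of different lengths are simply reported as one whole-value change.

```python
from typing import Any, Iterator, Tuple

# Sentinel for "key absent on this side" (hypothetical helper, not a Dolt concept).
MISSING = object()


def json_diff(old: Any, new: Any, path: str = "") -> Iterator[Tuple[str, Any, Any]]:
    """Yield (path, old_value, new_value) for every differing leaf."""
    if isinstance(old, dict) and isinstance(new, dict):
        # Recurse into the union of keys so additions and removals show up too.
        for key in sorted(old.keys() | new.keys()):
            yield from json_diff(
                old.get(key, MISSING), new.get(key, MISSING), f"{path}/{key}"
            )
    elif isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        # Same-length lists are compared position by position.
        for index, (a, b) in enumerate(zip(old, new)):
            yield from json_diff(a, b, f"{path}/{index}")
    elif old != new:
        # Scalars, type changes, and lists of different lengths land here.
        yield (path, old, new)


old_doc = [{"a": 1, "b": 5}, {"e": 2.71, "pi": 3.14}]
new_doc = [{"a": 1, "b": 6}, {"e": 2.71, "pi": 3.14}]
changes = list(json_diff(old_doc, new_doc))
print(changes)
```

For the two rows in the table, this reports a single change at path `/0/b` instead of two whole-value rows.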

fulghum commented 2 years ago

Thanks for opening this issue @mitar – this would be a nice enhancement to the diff experience for people working with JSON data. We have the info we need to detect the JSON difference, so it seems like the work would be figuring out the best way to display the JSON diff and updating the diff output logic.

bpf120 commented 2 years ago

Hi @mitar, thanks again for making this issue. We'd love to learn more about how you're using Dolt. Feel free to email me (brianf@dolthub.com) or swing by our Discord.

https://discord.com/invite/RFwfYpu

mitar commented 2 years ago

I am not yet using it; I am exploring the available options. I have been working on a bunch of new-generation open source collaboration apps (like a new-generation Wikipedia, Google Docs, etc.). I have been using Meteor until now, but I think I have stretched it as far as possible, and I am now planning to work on my own backend/database implementation. But before I do that, I wanted to check what else exists that I may have missed. So I found Dolt, and I really love the stuff you are doing. It is not completely aligned with my needs (for example, I would also need a way to store the exact patch between two versions of the data, not just a diff, because a diff is inferred and not uniquely defined by two versions, while a patch is exact and can then be used for operational transform and such), but I still wanted to see what the plans are for some of the features that would help me a lot if I ended up using it. Being schema-less and JSON-based is something I have found very useful in general.

timsehn commented 2 years ago

dolt diff -r sql produces the patch :-)

Try it!

--Tim


mitar commented 2 years ago

I know that you can diff two versions, but diffing two versions does not necessarily tell you how the later version came to be. The changes that were actually made to produce the later version might differ from what the diffing algorithm determines. (Please keep in mind that I am primarily interested in diffing JSON values.) So when I apply a JSON Patch to one value to get a new version of that JSON, the diff will not necessarily recover exactly the same patch.
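A small sketch of that point: two different patches that produce the same final value, so the end state alone cannot tell you which one was applied. `apply_patch` below is a hypothetical, deliberately tiny subset of JSON Patch (only `add`, `remove`, `replace`, and `move`, with no `-` index and no root path), not a full RFC 6902 implementation.

```python
import copy
from typing import Any, Dict, List, Tuple


def _walk(doc: Any, path: str) -> Tuple[Any, Any]:
    """Return (parent container, final key or index) for a JSON Pointer path."""
    parts = [p.replace("~1", "/").replace("~0", "~") for p in path.split("/")[1:]]
    for part in parts[:-1]:
        doc = doc[int(part)] if isinstance(doc, list) else doc[part]
    last = parts[-1]
    return doc, (int(last) if isinstance(doc, list) else last)


def _insert(parent: Any, key: Any, value: Any) -> None:
    """Insert into a list (shifting later items) or set a key on an object."""
    if isinstance(parent, list):
        parent.insert(key, value)
    else:
        parent[key] = value


def apply_patch(doc: Any, patch: List[Dict[str, Any]]) -> Any:
    """Apply a list of operations to a deep copy of doc and return the copy."""
    doc = copy.deepcopy(doc)
    for op in patch:
        if op["op"] == "move":
            source, source_key = _walk(doc, op["from"])
            value = source.pop(source_key)
            # The target path is resolved after the value has been removed.
            target, target_key = _walk(doc, op["path"])
            _insert(target, target_key, value)
            continue
        parent, key = _walk(doc, op["path"])
        if op["op"] == "remove":
            parent.pop(key)
        elif op["op"] == "replace":
            parent[key] = op["value"]
        elif op["op"] == "add":
            _insert(parent, key, op["value"])
        else:
            raise ValueError(f"unsupported op: {op['op']}")
    return doc


original = ["intro", "body", "outro"]

# What the user actually did: moved the first paragraph down one place.
patch_move = [{"op": "move", "from": "/0", "path": "/1"}]

# What a position-by-position diff might infer instead: two replacements.
patch_replace = [
    {"op": "replace", "path": "/0", "value": "body"},
    {"op": "replace", "path": "/1", "value": "intro"},
]

via_move = apply_patch(original, patch_move)
via_replace = apply_patch(original, patch_replace)
print(via_move == via_replace)
```

Both patches end at the same list, but only the first one records that a paragraph was moved.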

timsehn commented 2 years ago

You're right. It's not a query log. dolt diff -r sql will give you an update query that updates the whole JSON blob.

But if you apply the diff you will get the same storage result? What case are you imagining where this could produce different results? I'd like to understand better.

--Tim


mitar commented 2 years ago

Sure.

Yes, the result is the same and everything is fine ... until there is a merge conflict. Once there is a merge conflict, you have to resolve it: manually, or using operational transform, or CRDTs, or something else. At that moment it becomes critical to know exactly how the change happened, so that you have a higher chance of resolving it automatically in a way that makes sense to a human user.

An example: JSON Patch has a move operation. Now imagine that you have a document where somebody moves one paragraph from place A to place B. At the same time, somebody else turns part of the text inside the same paragraph into a link. Both changes are pushed to the server and you have a merge conflict. If you just compute a diff of the first change, you see rows removed and rows added, and it is hard to figure out how to resolve the merge conflict. But if you have the patch, you see that the paragraph was moved. Then it is easy: you move the paragraph and make the link in the text at the new location.

I think the main point is that patches can be expressed in a richer language than a simple diff. Operational transform and CRDTs are examples of such richer languages, and so is JSON Patch. So yeah, you can compute a diff, but that diff might contain less information than a patch in a richer language would. Of course, if you compare patches written in the same language as diffs, it is harder to see the point. But even then, diffing algorithms have a hard time determining the minimal change. It could be that the patch contains this minimal change while the diffing algorithm finds a larger diff. And just because the diff is larger, you trigger a merge conflict, while a smaller patch might not overlap with another smaller patch.
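A toy sketch of the paragraph example, assuming paragraphs are stored as a JSON array of objects and both edits are known as JSON Patch-style operations. `rebase_path` is a hypothetical helper that only rewrites paths inside the moved subtree; a real operational transform would also have to adjust sibling indices and handle many more cases.

```python
import copy
from typing import Dict


def rebase_path(path: str, move: Dict[str, str]) -> str:
    """Rewrite a pointer so it follows a subtree relocated by a move op."""
    source, target = move["from"], move["path"]
    if path == source or path.startswith(source + "/"):
        return target + path[len(source):]
    # Paths outside the moved subtree are left untouched in this sketch.
    return path


paragraphs = [
    {"text": "See the docs."},
    {"text": "Intro."},
    {"text": "Outro."},
]

# Alice moves the first paragraph to the end.
alice_move = {"op": "move", "from": "/0", "path": "/2"}
# Bob, working from the same base version, adds a link to that paragraph.
bob_add = {"op": "add", "path": "/0/link", "value": "https://example.com"}

# Apply Alice's move first.
merged = copy.deepcopy(paragraphs)
moved_item = merged.pop(int(alice_move["from"].lstrip("/")))
merged.insert(int(alice_move["path"].lstrip("/")), moved_item)

# Because the move is known, Bob's path can be redirected to the new location.
rebased = rebase_path(bob_add["path"], alice_move)
_, index, field = rebased.split("/")
merged[int(index)][field] = bob_add["value"]

print(rebased)
print(merged)
```

With only a remove-rows/add-rows diff of Alice's change, there is no `from`/`path` pair to rebase against, which is the gap described above.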

timsehn commented 2 years ago

So, Dolt will throw a conflict in this case. But you are correct: without more introspection into the source of the changes, automated conflict resolution would certainly be more challenging.

Obviously, a patch workflow kind of just stomps on any conflicting merges.

I understand your use case now.

Dolt's support for JSON is more at "the cell level". We make an effort to de-dupe data at the storage layer using our storage engine. We could also use that same logic to produce better diffs and conflict resolution but the challenge there is UI as @fulghum alluded to. However, our focus is definitely more on tabular data. So, I'm not sure how much Dolt improvement you can expect in the short term in the JSON space.

mitar commented 2 years ago

We could also use that same logic to produce better diffs

As stated (though I do not have a proof, just anecdotal experience, which somebody may someday prove wrong), no amount of diffing can reconstruct the original patch. :-)

What is needed is to store the patch itself (together with the resulting data). It can be optional, stored only when it is known. For example, in Dolt you could imagine storing every SQL statement that changed data as-is (I think you reconstruct it now?). It is like storing .sqlhistory in the database as well. You are currently storing it in a file. Why? You could just make it part of the data history as well. :-) And I could then also store the JSON Patch for changes to a JSON value.

Anyway, I didn't want to try to get Dolt to implement this, but you asked and I tried to explain my use case and what I was searching for. I could not find it anywhere. So I will probably do my own. :-)

This issue is just about improving diffing of JSON (and not about storing patches).

timsehn commented 2 years ago

The problem with storing patches/history is there is no compression. The further you go back in time the slower it gets. This is like version control pre-Git. Dolt uses a content-addressed binary tree (new) and a Merkle DAG (like Git) to allow fast querying of any revision in history and fast diff/merge.
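As a rough illustration of why content addressing helps here, the sketch below stores every row and every commit under the hash of its content. This is not Dolt's storage format (Dolt chunks data into a tree, while this is a flat list of row hashes), and `ChunkStore`, `commit`, and `changed_rows` are hypothetical names.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


class ChunkStore:
    """Maps the SHA-256 of a serialized value to its bytes."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}

    def put(self, value: Any) -> str:
        data = json.dumps(value, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(data).hexdigest()
        # Identical content always hashes to the same key, so it is stored once.
        self.chunks[digest] = data
        return digest

    def get(self, digest: str) -> Any:
        return json.loads(self.chunks[digest])


def commit(store: ChunkStore, rows: List[Any], parent: Optional[str] = None) -> str:
    """Store each row, then a commit node pointing at the rows and its parent."""
    row_hashes = [store.put(row) for row in rows]
    return store.put({"parent": parent, "rows": row_hashes})


def changed_rows(store: ChunkStore, old_commit: str, new_commit: str) -> List[int]:
    """Compare two commits by hash only; equal hashes mean equal rows."""
    old_rows = store.get(old_commit)["rows"]
    new_rows = store.get(new_commit)["rows"]
    return [i for i, (a, b) in enumerate(zip(old_rows, new_rows)) if a != b]


store = ChunkStore()
first = commit(store, [{"id": 1, "b": 5}, {"id": 2, "b": 7}])
second = commit(store, [{"id": 1, "b": 6}, {"id": 2, "b": 7}], parent=first)

print(len(store.chunks))
print(changed_rows(store, first, second))
```

Any commit can be read directly from its hash without replaying earlier changes, and the unchanged row is shared between the two versions instead of being stored twice.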

Good chat though :-)

mitar commented 2 years ago

You store both the patches and the data. :-) It is like storing data, indices, and an oplog forever. :-) You can use whichever you need (data for scanning, indices for quick access, and the oplog for replication when a peer comes back, and for automatic conflict resolution if it has conflicts in its local copy).

Anyway, this is just theoretical from my side at the moment. I will try to find time to implement something and then we can see what happens. :-)

timsehn commented 2 years ago

Closing in favor of this: #2429