dolthub / dolt

Dolt – Git for Data
Apache License 2.0
17.45k stars 498 forks source link

dolt diff does not display added and corresponding deleted rows next to each other #7808

Open liuliu-dev opened 2 months ago

liuliu-dev commented 2 months ago

this affects how we display diff rows on dolthub

repro:

related dolthub pull: https://www.dolthub.com/repositories/liuliu/test1/pulls/11/compare

nicktobey commented 2 months ago

I'm not sure this is even possible to fix without breaking history-independence.

Commits store the current state of the table, but they don't track the commands that led to that state. This is on purpose: two tables with the same state should have the same hash regardless of the sequence of commands that were run.

This means that there's no way for the commit to know that the added and deleted row were from the same update operation.

liuliu-dev commented 2 months ago

@nicktobey Seems this is not about the sequence of commands? I might be wrong since I'm not familiar with dolt implementation...

A modified row is displayed as deleted and added rows, and we hope to display them next to each other. But when a commit change happens in primary keys, it might mess up this order, since the diff rows are sorted by primary keys.

Here is another example that Tim found if that's clearer: https://www.dolthub.com/repositories/dolthub/hospital-price-transparency-v3/pulls/169/compare, where the values of column code changed from 51079- 0667- to 51079- 0667 in this commit, makes all the added rows go up.

Screenshot 2024-05-14 at 9 33 20 AM

nicktobey commented 2 months ago

The reason I mentioned the sequence of commands is because, for instance, the diff you linked to could have happened in multiple ways:

The diff operation only sees the start and end state of the table, so it can't tell the difference between these two possibilities. It has to display the same diff for both.

One thing we could do is attempt to detect when a change was likely an update to a primary key, match each row-add to its corresponding row-remove, and display them next to each other. It means that would do this even if the change wasn't the result of changing a primary key, as long as it could have been... but that's probably rare enough that we don't mind.

The question is how we would detect these pairs of rows and whether we can do it efficiently.

liuliu-dev commented 2 months ago

I see now, thanks for the clarification :)