datafold / data-diff

Compare tables within or across databases
https://docs.datafold.com
MIT License

Detect duplicate rows on each side #850

Closed nolar closed 10 months ago

nolar commented 10 months ago

A known flaw: if both sides contain identical duplicated rows, e.g.:

A: [pk=1000, val=hello], [pk=1000, val=hello]
B: [pk=1000, val=hello], [pk=1000, val=hello]

… then we might not notice them, even at the level of checksum scanning of table segments. If the segments are fully equal, these dupes will never be yielded: neither with -/+, nor with a potentially different informational marker * introduced specifically for dupes. They will only be noticed in segments that have some other (unrelated) differences, which makes this dupe detection not fully reliable.
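
To illustrate the flaw, here is a minimal sketch using Python's built-in sqlite3 module. It is not data-diff's actual code: `segment_checksum` is a hypothetical stand-in for a per-segment checksum, and `find_duplicate_keys` is a hypothetical per-side check using a plain `GROUP BY ... HAVING COUNT(*) > 1` query. The tables `a` and `b` mirror the example above.

```python
import hashlib
import sqlite3


def segment_checksum(rows):
    """Order-independent checksum over a segment's rows.

    Hypothetical stand-in for a per-segment checksum; not data-diff's real one.
    """
    total = 0
    for row in rows:
        digest = hashlib.md5(repr(row).encode()).hexdigest()
        total += int(digest[:16], 16)  # use the first 64 bits of each row hash
    return total % (2**64)


def find_duplicate_keys(conn, table):
    """Return (pk, count) pairs for keys that occur more than once in one table.

    The table name is interpolated directly, which is fine for a sketch only.
    """
    query = f"SELECT pk, COUNT(*) FROM {table} GROUP BY pk HAVING COUNT(*) > 1"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
for table in ("a", "b"):
    conn.execute(f"CREATE TABLE {table} (pk INTEGER, val TEXT)")
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)",
        [(1000, "hello"), (1000, "hello")],  # the same duped row on both sides
    )

rows_a = conn.execute("SELECT pk, val FROM a").fetchall()
rows_b = conn.execute("SELECT pk, val FROM b").fetchall()

# The segments are identical, so a checksum comparison reports no difference
# and the segment is never scanned row by row.
checksums_match = segment_checksum(rows_a) == segment_checksum(rows_b)

# A separate scan of each side still finds the duplicated key.
dupes_a = find_duplicate_keys(conn, "a")
dupes_b = find_duplicate_keys(conn, "b")
```

In this sketch the checksums match even though both sides carry a duplicated pk, so only the independent per-side query surfaces the dupes. Whether such an extra query per table is acceptable cost-wise is a separate question.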