erezsh / reladiff

High-performance diffing of large datasets across databases
https://reladiff.readthedocs.io/en/latest/index.html#
Other
356 stars 9 forks source link

Duplicate rows support in HashDiffer's diff_sets() is partial #43

Closed alex-mirkin closed 1 month ago

alex-mirkin commented 1 month ago

Describe the bug

HashDiffer’s handling of duplicate rows only partially supports expected behavior. Specifically, there are two scenarios that need to be addressed:

  1. A row appears multiple times in one table (e.g., table A) but not in the other, and each occurrence should be flagged as a difference.
  2. A row exists in both tables (A and B), but with different numbers of duplicates - for example, 4 occurrences in A and 2 in B. In this case, 2 rows should be reported as missing from table B.

While the first scenario is correctly handled, the second one is not.

Steps to reproduce - use the following values:

a = [
    (12, "ABCDE"),
    (12, "ABCDE"),
    (6, "ABCDE")
]

b = [
    (4, "ABCDEF"),
    (4, "ABCDE"),
    (4, "ABCDE"),
    (6, "ABCDE"),
    (6, "ABCDE")
]

The expected output:

('+', (4, 'ABCDEF'))
('+', (4, 'ABCDE'))
('+', (4, 'ABCDE'))
('-', (12, 'ABCDE'))
('-', (12, 'ABCDE'))
('-', (6, 'ABCDE'))

The actual output:

('+', (4, 'ABCDEF'))
('+', (4, 'ABCDE'))
('+', (4, 'ABCDE'))
('-', (12, 'ABCDE'))
('-', (12, 'ABCDE'))

In the actual output, the (6, "ABCDE") row difference is not detected, even though it exists once in list a but twice in list b.

erezsh commented 1 month ago

Thanks, I'll try to fix it soon.

erezsh commented 1 month ago

I pushed the fix to PR #44.

Since you already have benchmarks in place, could I inconvenience you to check that there is not too much degradation in performance?

alex-mirkin commented 1 month ago

Looks good! Sure, the fixed version has about 90% speed of the original one on the smaller list sizes (5000) and about the same performance on the larger ones (50,000, 100,000).

erezsh commented 1 month ago

Great, thanks!