erezsh / reladiff

High-performance diffing of large datasets across databases
https://reladiff.readthedocs.io/en/latest/index.html#

Add option to skip sorting in hashdiff for improved performance #45

Closed alex-mirkin closed 1 month ago

alex-mirkin commented 2 months ago

Added an option to skip sorting the results in hashdiff, which improves performance when there is a large number of differences. When the option is enabled, entries with the same key but different column values may not appear adjacent in the output. This change was discussed in https://github.com/erezsh/reladiff/pull/41.
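To illustrate the trade-off, here is a minimal sketch in plain Python. It is not reladiff's code: `diff_rows` and its `skip_sort` parameter are hypothetical names used only for this example, and rows are assumed to be `(key, value)` tuples.

```python
from typing import List, Tuple

Row = Tuple[int, str]          # (key, value)
DiffEntry = Tuple[str, Row]    # ('-', row) or ('+', row)


def diff_rows(rows1: List[Row], rows2: List[Row], skip_sort: bool = False) -> List[DiffEntry]:
    """Toy local diff of two row lists (hypothetical, not reladiff's implementation)."""
    set1, set2 = set(rows1), set(rows2)
    # Rows found only in the first list come first, then rows found only in
    # the second list, each group in the order it was supplied.
    entries: List[DiffEntry] = [("-", r) for r in rows1 if r not in set2]
    entries += [("+", r) for r in rows2 if r not in set1]
    if not skip_sort:
        # A stable sort by key places the '-' and '+' entries of a changed
        # row next to each other.
        entries.sort(key=lambda e: e[1][0])
    return entries


a = [(3, "c"), (1, "a"), (2, "b")]
b = [(3, "C"), (1, "a"), (2, "B")]

sorted_diff = diff_rows(a, b)
unsorted_diff = diff_rows(a, b, skip_sort=True)
```

With sorting, the `-` and `+` entries for keys 2 and 3 are paired up. Without it, all `-` entries come before all `+` entries, so the two versions of a changed row are no longer adjacent, but the sort cost is avoided.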

alex-mirkin commented 2 months ago

Could you advise on what the tests should look like? I’m not entirely sure about the best approach to test this functionality within the existing testing framework.

erezsh commented 2 months ago

You could place your test in test_diff_tables. You can run a diff that covers only one segment and check that the order is preserved (assuming the data is not already in sorted order).

I think you can ensure a single segment by setting a very low bisection-factor of 1 or 2 and a high bisection threshold.
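Roughly, the test could have the shape below. This is only a sketch: `toy_single_segment_diff` is a hypothetical stand-in for a diff that runs over exactly one segment, and its `skip_sort` argument is an assumed name. A real test in test_diff_tables would call the actual differ with the low bisection-factor and high bisection threshold instead.

```python
import unittest


def toy_single_segment_diff(rows1, rows2, skip_sort):
    """Hypothetical stand-in for a one-segment diff; not reladiff code."""
    only1 = [("-", r) for r in rows1 if r not in rows2]
    only2 = [("+", r) for r in rows2 if r not in rows1]
    entries = only1 + only2
    # Default behaviour: stable sort by key. With skip_sort, keep input order.
    return entries if skip_sort else sorted(entries, key=lambda e: e[1][0])


class TestSkipSort(unittest.TestCase):
    def setUp(self):
        # Keys are deliberately out of order, so that a sorted result can be
        # told apart from an unsorted one.
        self.rows1 = [(5, "e"), (1, "a"), (3, "c")]
        self.rows2 = [(5, "E"), (1, "A"), (3, "c")]

    def test_order_preserved_when_sort_skipped(self):
        result = toy_single_segment_diff(self.rows1, self.rows2, skip_sort=True)
        keys = [row[0] for _sign, row in result]
        self.assertEqual(keys, [5, 1, 5, 1])

    def test_sorted_by_default(self):
        result = toy_single_segment_diff(self.rows1, self.rows2, skip_sort=False)
        keys = [row[0] for _sign, row in result]
        self.assertEqual(keys, [1, 1, 5, 5])
```

The important part is the fixture: if the rows were already sorted by key, both tests would see the same order and the check would prove nothing.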

erezsh commented 1 month ago

Thanks for the PR!