erezsh / reladiff

High-performance diffing of large datasets across databases
https://reladiff.readthedocs.io/en/latest/index.html#

Diff sets optimize #41

Closed alex-mirkin closed 2 months ago

alex-mirkin commented 2 months ago

Fixed the `diff_sets()` signature, and used set operators to find the symmetric difference. Performance improved by about 1.8x in local benchmarks of various dataset sizes (1,000, 5,000, and 50,000 rows).

Some thoughts: do we need the results to be sorted by key? I know it's convenient in the log, but maybe we can add an option to skip the sort in cases where it's irrelevant, for example when writing the diffs to the database and analyzing them with SQL, and a large number of diffs is expected.
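For illustration, a minimal sketch of both ideas (set operators plus an optional sort), assuming rows are hashable tuples; the name and signature here are simplified stand-ins, not reladiff's actual API:

```python
def diff_sets(a, b, sort_by_key=True):
    # Symmetric difference via set operators instead of a per-row loop.
    sa, sb = set(a), set(b)
    diff = [("-", row) for row in sa - sb] + [("+", row) for row in sb - sa]
    # Sorting keeps the log readable; skipping it helps when the diffs go
    # straight to a database for analysis with SQL.
    return sorted(diff, key=lambda d: d[1]) if sort_by_key else diff
```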

erezsh commented 2 months ago

The failures are happening because we support duplicate rows (which the set eliminates).
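For illustration (my example, not from the PR), converting the rows to a set collapses duplicates, so a duplicate-only difference goes undetected:

```python
rows_a = [(1, "x"), (1, "x"), (2, "y")]  # (1, "x") appears twice
rows_b = [(1, "x"), (2, "y")]

# The symmetric difference of the sets is empty, even though rows_a
# contains an extra duplicate of (1, "x"):
print(set(rows_a) ^ set(rows_b))  # set()
```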

erezsh commented 2 months ago

I'm generally in favor of adding options to disable/enable certain features, if the difference is significant. But I'm a bit surprised that the performance difference when sorting is so noticeable for you.

alex-mirkin commented 2 months ago

I see, totally missed that.

However, I'm wondering if it is useful to show all duplicate instances in the log instead of just one. Also, if the motivation is supporting duplicate rows: if a row exists five times in table A and only once in table B, will a difference be detected?

erezsh commented 2 months ago

> However, I'm wondering if it is useful to show all duplicate instances in the log instead of just one.

> if a row exists five times in table A and only once in table B, will a difference be detected?

Yes, that's the intended behavior.

When the goal is to detect any changes, a deduplicated table is technically different.

However, there is already a switch, `--assume-unique-key`, that currently only applies to the joindiff algorithm, but perhaps we can also apply it to the diffing function.

And also, perhaps we can query the schema for uniqueness constraints, and then assume the key is unique even without the switch (but that would require an implementation per database).
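For example, on PostgreSQL such a probe could go through `information_schema` (an illustrative sketch only; the query is simplified, ignores schema names and composite keys, and other databases would need their own variant):

```python
# Illustrative sketch: check whether a column is covered by a PRIMARY KEY
# or UNIQUE constraint on PostgreSQL. Simplified: ignores schema names and
# composite keys; other databases would need their own variant.
UNIQUE_KEY_QUERY = """
    SELECT kcu.column_name
    FROM information_schema.table_constraints tc
    JOIN information_schema.key_column_usage kcu
      ON tc.constraint_name = kcu.constraint_name
    WHERE tc.table_name = %s
      AND tc.constraint_type IN ('PRIMARY KEY', 'UNIQUE')
"""

def key_is_unique(cursor, table_name, key_column):
    cursor.execute(UNIQUE_KEY_QUERY, (table_name,))
    return key_column in {row[0] for row in cursor.fetchall()}
```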

alex-mirkin commented 2 months ago

From my understanding, supporting duplicate rows involves handling two cases:

  1. A row appears exclusively in one of the tables (e.g., table A) as multiple duplicates, and each occurrence should be returned as a diff.
  2. A row exists in both tables (A and B), but with a different number of duplicates, such as 4 times in A and 2 times in B. In this case, 2 differences should be reported as missing in table B (see the sketch below).
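A possible way to handle the second scenario, sketched with multisets via `collections.Counter` (my illustration, not reladiff's code):

```python
from collections import Counter

def multiset_diff(rows_a, rows_b):
    # Counter subtraction keeps only positive counts, so each side of the
    # result holds the rows (with multiplicity) that the other table lacks.
    ca, cb = Counter(rows_a), Counter(rows_b)
    return list((ca - cb).elements()), list((cb - ca).elements())

a = [("k1",)] * 4 + [("k2",)]
b = [("k1",)] * 2 + [("k2",)]
missing_in_b, missing_in_a = multiset_diff(a, b)
assert missing_in_b == [("k1",), ("k1",)]  # the 2 differences from case 2
assert missing_in_a == []
```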

I believe the second scenario is more critical, but it doesn’t seem to be supported currently. Do you think we should consider adding support for this case?

As for querying the schema for uniqueness constraints, that sounds like an elegant solution, but I think more thought needs to be put into ensuring it’s clear to the end user what behavior they should expect with duplicate rows.

I like the idea of handling this by applying the --assume-unique-key switch to the diffing function. With that switch turned on, the user most likely wouldn’t expect reladiff to detect duplicates.

Regarding performance, as I mentioned earlier, I'm seeing an approximately 1.8x improvement when benchmarking the diff_sets function without duplicate row support (using set operators instead of a for loop), and this grows to 3.3x without sorting the output by key.

How do you feel about adding a `--skip-output-sort` switch so the user can choose between output readability and performance?
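A rough harness for the kind of measurement described (my own sketch, not the original benchmark): 50,000 rows per side with roughly 10% differing, comparing materializing the diff as-is versus sorted:

```python
import timeit

a = [(i, f"value-{i}") for i in range(50_000)]
b = a[:45_000] + [(i, "changed") for i in range(45_000, 50_000)]

diff = set(a) ^ set(b)  # 10,000 differing rows (5,000 from each side)

print("unsorted:", timeit.timeit(lambda: list(diff), number=100))
print("sorted:  ", timeit.timeit(lambda: sorted(diff), number=100))
```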

erezsh commented 2 months ago

> I believe the second scenario is more critical, but it doesn’t seem to be supported currently.

Interesting! If it doesn't work, I consider it a bug, and it should be fixed.

> I think more thought needs to be put into ensuring it’s clear to the end user what behavior they should expect with duplicate rows.

Can you expand on that? The way I'm picturing it, it would simply be an optimization that shouldn't affect the results; i.e., detecting a unique key would have the same effect as `--assume-unique-key`, except it wouldn't require user interaction.

> and this grows to 3.3x without sorting the output by key.

How many differences are detected (i.e. sorted) in this benchmark?

> How do you feel about adding a `--skip-output-sort` switch so the user can choose between output readability and performance?

I'm okay with it.

One thing though, I think these are separate issues, and it's probably better to address them in 3 separate PRs.

alex-mirkin commented 2 months ago

> Interesting! If it doesn't work, I consider it a bug, and it should be fixed.

I'll open a bug report issue for that.

> Can you expand on that?

Currently, duplicate record detection is always enabled, but if we make it configurable via the existing switch, how would that interact with querying the schema for uniqueness constraints? Specifically, how would we reconcile the switch with the schema feature, which would enable or disable duplicate detection on a per-table basis? Or is your idea to implement the schema querying feature instead of applying the existing switch?

> How many differences are detected (i.e. sorted) in this benchmark?

I tested several list sizes: 100, 1,000, 5,000, 50,000, and 100,000, with 10% of the records being diffs in each test. The higher the percentage of diffs, the greater the performance improvement from skipping the sort.

With a small number of diffs the performance difference is negligible, but I believe having this option would provide greater flexibility, especially when a large number of diffs is expected or when individual records are large.

> it’s probably better to address them in 3 separate PRs.

Good point.

alex-mirkin commented 2 months ago

I'm closing this one.