erezsh / reladiff

High-performance diffing of large datasets across databases
https://reladiff.readthedocs.io/en/latest/index.html#
Other
328 stars 5 forks source link

Allow empty table in reladiff #21

Closed Komalis closed 1 month ago

Komalis commented 1 month ago

Here is a quick example on why we can't use reladiff on empty tables.

Following : https://github.com/erezsh/sqeleton/issues/16

CREATE TABLE public.toast_one (
    id INTEGER,
    name TEXT
);

CREATE TABLE public.toast_two (
    id INTEGER,
    name TEXT
);

INSERT INTO public.toast_one (
    id,
    name
)
VALUES
(1, 'Hello');

Then

reladiff -k id -s -j4 postgresql://xxxx:xxxxx@xxxx:xxxx/xxxx public.toast_one postgresql://xxxx:xxxxx@xxxx:xxxx/xxxx public.toast_two --verbose
[10:43:40] INFO - [PostgreSQL] Starting a threadpool, size=4.
[10:43:41] INFO - Diffing schemas...
[10:43:41] INFO - Diffing using columns: key=('id',) update=None extra=().
[10:43:41] INFO - Using algorithm 'joindiff'.
[10:43:41] INFO - Validating that the are no duplicate keys in columns: ['id']
[10:43:41] INFO - Diffing segments at key-range: (1)..(2). size: table1 <= 1, table2 <= 1
[10:43:41] INFO - . Diffing segment 1/1, key-range: (1)..(2), size <= None
[10:43:42] INFO - Validating that the are no duplicate keys in columns: ['id']
[10:43:42] ERROR - Table appears to be empty

or with a python script

from reladiff import Algorithm, connect_to_table, diff_tables

table1 = connect_to_table(
    "xxxx",
    "public.toast_one",
    ("id",),
    thread_count=8,
)
table2 = connect_to_table(
    "xxxxx",
    "public.toast_two",
    ("id",),
    thread_count=8,
)

missing_count = 0
new_count = 0

for sign, row in diff_tables(
    table1,
    table2,
    algorithm=Algorithm.JOINDIFF,
    threaded=True,
    max_threadpool_size=8,
    validate_unique_key=False,
):
    print(f"{sign} {row}")
    if sign == "-":
        missing_count += 1
    elif sign == "+":
        new_count += 1

Traceback (most recent call last):
  File "/home/xxxxxxxx/reladiff/main.py", line 19, in <module>
    for sign, row in diff_tables(
  File "/home/xxxxxxxx/.local/share/mise/installs/python/3.11.9/lib/python3.11/site-packages/reladiff/diff_tables.py", line 98, in __iter__
    for i in self.diff:
  File "/home/xxxxxxxx/.local/share/mise/installs/python/3.11.9/lib/python3.11/site-packages/reladiff/diff_tables.py", line 183, in _diff_tables_wrapper
    yield from self._diff_tables_root(table1, table2, info_tree)
  File "/home/xxxxxxxx/.local/share/mise/installs/python/3.11.9/lib/python3.11/site-packages/reladiff/joindiff_tables.py", line 161, in _diff_tables_root
    yield from self._bisect_and_diff_tables(table1, table2, info_tree)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxxxxxxx/.local/share/mise/installs/python/3.11.9/lib/python3.11/site-packages/reladiff/diff_tables.py", line 253, in _bisect_and_diff_tables
    min_key2, max_key2 = self._parse_key_range_result(key_types1, next(key_ranges))
                                                                  ^^^^^^^^^^^^^^^^
  File "/home/xxxxxxxx/.local/share/mise/installs/python/3.11.9/lib/python3.11/site-packages/reladiff/diff_tables.py", line 65, in _thread_as_completed
    yield future.result()
          ^^^^^^^^^^^^^^^
  File "/home/xxxxxxxx/.local/share/mise/installs/python/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/xxxxxxxx/.local/share/mise/installs/python/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/xxxxxxxx/.local/share/mise/installs/python/3.11.9/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxxxxxxx/.local/share/mise/installs/python/3.11.9/lib/python3.11/site-packages/reladiff/table_segment.py", line 279, in query_key_range
    raise ValueError("Table appears to be empty")
ValueError: Table appears to be empty

The behaviour that I was expecting is that reladiff tells me that the diff is

- (1, "Hello")

or

+ (1, "Hello")

Depending on which table we are doing the compare.

erezsh commented 1 month ago

Do you need to diff against empty tables in a real use-case, or is it just for tests?

Yes i need to diff against empty tables in a real use case, it's even more useful because sometimes I use the "where" argument, which gives make an empty result, and it is expected.

That makes sense!

Would adding a --allow-empty flag be a good enough solution for you?

erezsh commented 1 month ago

p.s. I think that error on empty probably wasn't the best design choice. Maybe in some future version I'll remove it completely.

Komalis commented 1 month ago

Do you need to diff against empty tables in a real use-case, or is it just for tests?

Yes i need to diff against empty tables in a real use case, it's even more useful because sometimes I use the "where" argument, which gives make an empty result, and it is expected.

That makes sense!

Would adding a --allow-empty flag be a good enough solution for you?

Yes it would be very useful! 🙏

erezsh commented 1 month ago

@Komalis I have an initial implementation in PR #22 .

You're welcome to try it. Just don't forget to upgrade sqeleton to 0.1.5

I'll some tests and documentation soon.