datafold / data-diff

Compare tables within or across databases
https://docs.datafold.com
MIT License

Getting "ValueError: range() arg 3 must not be zero" error for multi iteration checks #197

Closed gaurav1308 closed 1 year ago

gaurav1308 commented 2 years ago

We are evaluating data-diff for our use case. We hit an issue when multiple bisection iterations are performed, i.e. when we reduce --bisection-threshold. Everything works when --bisection-threshold is high enough that the whole diff is done in one iteration:

data-diff trino://gaurav.singh@razorpay.com@trino-dev-coordinator-service.trino-dev.svc.cluster.local:8080/hive/sqoop_api sqoop_api.merchants trino://gaurav.singh@razorpay.com@trino-dev-coordinator-service.trino-dev.svc.cluster.local:8080/hive/realtime_hudi_api realtime_hudi_api.merchants -k id -v --json --bisection-factor 9 --bisection-threshold 100000 --max-age=7d -t created_date -c name -c email -c second_factor_auth -c restricted -c parent_id -c fee_model --min-age=1d -s -w "updated_at<1659724200 and created_date<'2022-08-08'"

In the second case, we reduced --bisection-threshold far enough that the diff can't be completed in one iteration:

data-diff trino://gaurav.singh@razorpay.com@trino-dev-coordinator-service.trino-dev.svc.cluster.local:8080/hive/sqoop_api sqoop_api.merchants trino://gaurav.singh@razorpay.com@trino-dev-coordinator-service.trino-dev.svc.cluster.local:8080/hive/realtime_hudi_api realtime_hudi_api.merchants -k id -v --json --bisection-factor 9 --bisection-threshold 1000 --max-age=7d -t created_date -c name -c email -c second_factor_auth -c restricted -c parent_id -c fee_model --min-age=1d -s -w "updated_at<1659724200 and created_date<'2022-08-08'"

This run fails with the following error:

ValueError: range() arg 3 must not be zero

  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 600, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 433, in result
    return self.__get_result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/dist-packages/data_diff/diff_tables.py", line 493, in _diff_tables
    yield from self._bisect_and_diff_tables(table1, table2, level=level, max_rows=max(count1, count2))
  File "/usr/local/lib/python3.9/dist-packages/data_diff/diff_tables.py", line 446, in _bisect_and_diff_tables
    checkpoints = table1.choose_checkpoints(self.bisection_factor - 1)
  File "/usr/local/lib/python3.9/dist-packages/data_diff/diff_tables.py", line 180, in choose_checkpoints
    checkpoints = split_space(self.min_key.int, self.max_key.int, count)
  File "/usr/local/lib/python3.9/dist-packages/data_diff/utils.py", line 19, in split_space
    return list(range(start, end, (size + 1) // (count + 1)))[1 : count + 1]
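
The last frame of the traceback can be exercised in isolation. The sketch below is a minimal reproduction, not the library's actual source: it assumes `size` is `end - start` (the traceback shows only the `return` line), and `step_for` is a hypothetical helper added for illustration. Under that assumption, the step passed to `range()` is zero whenever `end - start + 1 < count + 1`, i.e. when a segment's key range is narrower than the number of pieces it is being split into. Whether that is what happened in this run depends on the actual key values.

```python
def split_space(start, end, count):
    # Assumption: size is the width of the key range. Only the return
    # line below appears in the traceback.
    size = end - start
    return list(range(start, end, (size + 1) // (count + 1)))[1 : count + 1]


def step_for(start, end, count):
    # Hypothetical helper: the step that split_space passes to range().
    return (end - start + 1) // (count + 1)


# A wide key range yields a positive step and `count` checkpoints.
print(split_space(0, 99, 9))

# A key range narrower than the number of pieces yields a step of zero,
# which range() rejects with the ValueError seen in the report.
print(step_for(5, 8, 9))
try:
    split_space(5, 8, 9)
except ValueError as exc:
    print("ValueError:", exc)
```

With `--bisection-factor 10`, `choose_checkpoints` is called with `self.bisection_factor - 1`, so `count` is 9 in the call above.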

gaurav1308 commented 2 years ago

PS: We are using this branch because we have alphanumeric IDs:

pip install git+https://github.com/datafold/data-diff.git@alphanum_ids

https://github.com/datafold/data-diff/issues/59#issuecomment-1194403178

erezsh commented 2 years ago

Thanks for reporting this. I can't reproduce it, so it would be helpful if you could let me know the values that are being used.

Before the line:

            checkpoints = split_space(self.min_key.int, self.max_key.int, count)

could you add:

            print("$$$$$", self.min_key, self.max_key, count)

and paste the results here?

gaurav1308 commented 2 years ago

These are the values: -k id -v --json --bisection-factor 10 --bisection-threshold 1000 --max-age=7d

gaurav1308 commented 2 years ago

It seems I don't have permission on GitHub to push the above change: "Permission to datafold/data-diff.git denied to gaurav1308."

erezsh commented 2 years ago

> These are the values

That's not what I asked..

> Permission to datafold/data-diff.git denied

Yes, of course. Why would you have permissions to push to data-diff? Contributions have to come in the form of pull requests.

gaurav1308 commented 2 years ago

Params and inputs:

data-diff trino://gaurav.singh@razorpay.com@trino-dev-coordinator-service.trino-dev.svc.cluster.local:8080/hive/sqoop_api sqoop_api.merchants trino://gaurav.singh@razorpay.com@trino-dev-coordinator-service.trino-dev.svc.cluster.local:8080/hive/realtime_hudi_api realtime_hudi_api.merchants -k id --json --bisection-factor 10 --bisection-threshold 1000 --max-age=7d -t created_date -c name -c email -c second_factor_auth -c restricted -c parent_id -c fee_model -v

Attaching log file error.txt

@erezsh Let me know if that helps

erezsh commented 2 years ago

@gaurav1308 That's exactly what I need, thank you. Let me look into it and see if I can find the problem.

erezsh commented 2 years ago

We have a new implementation for alphanumerics in master, which I believe should fix this issue.

Sorry it took so long, but please try now and see if it helps.

gaurav1308 commented 1 year ago

Looks like this was fixed