erezsh / reladiff

High-performance diffing of large datasets across databases
https://reladiff.readthedocs.io/en/latest/index.html#
Other
366 stars 9 forks source link

Allow comparison of String_UUID and Text #51

Open snordhausen opened 3 weeks ago

snordhausen commented 3 weeks ago

After switching from data-diff to reladiff, one of our tables started failing with

TypeError: Incompatible types for column 'the_column_name': String_UUID() <-> Text()

As far as I can tell, the root cause is that this code in sqeleton/databases/base.py uses sampling to determine if a column is String_UUID. But when such a column is nullable, it depends on chance if the sample contains NULL values. If it does, sqeleton will use the Text type, otherwise it will use String_UUID.

This change ensures that such columns can be compared reliably. I'm not sure if the is the best way to do it, but in our internal testing it looks good.

erezsh commented 3 weeks ago

Thanks for reporting this and suggesting a solution.

I think a better solution would be to do better sampling, for example by adding WHERE pk_col IS NOT NULL. What do you think?

snordhausen commented 3 weeks ago

That sounds promising, I just don't know exactly how the sampling mechanism works. If the sampling happens column-by-column, it should work.

But if we apply the IS NOT NULL to all columns at the same time, wouldn't we risk not getting any samples at all? E.g when one of the columns is completely NULL because it is used for a new software feature that is not yet live.

erezsh commented 3 weeks ago

That's a good point. If a column is entirely null, there is no way to verify if it's UUID or not.

At the same time, if we use a column that is arbitrary text as a key, the diff might return incorrect results.

But it shouldn't be too hard to mark a column that is all null (or empty) as such. Then when comparing the column types, it could safely adopt whatever is the type in the other table.