GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0

Accented string causes validation_status=fail in TD VS BQ row validation #1170

franklinWhaite closed 1 month ago

franklinWhaite commented 3 months ago

When running a row validation, the hash for a string in BigQuery differs from the hash of the same string in Teradata if the string has an accented vowel. (I suspect the issue also occurs for any non-ASCII character.)

This causes a failed row validation for strings that are identical.

nj1973 commented 3 months ago

Hi @franklinWhaite, I think I've reproduced this but would like to be sure I am looking at the same problem you are experiencing.

Can you find a sample problem row and validate it using --concat instead of --hash? Do you then get a match?

nj1973 commented 3 months ago

Also noting that we have a broken link in the installation readme:

https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/fef5307289d5aa8c210eef769f38d366535b41db/docs/installation.md?plain=1#L108

franklinWhaite commented 3 months ago

Hi @nj1973,

Thanks for looking into this. I have a similar issue when using --concat: the validation continues to fail.

sundar-mudupalli-work commented 1 month ago

Hi,

The issue is that a character can have multiple binary representations: UTF8 (which is an encoding of Unicode), LATIN, etc. The ASCII representation is nearly universal. Our code worked as long as the data was ASCII, which is a subset of both LATIN and UTF8. Once you go beyond ASCII, DVT does not work correctly.

The fix is to ensure we compute the message digest over the UTF8 representation. That is the representation that BQ uses, and Unicode is a superset of LATIN.
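To illustrate the point, here is a minimal Python sketch rather than DVT's actual code. SHA-256 is used only as a stand-in digest, Python's `latin-1` codec stands in for the LATIN character set, and `digest` and `normalized_digest` are hypothetical helper names. It shows that an ASCII-only string hashes identically under both encodings, an accented string does not, and re-encoding to UTF8 before hashing makes the two sides agree.

```python
import hashlib


def digest(text: str, encoding: str) -> str:
    """Hash the bytes that `text` produces under the given encoding."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()


plain = "cafe"
accented = "caf\u00e9"  # "café"

# ASCII-only text encodes to the same bytes in LATIN and UTF8,
# so the digests match.
assert digest(plain, "latin-1") == digest(plain, "utf-8")

# "é" is the single byte 0xE9 in latin-1 but the two bytes 0xC3 0xA9
# in UTF8, so the digests of the "same" string differ.
assert digest(accented, "latin-1") != digest(accented, "utf-8")


def normalized_digest(raw: bytes, source_encoding: str) -> str:
    """Decode from the source encoding, then hash the UTF8 bytes."""
    return hashlib.sha256(raw.decode(source_encoding).encode("utf-8")).hexdigest()


# Once both sides are re-encoded to UTF8 before hashing, they agree.
latin_side = accented.encode("latin-1")
utf8_side = accented.encode("utf-8")
assert normalized_digest(latin_side, "latin-1") == normalized_digest(utf8_side, "utf-8")
```

In a real validation the conversion would presumably happen in the SQL generated for each engine; the sketch only demonstrates why the byte representation matters.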

When you are doing concat etc., it is more complicated, because the values cannot be displayed in your terminal if the terminal does not accept that representation.
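One way around the display problem, sketched here in Python as an assumption and not as something DVT does, is to compare the raw bytes as hex instead of relying on how the terminal renders the string. `describe` is a hypothetical helper, and `latin-1` again stands in for LATIN.

```python
def describe(text: str, encoding: str) -> str:
    """Render text with ASCII escapes plus its encoded bytes as hex."""
    raw = text.encode(encoding)
    # ascii() escapes non-ASCII characters, so the output is safe to
    # print on any terminal; bytes.hex(" ") needs Python 3.8 or later.
    return f"{ascii(text)} as {encoding}: {raw.hex(' ')}"


value = "caf\u00e9"  # "café"
print(describe(value, "latin-1"))  # 'caf\xe9' as latin-1: 63 61 66 e9
print(describe(value, "utf-8"))    # 'caf\xe9' as utf-8: 63 61 66 c3 a9
```

The hex output makes it visible that the two sides carry different bytes for the same character, even when the terminal shows both as identical or as garbage.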

Sundar Mudupalli