Closed (franklinWhaite closed 1 month ago)
Hi @franklinWhaite, I think I've reproduced this but would like to be sure I am looking at the same problem you are experiencing.
Can you find a sample problem row and validate it using --concat instead of --hash? Do you then get a match?
Also noting that we have a broken link in the installation readme:
Hi @nj1973,
Thanks for looking into this. I have a similar issue when using --concat: the validation continues to fail.
Hi,
The issue is that a character can have multiple binary representations: UTF8 (which is an encoding of Unicode), LATIN, etc. The ASCII representation is nearly universal. Our code worked as long as the data was ASCII, which is a subset of both LATIN and UTF8. Once you go beyond ASCII, DVT does not work correctly.
The fix is to ensure we compute the message digest over the UTF8 representation. That is the representation BQ uses, and Unicode is a superset of LATIN.
With concat etc. it is more complicated, because you cannot display the values in your terminal if the terminal does not accept that representation.
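To make the point above concrete, here is a minimal Python sketch (not DVT code). It assumes SHA-256 as the digest and uses Python's "latin-1" codec as a stand-in for the LATIN character set; both are assumptions for illustration, and the helper names `sha256_hex` and `normalized_digest` are hypothetical. It shows that an ASCII-only string hashes identically under both encodings, an accented string does not, and re-encoding to UTF8 before hashing restores the match.

```python
import hashlib


def sha256_hex(text: str, encoding: str) -> str:
    """Hex SHA-256 digest of `text` after encoding it with `encoding`."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()


def normalized_digest(raw: bytes, source_encoding: str) -> str:
    """Decode `raw` from its source encoding, re-encode as UTF-8, then hash."""
    return hashlib.sha256(raw.decode(source_encoding).encode("utf-8")).hexdigest()


ascii_value = "cafe"
accented_value = "caf\u00e9"  # "cafe" with an accented final e

# ASCII-only text produces the same bytes in both encodings, so the digests match.
ascii_match = sha256_hex(ascii_value, "latin-1") == sha256_hex(ascii_value, "utf-8")

# The accented character is one byte in latin-1 but two bytes in UTF-8,
# so the digests differ even though the strings are identical.
accented_match = sha256_hex(accented_value, "latin-1") == sha256_hex(accented_value, "utf-8")

# Normalizing the latin-1 bytes to UTF-8 before hashing restores the match.
fixed_match = (
    normalized_digest(accented_value.encode("latin-1"), "latin-1")
    == sha256_hex(accented_value, "utf-8")
)

print(ascii_match, accented_match, fixed_match)
```

The design choice is to normalize on the side that is not already UTF8, since BQ hashes the UTF8 bytes and cannot be told to do otherwise.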
Sundar Mudupalli
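For the concat case mentioned above, one way to sidestep terminal display problems is to compare the underlying bytes as hex rather than the rendered characters. The following is a rough sketch under the same assumption that Python's "latin-1" codec approximates LATIN; `byte_view` is a hypothetical helper, not part of DVT.

```python
def byte_view(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as space-separated hex pairs."""
    return text.encode(encoding).hex(" ")


value = "n\u00e9"  # "n" followed by an accented e

latin_bytes = byte_view(value, "latin-1")
utf8_bytes = byte_view(value, "utf-8")

# Hex output is pure ASCII, so it prints the same in any terminal.
print(latin_bytes)
print(utf8_bytes)

# ascii() escapes non-ASCII characters, which also avoids terminal issues.
print(ascii(value))
```

If two concatenated values look identical on screen but their hex views differ, the difference is in the encoding rather than in the data.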
When running a row validation, the hash for a string in BigQuery is different to the hash of the same string in Teradata if the string contains an accented vowel. (I suspect the issue might also occur for any non-ASCII character.)
This causes a failed row validation for strings that are identical.