GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0
400 stars 114 forks source link

Row Hash validation of TD to BQ with non latin characters fails #1225

Closed sundar-mudupalli-work closed 1 month ago

sundar-mudupalli-work commented 1 month ago

If a Teradata column contains a string which has Unicode Characters beyond the Latin character set, DVT fails with the following error:

 [Version 20.0.0.12] [Session 8285] [Teradata Database] [Error 6706] The string contains an untranslatable character.

If the row with the offending character is excluded from the validation, the validation succeeds.

The problem is with this line

 return f"rtrim(hash_sha256(TransUnicodeToUTF8(TRANSLATE({arg} using latin_to_unicode WITH ERROR))))") 

This line assumes that all characters can be converted into Latin - which is then converted to Unicode and UTF8. When a non latin character is presented Teradata throws the error. One solution is to have the user provide an option to DVT --beyond-latin or -byl which would indicate that table columns are Unicode and/or some table columns have characters beyond the latin character set. In that situation, the backend can generate the following SQL for Teradata

return f"rtrim(hash_sha256(TransUnicodeToUTF8({arg})))")

If we did the above when there are columns with latin encoding and latin character set, the validation will fail.

Sundar Mudupalli