GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0

Snowflake to BigQuery comparison bugs #1160

Closed · henrihem closed this issue 2 weeks ago

henrihem commented 1 month ago

There are a few small bugs that have been reported by our users.

Bug 1 / needs clarification on how to handle

This case is tricky, since the project's default_time_zone has been set to Europe/Helsinki. That might be the root cause of this issue.

Table schema: dt TIMESTAMP_NTZ(9) in Snowflake; dt DATETIME in BigQuery

When running a concat comparison, "dt" is cast as:

FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', TIMESTAMP(dt), 'UTC') AS `cast__dt`

In BigQuery

to_char("DT", 'YYYY-MM-DD HH24:MI:SS') AS "cast__dt"

In Snowflake.

This creates different values: in BigQuery, "dt" is cast to UTC, resulting in a different value than in Snowflake because default_time_zone is set to something other than UTC.
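The mismatch can be reproduced outside SQL. A minimal Python sketch (assuming an example summer-time value, when Europe/Helsinki is UTC+3): Snowflake's to_char prints the naive wall clock as-is, while BigQuery's TIMESTAMP(dt) interprets the DATETIME in the default time zone before FORMAT_TIMESTAMP renders it in UTC.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# A naive wall-clock value, like TIMESTAMP_NTZ / DATETIME (example value, not real data)
dt = datetime(2024, 6, 1, 12, 0, 0)

# Snowflake side: to_char("DT", 'YYYY-MM-DD HH24:MI:SS') prints the wall clock unchanged
sf = dt.strftime("%Y-%m-%d %H:%M:%S")

# BigQuery side: TIMESTAMP(dt) interprets dt in default_time_zone (Europe/Helsinki),
# then FORMAT_TIMESTAMP(..., 'UTC') shifts it to UTC before formatting
bq = (
    dt.replace(tzinfo=ZoneInfo("Europe/Helsinki"))
    .astimezone(ZoneInfo("UTC"))
    .strftime("%Y-%m-%d %H:%M:%S")
)

print(sf)  # 2024-06-01 12:00:00
print(bq)  # 2024-06-01 09:00:00  (Helsinki is UTC+3 in summer)
```

The two formatted strings differ by the UTC offset, which is exactly the concat mismatch reported above.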

Bug 2

Scandinavian characters coming from Snowflake are not handled properly. There is a description field in Snowflake that can contain free text, for example "Testiä".

In comparisons these result in different strings:

Snowflake:

TESTIĂ„

BigQuery:

TESTIÄ
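One plausible cause (an assumption, not confirmed by the report) is an encoding mix-up rather than the data itself: the string seen on the Snowflake side is exactly what the UTF-8 bytes of "TESTIÄ" look like when decoded as Windows-1250. A minimal sketch:

```python
# Ä is 0xC3 0x84 in UTF-8; decoded as Windows-1250 (cp1250) those two bytes
# become Ă (0xC3) and „ (0x84), producing the reported mojibake.
mangled = "TESTIÄ".encode("utf-8").decode("cp1250")
print(mangled)  # TESTIĂ„
```

If this is what is happening, the bytes are being decoded with the wrong codec somewhere between Snowflake and the comparison step.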
helensilva14 commented 1 month ago

Hi @henrihem!

Could you please share with us the DVT commands or configs you're using for both scenarios? It would really help us to assess the possibilities.

helensilva14 commented 1 month ago

I did some research about Bug 1: maybe if you add a new DVT connection with the additional parameter "connect_args": {"timezone": "UTC"}, or temporarily change your current connection JSON file just for the test (located in <USERNAME>/.config/google-pso-data-validator/<CONNECTION_NAME>.connection.json), you can force the Snowflake connection to read values as UTC. I'm not sure if that would work, but we could give it a shot.

It would look like this:

{
  "source_type": "Snowflake",
  "secret_manager_type": null,
  "secret_manager_project_id": null,
  "user": "XXXXX",
  "password": "XXXXX",
  "account": "XXXXXX",
  "database": "XXXXX",
  "connect_args": {"timezone": "UTC"}
}

Reference: https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-api#label-snowflake-connector-methods-connect:~:text=timezone,session%20time%20zone.
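To avoid hand-editing the JSON, the existing connection file could also be patched programmatically. A sketch with placeholder values (the path and field values below are hypothetical, not a real connection file):

```python
import json

# Placeholder connection config; the real file lives under
# <USERNAME>/.config/google-pso-data-validator/<CONNECTION_NAME>.connection.json
conn = {
    "source_type": "Snowflake",
    "user": "XXXXX",
    "password": "XXXXX",
    "account": "XXXXXX",
    "database": "XXXXX",
}

# Add the session parameter that asks the Snowflake connector to use UTC
conn["connect_args"] = {"timezone": "UTC"}

print(json.dumps(conn, indent=2))
```

In a real setup you would json.load the existing file, add the "connect_args" key, and write it back.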

henrihem commented 1 month ago

Here are the commands:

Bug 1

I added the connect_args, but the situation seems the same. I think it is because the field is in fact TIMESTAMP_NTZ in Snowflake, so the session timezone will not make a difference.

As said: dt is TIMESTAMP_NTZ(9) in Snowflake and DATETIME in BigQuery.

In the example below, the "dt" and "co" columns are concatenated, where "co" is just a string field.

data-validation validate row \
  -sc sf_conn \
  -tc bq_conn \
  -tbls pub.TBL=pub.TBL  \
  --concat dt,co \
  --primary-keys pk \
  -rr \
  -rbs 100

Bug 2

Same command as in previous step.

In this case desc is a STRING field in both Snowflake and BigQuery. This field can contain Scandinavian characters, such as ä and ö.

data-validation validate row \
  -sc sf_conn \
  -tc bq_conn \
  -tbls pub.TBL1=pub.TBL1  \
  --concat desc,co \
  --primary-keys pk \
  -rr \
  -rbs 100
nj1973 commented 3 weeks ago

@henrihem, I believe I have a fix for the time zone issue, thanks for pointing this problem out.

You can review the proposed change on this pull request: https://github.com/GoogleCloudPlatform/professional-services-data-validator/pull/1174