GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0

row validation: Long text is being cropped to 30 characters for hash validation on SQL Server #990

Closed nj1973 closed 10 months ago

nj1973 commented 11 months ago

When running hash validations, columns are cast to VARCHAR, then concatenated, and finally a hash is produced.
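
For orientation, a conceptual Python sketch of that pipeline (the tool itself generates SQL to do this; the function below is only an analogy, not DVT code):

    import hashlib

    def row_fingerprint(row):
        # Cast each column to a string, concatenate, then hash - the same
        # shape as the generated validation SQL.
        concatenated = "".join(str(col) for col in row)
        return hashlib.sha256(concatenated.encode("utf-8")).hexdigest()

    print(row_fingerprint((1, "abc", 3.14)))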

We've had a customer issue reported for nvarchar(500) and nvarchar(2000) columns. The cast to varchar uses a length of 30 by default and is therefore trimming the text. We should consider casting to varchar(max) instead (or another technique if that works out to be easier).

Varchar docs: https://learn.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver16
Cast docs: https://learn.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql?view=sql-server-ver16

Examples from sqlcmd

Notice the x (at position 31) is cropped:

1> select concat(cast('123456789012345678901234567890x' as varchar),cast('123456789012345678901234567890x' as varchar));
2> go

------------------------------------------------------------
123456789012345678901234567890123456789012345678901234567890

Notice the x (at position 31) is included:

1> select concat(cast('123456789012345678901234567890x' as varchar(max)),cast('123456789012345678901234567890x' as varchar(max)));
2> go

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
123456789012345678901234567890x123456789012345678901234567890x
nehanene15 commented 11 months ago

This will involve updating the following line to reference VARCHAR(max) instead: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/third_party/ibis/ibis_mssql/datatypes.py#L34
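
As a hedged sketch of that kind of change (illustrative names, not the repo's actual code): wherever the MSSQL type mapping translates an Ibis string type into a SQLAlchemy type, returning an unlengthed VARCHAR should make the MSSQL dialect render VARCHAR(max):

    import sqlalchemy as sa
    from sqlalchemy.dialects import mssql

    def ibis_string_to_mssql_varchar(itype=None):
        # length=None is rendered as VARCHAR(max) by the MSSQL dialect
        return sa.VARCHAR(length=None)

    print(ibis_string_to_mssql_varchar().compile(dialect=mssql.dialect()))
    # VARCHAR(max)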

nj1973 commented 11 months ago

I've attempted to reproduce this with these columns:

,   col_nvarchar      nvarchar(max)
,   col_nvarchar_500  nvarchar(500)

And I do not see any CAST to VARCHAR in the row validation SQL:

data-validation -v validate row -sc sqlserver -tc sqlserver -tbls dvt_test.dvt_mssql2pg_types \
--primary-keys=id --hash='id,col_nvarchar_500,col_nvarchar'
...
09/15/2023 10:03:11 AM-INFO: -- ** Source Query ** --
09/15/2023 10:03:11 AM-INFO: WITH t0 AS
(SELECT t6.id AS id, t6.col_varchar AS col_varchar, t6.col_nvarchar AS col_nvarchar, t6.col_nvarchar_500 AS col_nvarchar_500, t6.col_nchar_2 AS col_nchar_2, t6.col_ntext AS col_ntext, t6.col_smallmoney AS col_smallmoney, t6.col_money AS col_money, t6.col_smalldatetime AS col_smalldatetime, t6.col_time AS col_time, t6.col_uuid AS col_uuid, CAST(t6.id AS VARCHAR) AS cast__id, t6.col_nvarchar AS cast__col_nvarchar, t6.col_nvarchar_500 AS cast__col_nvarchar_500
FROM dvt_test.dvt_mssql2pg_types AS t6),
t1 AS
...
divyapandian5 commented 11 months ago

Hi @nj1973, attaching the DDL which has the source and target table definitions. Fields like Description and ShortDescription are where we saw issues. Also attaching the command I used and the source and target queries.

toReplicate.zip

nj1973 commented 11 months ago

I've been able to reproduce the problem, thanks for the DDL. I created this table:

CREATE TABLE [dbo].[tblIssue990](
    [ID] [bigint] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
    [ShortDescription] [nvarchar](1000) NOT NULL,
    [ShortDescription2] [nvarchar](1000)
);

If I validate ShortDescription I get the cast to VARCHAR. If I validate ShortDescription2 I do not.

The presence of the NOT NULL constraint is triggering the problem, presumably because the data type is then prefixed with !. I haven't yet been able to figure out where in the code the cast is included or bypassed based on the data type.

nj1973 commented 11 months ago

Using dbo.tblIssue990 I saved the output to a config file and the contents were identical for ShortDescription vs ShortDescription2, except for column names. Therefore the problem is not in the construction of the Ibis expressions; it must come later in the process.

nj1973 commented 11 months ago

The cast is bypassed for the not null column in this code in ibis_addon/api.py:

def cast(self, target_type: dt.DataType) -> Value:
    """Override ibis.expr.api's cast method.
    This allows for Timestamp-typed columns to be cast to Timestamp, since Ibis interprets some similar but non-equivalent types (eg. DateTime) to Timestamp (GitHub issue #451).
    """
    # validate
    op = ops.Cast(self, to=target_type)

    if op.to == self.type() and not op.to.is_timestamp():
        # noop case if passed type is the same
        return self
...

In the test case both op.to and self.type() are "string" for ShortDescription2, but for ShortDescription self.type() is "!string", which introduces an unnecessary cast.

This makes me think we actually have two problems here:

  1. Nullable and non-nullable columns should not be treated differently just because there's a ! in the data type.
  2. Irrespective of 1, we still need MAX in the cast to VARCHAR because, in theory, we could be casting a numeric value with more than 30 digits to VARCHAR, which would also be truncated.

nj1973 commented 11 months ago

I think @renzokuken was on the right lines with the change on this branch, based on this comment in SQLAlchemy code in sqlalchemy/dialects/mssql/base.py:

MAX on VARCHAR / NVARCHAR
-------------------------

SQL Server supports the special string "MAX" within the
:class:`_types.VARCHAR` and :class:`_types.NVARCHAR` datatypes,
to indicate "maximum length possible".   The dialect currently handles this as
a length of "None" in the base type, rather than supplying a
dialect-specific version of these types, so that a base type
specified such as ``VARCHAR(None)`` can assume "unlengthed" behavior on
more than one backend without using dialect-specific types.

Therefore it is strange that the change is not having the desired effect.

In sqlalchemy/dialects/mssql/base.py class MSTypeCompiler there is a visit_VARCHAR method which uses "max" if a type has no length.

However, our casts appear to go through sqlalchemy/sql/compiler.py class GenericTypeCompiler.visit_VARCHAR which does not support "max".

I've not yet figured out why this is.
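
To illustrate the difference between the two compilers (a small sketch based on the SQLAlchemy behaviour described above; worth verifying against the installed version):

    from sqlalchemy import VARCHAR, cast, literal_column
    from sqlalchemy.dialects import mssql
    from sqlalchemy.engine.default import DefaultDialect

    expr = cast(literal_column("ShortDescription"), VARCHAR())

    # MSTypeCompiler renders an unlengthed VARCHAR as VARCHAR(max):
    print(expr.compile(dialect=mssql.dialect()))
    # CAST(ShortDescription AS VARCHAR(max))

    # GenericTypeCompiler has no notion of "max":
    print(expr.compile(dialect=DefaultDialect()))
    # CAST(ShortDescription AS VARCHAR)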

nj1973 commented 11 months ago

I have added data to my test case and a BigQuery table:

CREATE TABLE [dbo].[tblIssue990](
    [ID] [bigint] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
    [ShortDescription] [nvarchar](1000) NOT NULL,
    [ShortDescription2] [nvarchar](1000)
);
INSERT INTO [dbo].[tblIssue990] ([ShortDescription],[ShortDescription2])
VALUES ('1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890',
'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890');

BigQuery:

create table dvt_test.tblissue990
(id int64,ShortDescription string,ShortDescription2 string);
insert into dvt_test.tblissue990
values (1,'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890',
'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890');

I then ran a row validation and it succeeded, with data > 100 characters:

data-validation -v validate row -sc sqlserver -tc bq -tbls dbo.tblIssue990=dvt_test.tblissue990 --primary-keys=id --hash='ID,ShortDescription'

╒═══════════════════╤═══════════════════╤═════════════════════╤══════════════════════╤══════════════════════════════════════════════════════════════════╤══════════════════════════════════════════════════════════════════╤══════════════════╤═════════════════════╤══════════════════════════════════════╕
│ validation_name   │ validation_type   │ source_table_name   │ source_column_name   │ source_agg_value                                                 │ target_agg_value                                                 │ pct_difference   │ validation_status   │ run_id                               │
╞═══════════════════╪═══════════════════╪═════════════════════╪══════════════════════╪══════════════════════════════════════════════════════════════════╪══════════════════════════════════════════════════════════════════╪══════════════════╪═════════════════════╪══════════════════════════════════════╡
│ hash__all         │ Row               │ dbo.tblIssue990     │ hash__all            │ e25df97cf5142ab3afe61be0e6c14870bccc709499ff8c43888afabba212ed0d │ e25df97cf5142ab3afe61be0e6c14870bccc709499ff8c43888afabba212ed0d │                  │ success             │ 39153d4d-fbe3-4a67-a479-d28fccc6d14c │
╘═══════════════════╧═══════════════════╧═════════════════════╧══════════════════════╧══════════════════════════════════════════════════════════════════╧══════════════════════════════════════════════════════════════════╧══════════════════╧═════════════════════╧══════════════════════════════════════╛

And concat shows the full data returned:

data-validation -v validate row -sc sqlserver -tc bq -tbls dbo.tblIssue990=dvt_test.tblissue990 --primary-keys=id --concat='ID,ShortDescription'

╒═══════════════════╤═══════════════════╤═════════════════════╤══════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════╤══════════════════╤═════════════════════╤══════════════════════════════════════╕
│ validation_name   │ validation_type   │ source_table_name   │ source_column_name   │                                                                                      source_agg_value │                                                                                      target_agg_value │ pct_difference   │ validation_status   │ run_id                               │
╞═══════════════════╪═══════════════════╪═════════════════════╪══════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════╪══════════════════╪═════════════════════╪══════════════════════════════════════╡
│ concat__all       │ Row               │ dbo.tblIssue990     │ concat__all          │ 11234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 │ 11234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 │                  │ success             │ ced960b8-d3d2-4c87-bc08-00a9933b75d0 │
╘═══════════════════╧═══════════════════╧═════════════════════╧══════════════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════════════╧══════════════════╧═════════════════════╧══════════════════════════════════════╛

I do still see incorrect SQL on screen, CAST(t6."ShortDescription" AS VARCHAR), but I think that is a different issue. We need to go back to the beginning and try to re-understand the actual problem.

nj1973 commented 11 months ago

I've added a SQL Server specific cast override which ensures we include the MAX keyword. But seeing as I cannot reproduce the issue myself I've asked the original reporter to test.
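
Purely as an illustration of the shape of such an override (hypothetical code, not the change on the branch), SQLAlchemy's compiler extension allows a dialect-specific rewrite of Cast nodes:

    from sqlalchemy import VARCHAR, cast, column
    from sqlalchemy.dialects import mssql
    from sqlalchemy.ext.compiler import compiles
    from sqlalchemy.sql.expression import Cast

    @compiles(Cast, "mssql")
    def _mssql_cast_with_max(element, compiler, **kw):
        # Render the cast as normal, then widen any unlengthed VARCHAR to
        # VARCHAR(max). A no-op when the dialect already emitted max.
        sql = compiler.visit_cast(element, **kw)
        return sql.replace("AS VARCHAR)", "AS VARCHAR(max))")

    print(cast(column("ShortDescription"), VARCHAR()).compile(dialect=mssql.dialect()))
    # CAST([ShortDescription] AS VARCHAR(max))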

nj1973 commented 11 months ago

As for the different behaviour for nullable vs non-nullable columns, this is caused by this expression in ibis_addon/api.py:

    if op.to == self.type() and not op.to.is_timestamp():
         # noop case if passed type is the same
         return self

self.type() is an Ibis type expression that internally holds a value prefixed with ! for not-null columns, e.g. !string for NOT NULL string columns. op.to is always the nullable variant. Because of this, only nullable columns drop into the no-op branch and bypass the cast to string.
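
A quick illustration with Ibis datatype objects (exact reprs may vary by Ibis version, but this is the behaviour described above):

    import ibis.expr.datatypes as dt

    nullable = dt.String(nullable=True)
    required = dt.String(nullable=False)
    print(nullable)              # string
    print(required)              # !string
    print(nullable == required)  # False, so the no-op branch is skipped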

We could fix this with a simple hack like this:

     """
+    def same_type(from_type, to_type) -> bool:
+    # The data type for non-nullable columns is prefixed with "!", which causes deviations
+    # between nullable and non-nullable columns. The second comparison below caters for this.
+        return bool(
+            from_type == to_type
+            or str(from_type).lstrip("!") == str(to_type).lstrip("!")
+        )
+
     # validate
     op = ops.Cast(self, to=target_type)

-    if op.to == self.type() and not op.to.is_timestamp():
+    if same_type(op.to, self.type()) and not op.to.is_timestamp():
         # noop case if passed type is the same

but this feels a bit of a hack. I wonder if the exclamation mark should be dropped earlier in the process; we only use it for schema validation anyway.

Therefore I propose not to make the above hacky change and would like some input from @nehanene15.

nehanene15 commented 10 months ago

I think op.to would be equivalent to the target_type provided in the function parameters: def cast(self, target_type: dt.DataType) -> Value

So I think the issue lies within our cast to 'string' when we do row hash validation (here: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/data_validation/query_builder/query_builder.py#L378) rather than to '!string' if the original column is required. We assume that the cast will account for checking the same type regardless of nullability.

Due to this, I think your approach makes the most sense.

divyapandian5 commented 10 months ago

Thanks Neha.

Just that we are still seeing rows failing while using concat, and we could not find the differences during manual comparison. It would help if we knew why it's failing on concat, or if we could resolve the hash issue, that would help too. The client will be using the tool on their own during their prod migration; it would help them if this gets resolved, so we could have them use concat or hash.

nj1973 commented 10 months ago

I think validating this data is going to be very challenging.

Looking through the validation failures, there are plenty of multibyte characters to deal with; it looks like this was intentional based on the surrounding text. For example, ID 1125 has a check mark which is being lost on the SQL Server side, presumably when we cast to VARCHAR. ID 1192 has a pumpkin emoji; I can't imagine what that looks like when cast to VARCHAR.
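
Roughly what happens to those characters: VARCHAR holds single-byte code-page data, so an NVARCHAR to VARCHAR cast is lossy for anything outside the collation's code page, analogous to this lossy encode in Python:

    text = "approved \u2713 \U0001F383"  # check mark and pumpkin emoji
    lossy = text.encode("cp1252", errors="replace").decode("cp1252")
    print(lossy)  # approved ? ?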

I can see that we would like to support the full range of Unicode symbols, but to do that we need to take a step back and consider our comparison expressions across all SQL engines, e.g. BigQuery, Spanner, Oracle and MySQL, in addition to SQL Server and PostgreSQL.

I think I need to split this issue into two issues:

  1. Inconsistency between nullable and non-nullable columns. There is a fix for this on this branch.
  2. Full unicode support and testing

There is also a "fix" on this branch for the initially reported issue of the missing MAX keyword when casting to VARCHAR. I'm not sure if that is actually needed or not.

nj1973 commented 10 months ago

Spun off the null vs not-null change to this issue: https://github.com/GoogleCloudPlatform/professional-services-data-validator/issues/1036

nj1973 commented 10 months ago

Abandoning the issue for the time being as it is not clear whether we need the MAX keyword in casts or not. I could not reproduce a problem.