dbt-msft / tsql-utils

dbt-utils for the dbt-msft family of packages

sqlserver__type_string is too short #25

Open infused-kim opened 3 years ago

infused-kim commented 3 years ago

Hi Guys,

First of all, thank you for creating this package. Your efforts are much appreciated.

I noticed the type_string adapter looks like this:

{%- macro sqlserver__type_string() -%}
    VARCHAR(900)
{%- endmacro -%}

Is there anything that speaks against using VARCHAR(8000)? By limiting it to 900, there could be instances where strings are truncated.

That's particularly bad since type_string is used in surrogate_key. For example, I use hashes of URLs as keys, and this would lead to duplicate surrogate keys because some URLs are longer than 900 characters.
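
To make the failure mode concrete, here is a small self-contained T-SQL illustration (the URLs are made up, and hashbytes stands in for whatever hashing surrogate_key ultimately does):

declare @url1 varchar(8000) = replicate('a', 950) + '/page-1';
declare @url2 varchar(8000) = replicate('a', 950) + '/page-2';

-- Both hashes come out identical: the part of the URL that differs sits
-- beyond character 900 and is silently dropped by the cast.
select
    hashbytes('md5', cast(@url1 as varchar(900))) as key_1,
    hashbytes('md5', cast(@url2 as varchar(900))) as key_2;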

infused-kim commented 3 years ago

After looking through the closed PRs I realized that this has already been discussed in https://github.com/dbt-msft/tsql-utils/pull/14 and https://github.com/dbt-msft/tsql-utils/pull/15.

varchar(900) was picked because this is the maximum key size that allows indexing, which does make sense to me.

That being said, I think using 8000 would still be the better choice, because this macro is used by surrogate_key() and is not called directly. And if duplicate surrogate keys get generated because long strings were cut off, that would lead to a very hard-to-debug condition.

If someone needed indexing on a column produced by a macro that uses type_string, all they would have to do is recast the column. But if someone is using columns with values longer than 900 characters right now, there is no good solution except duplicating the surrogate_key macro and all of its dependencies.
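
For context, the 900-byte limit is SQL Server's maximum key size for a clustered index (nonclustered indexes allow 1700 bytes on recent versions). A quick illustration with a hypothetical table:

-- Creating the index on a varchar(8000) column only raises a warning...
create table dbo.sk_demo (sk varchar(8000) not null);
create unique clustered index ix_sk_demo on dbo.sk_demo (sk);
-- ...but inserting a value whose key exceeds 900 bytes then fails at runtime.
insert into dbo.sk_demo (sk) values (replicate('a', 1000));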

Alternatively, perhaps it is possible to allow users to create their own macro versions that override the sqlserver__* macros?

Is there a way to add something like dbt_utils_dispatch_list: ['self', 'tsql_utils']?

alittlesliceoftom commented 3 years ago

Hi Kim,

Welcome to the T-SQL utils project, lovely to have you here, and some wonderfully written issues! Our ideas around the string decision are not set in stone, merely the best first compromise, so it's great to have your ideas coming in!

Currently this project is set up so that you install dbt_utils and it dispatches to tsql_utils - i.e. as in the installation instructions: https://github.com/dbt-msft/tsql-utils#installation-instructions

Tell dbt-utils to also look for the tsql-utils macros by adding this section to your dbt_project.yml

vars:
  dbt_utils_dispatch_list: ['tsql_utils']

So your question is, would doing

dbt_utils_dispatch_list: ['self', 'tsql_utils']

basically grant the local macros precedence over the tsql ones?

That certainly seems a reasonable approach, but I'm not sure what the right syntax for it is. The docs imply it does a lookup against the listed packages in order (https://docs.getdbt.com/reference/dbt-jinja-functions/adapter/#dispatch), but perhaps it's the name you've given your project rather than a Python 'self'. It's also possible it just gets the right one automatically. I usually look to @jtcohen6 for a steer on that sort of thing.

What we were looking for in the string macro was a sensible compromise that covered as many use cases as possible, with as little code change as possible and without undue performance dips. There are likely a number of ways to solve this for everyone, so let me start the list with my first idea:

  1. Create a tsql version of surrogate_key that accepts an arg for max string length and then does its type casting accordingly, defaulting to the standard string type where not provided. This might be easiest to do with something like (excuse pseudocode; a rough Jinja sketch follows after the list):
    if varchar_length is not provided:
        dispatch dbt_utils.surrogate_key([field1, field2, ...])
    else:
        dispatch tsql_utils.surrogate_key_long_string([field1, field2, ...], varchar_length)

    And then we can write surrogate_key_long_string once, and should improvements to the dbt_utils version happen later, they'll still pass through by default - that is, we'll only be passing people to our unique code when it's the only way to solve their use case.
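
Here is a rough Jinja sketch of that idea; the varchar_length argument and surrogate_key_long_string are hypothetical names, not macros that exist in tsql_utils today, and it assumes dbt_utils exposes default__surrogate_key:

{%- macro sqlserver__surrogate_key(field_list, varchar_length=none) -%}
    {%- if varchar_length is none -%}
        {#- no length given: fall back to the stock dbt_utils implementation -#}
        {{ dbt_utils.default__surrogate_key(field_list) }}
    {%- else -%}
        {#- long values: use a tsql-specific variant that casts to the requested length -#}
        {{ tsql_utils.surrogate_key_long_string(field_list, varchar_length) }}
    {%- endif -%}
{%- endmacro -%}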

How does that sound?

dataders commented 3 years ago

thanks @infused-kim for opening the issue! it makes all the work worthwhile knowing that people are not only using it, but contributing ideas for how to make it better!

Short-term, the easiest way to override a macro from a package is to just re-implement it with the exact same name in your project's macros/ dir. Basically, the behavior @alittlesliceoftom is proposing already happens with project-level macros. @infused-kim can you try this out to see if it works for you?

{%- macro sqlserver__type_string() -%}
VARCHAR(8000)
{%- endmacro -%}

Medium- and long-term we have a lot of options, from:

  1. changing sqlserver__type_string
  2. removing surrogate_key's dependence on sqlserver__type_string
  3. creating a new macro, surrogate_key_long_string, that does what we want.

infused-kim commented 3 years ago

Thanks for the responses, guys. Sorry for not getting back to you sooner. I didn't have time to update to the latest dbt version and try tsql-utils until today.

The workaround

Let me start with how I solved the problem. You guys pointed me in the right direction of creating my own adapter implementation. Thank you for that.

It turns out dbt-utils doesn't dispatch to the project by default; you do have to add it to dbt_utils_dispatch_list, and it has to be listed before tsql_utils.

In case someone else wants to change the behavior, here is how you can do it:

In dbt_project.yml, set:

vars:
  dbt_utils_dispatch_list: ['your_dbt_project_name', 'tsql_utils']

Since your project is listed first, dbt-utils will look for adapter macros in your project first and only look in tsql_utils if it can't find one there. This effectively allows you to override the default tsql-utils behavior.

To override an adapter, simply create a macros/sqlserver_adaptors.sql:

{%- macro sqlserver__type_string() -%} varchar(8000) {%- endmacro -%}

Long-term solution

Long term, I think the best way to solve it would be to create a tsql_utils.surrogate_key() function that takes care of this problem, as you suggested.

Recently, I actually ran into a similar problem. I loaded data from Google Analytics and suddenly kept getting rows with duplicate surrogate keys.

It turned out that GA was returning data as Unicode strings. But since the data was cast into varchar by surrogate_key(), special characters were converted into their closest non-Unicode equivalent, such as è into e, and the key was generated on that.

And this led to duplicate keys if there was another row that used the normal e in the same dimension.

My solution was to create a new surrogate_key function that used nvarchar(4000).
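
Something along these lines (a hedged sketch of that workaround; the macro name is made up and the hashing just mirrors what dbt_utils.surrogate_key does conceptually):

{%- macro surrogate_key_nvarchar(field_list) -%}
    {#- cast every field to nvarchar(4000) so Unicode characters survive,
        then md5-hash the '-'-joined result -#}
    {%- set fields = [] -%}
    {%- for field in field_list -%}
        {%- do fields.append("coalesce(cast(" ~ field ~ " as nvarchar(4000)), '')") -%}
    {%- endfor -%}
    convert(varchar(32), hashbytes('md5', concat({{ fields | join(", '-', ") }}, '')), 2)
{%- endmacro -%}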

I'm not super familiar with modern data warehouses like BigQuery, but it seems like they don't have this issue. I suspect there is only one string type that covers both Unicode and non-Unicode and has no length limit.

For that reason, I think it would make sense to introduce a tsql_utils.surrogate_key(value_list, column_type='varchar(900)') function.

This way we have a function that takes into account the unique needs of T-SQL, while at the same time we maintain the macros that make dbt_utils.surrogate_key work, so that existing code bases and new users who learned from the default dbt documentation can still use it.

If they run into issues, or need a different column size or type, they can use the function from tsql_utils.

In addition, I would suggest changing the type_string used by the standard dbt_utils.surrogate_key() to varchar(8000), as that would reduce the chance of unexpected duplicates.

The default of tsql_utils.surrogate_key could continue to be varchar(900), since it would be easy to adjust for those who need a longer size or an nvarchar type.
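
As a usage sketch (model and column names are invented, and the signature is only the proposal above):

-- default behaviour, equivalent to today's varchar(900)
select {{ tsql_utils.surrogate_key(['url', 'visit_date']) }} as page_key
from {{ ref('stg_ga_pages') }}

-- opt in to a wider, Unicode-aware key column when needed
select {{ tsql_utils.surrogate_key(['url', 'visit_date'], column_type='nvarchar(4000)') }} as page_key
from {{ ref('stg_ga_pages') }}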

What do you think of this?