calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros
https://calogica.github.io/dbt-expectations/
Apache License 2.0
1.06k stars 130 forks source link

Create expect_column_set_to_be_unique_case_insensitive.sql #138

Closed agusfigueroa-htg closed 2 years ago

agusfigueroa-htg commented 2 years ago

Compares if the unique values of a given column remain unique from a case insensitive perspective. Useful when dealing with CamelCase and lowercase uniqueness within a column. Example: the test would fail if "MasterCard" and "mastercard" are in the same column.

I don't know the guidelines of the commits, but I can take care of adding the corresponding links and description to the README.

clausherther commented 2 years ago

Hi @agusfigueroa-htg, thanks for your PR! The use case here makes sense to me. However, I wonder if we could instead modify expect_column_values_to_be_unique to take in an optional boolean case_sensitive parameter that would accomplish the same thing? We don't have strict contribution guidelines (yet), but it'd be great if you could ultimately add to/update the README. Also, if you could, please take a look at the existing macros for formatting guidelines. Thanks!

agusfigueroa-htg commented 2 years ago

Hey @clausherther, thanks for reviewing this! Question on the comment, as I gave that some thought but still I don’t fully see it.

In this test the column values are not unique, I just want the set of unique values to remain unique case insensitive.

Some examples below

Example 1: “iPhone” “iPhone” expect_column_values_to_be_unique fails expect_column_set_to_be_unique_case_insensitive passes

Example 2: “iPhone” “iphone” expect_column_values_to_be_unique passes expect_column_set_to_be_unique_case_insensitive fails

expect_column_set_to_be_unique_case_insensitive doesn’t make much sense in my eyes as a case-sensitive test - because I am already considering the column set of values, which is unique by definition.

Is there any existing test that comes to your mind to integrate this use case? Otherwise, we can leave it standalone. Let me know what you think.

Regarding the next steps, I can adjust whatever is needed, format it as the other macros and add the documentation to the readme.

Thanks again! :)

clausherther commented 2 years ago

@agusfigueroa-htg what I meant was, we could add a new optional parameter to expect_column_values_to_be_unique, case_sensitive=True and based on that parameter value, implement the logic needed to make the uniqueness test case insensitive in the expect_column_values_to_be_unique test, without needing a standalone test. We're trying to avoid too many similar standalone tests. Does that make sense?

agusfigueroa-htg commented 2 years ago

@clausherther That is feasible but would change the idea of the test in itself as it is not analyzing the uniqueness of the column, but analyzing if the set (unique values of the column) remain unique case insensitive.

Our use case, and what I wonder whether it would come in handy for further users, is to apply this to large tables where the column is not necessarily unique but you wanna ensure that there are no duplicates case insensitive. For instance, when picking the OS column of a sessions table I would expect values to show up there repeatedly (e.g. Windows, Android, iOS, Android, iOS, etc). If, for some data problem, we have sessions with different ways of naming "windows" (e.g. Windows, Android, iOS, Android, windows, iOS, etc) I want the test to fail. Values being repeated is expected (which is why expect_column_values_to_be_unique would fail), what this adds on top is the expectation of that set of values remaining unique from a case insensitive perspective.

I pinged you on slack - maybe we can briefly sync. Thanks, Agus

agusfigueroa-htg commented 2 years ago

After discussing with @clausherther, we aligned on the fact that what we were looking for was a test for consistent casing. This means, that if a column has several repeated values it's fine, as long as the casing is consistent.

--Example of expected values for a given column: 

google
google
google
aws
aws
othervalue
--Example of unexpected values for a given column: 

google
Google
google
aws
aws
othervalue

As in this case the column value "google" is spelled using two different cases ("google" in the first place and "Google" in the second one), the casing is inconsistent. This is the intrinsic purpose of this test and the reason why we decided to make it a standalone test. A flag was added so the test can return the faulty column values if needed.

The final PR is here, including formatting and documentation, feel free to review it and get back to me if needed. Thanks

clausherther commented 2 years ago

@agusfigueroa-htg we'll need to add an integration test here please.

clausherther commented 2 years ago

Hi @agusfigueroa-htg! Is this PR ready to be reviewed? Let me know if there's anything I can do to help get this over the finish line this week.

agusfigueroa-htg commented 2 years ago

Ah yes sorry! When I submitted the last commit I forgot to tag you in a comment. Yes, ready to be reviewed @clausherther . Thanks!

agusfigueroa-htg commented 2 years ago

@clausherther I introduced those changes, may you check? thx!