Closed · geo909 closed this issue 2 months ago
That's really interesting.
I think we could resolve this by overriding the default tests or writing our own custom tests, but I think the better place to resolve this is in dbt Core, by getting their tests to add a partition filter when `require_partition_filter` is detected.
The challenge is to detect the `partition_by` config and create a valid filter based on the `data_type`, `granularity`, and `range`.
In this package, we only use date filters with day granularity, so it's simpler. We'd just use `static_incremental_days` to control the test lookback window, as we do with the models.
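Concretely, such a test-scoped lookback filter might look like the sketch below. This is illustrative only: the partition column name `event_date_dt` and the default of 3 days are assumptions, not what the package necessarily emits.

```sql
-- Sketch of a test-scoped partition filter. `static_incremental_days`
-- is the package variable controlling the lookback window; the column
-- name `event_date_dt` and the default of 3 are assumptions.
where event_date_dt >= date_sub(current_date, interval {{ var("static_incremental_days", 3) }} day)
```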
@geo909, are you comfortable bringing this up as an issue in the dbt-core repository? Or would you like me to do it?
I'd like to see what they say before fixing this in the package. It's easier to do here but better if it's done universally.
@3v-dgudaitis Thank you for the quick reply.
I'm not sure whether I agree that dbt Core should add a default partition filter to their tests when `require_partition_filter` is detected. To my understanding, one of the key roles of this setting is to force users to take proper care when querying large tables in order to avoid costs. I feel that adding default filters in dbt Core, in tests or otherwise, contradicts the very purpose of this setting. Does that make sense to you?
I understand that this is about a custom workaround of mine, but GA4 is usually a lot of data and a good use case for the `require_partition_filter` flag, so maybe it's worth having this conversation. But, at the end of the day, I can create custom tests, as you suggested.
First of all, I'm @3v-dgudaitis. I commented from the wrong GitHub account.
The reason that enabling `require_partition_filter` causes the tests to error is that the tests query all-time data. By making tests respect partitioning, we limit the cost of testing. The problem you are seeing is caused by the interaction between `require_partition_filter` and dbt's native test suite.
I think it's okay to assume that the older data has been tested and that we should only need to test new or processed data. This has the additional benefit that, should a test failure sneak in that has to be resolved in data collection, you can manually run `dbt run` on the failed days; then, after those days have been processed and data collection is fixed, the tests go back to passing.
Another common workflow for me is to discover an error, fix it in data collection, and then write a test to catch it in the future (although it's best to also fix it in the warehouse so your tests will pass). This new test would fail if it ran against all-time data.
I'm curious what @adamribaudo thinks?
@dgitis Just a thought that came up while reading your comment:

> I think it's okay to assume that the older data has been tested and we should only need to test new or processed data.
I believe that testing does not necessarily adhere to this principle. Consider, for instance, a uniqueness test: if some values are unique in yesterday's data, that does not necessarily mean they are unique over a longer period of time, such as the last year.
That said, and the more I think about it, I still don't believe dbt Core should enforce partition filtering when `require_partition_filter` is enabled. When this flag is set, the user is forced to make conscious filtering decisions, even when writing tests. But the "user" in this case could be the package developer. In our case, maybe we could add a filter in the tests using the `ga4.start_date` variable when `require_partition_filter` is detected.
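One way this could look (a sketch only; the model name, column name, and date format are illustrative, and this assumes dbt's `where` config on generic tests renders Jinja) is to scope the built-in tests with a `where` filter driven by the start-date variable:

```yaml
# schema.yml (sketch) - scope a built-in test to the partitions
# covered by the package's start_date variable, so the test query
# satisfies require_partition_filter. Names here are illustrative.
models:
  - name: base_ga4__events
    columns:
      - name: event_date_dt
        tests:
          - not_null:
              config:
                where: "event_date_dt >= '{{ var('start_date') }}'"
```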
This is just a thought, I don't know how hard this is to implement and what other consequences it may have in your code.
The only issue is that if dbt does not enforce partition filtering on tests, then their tests are incompatible with their `require_partition_filter`. Two of their features don't work together. That seems like their problem.
Admittedly, it's open source software so it's a problem for any user who decides to fix it, but I'm not qualified to fix it because I don't really work much with other warehouses. This is why I want to open an issue in dbt Core where we can pull in experience from non-BQ warehouses and get a wider variety of perspectives on the nature of testing and possible fixes for this incompatibility.
> The only issue is that if dbt does not enforce partition filtering on tests, then their tests are incompatible with their require_partition_filter. Two of their features don't work together. That seems like their problem.
Well, the user is forced to use a `where` clause in their test, and then it will work. When `require_partition_filter` is true, it forces users (even package developers) to filter, even when designing their tests. Anyway, this is my understanding, and I could certainly be on the wrong path. I can open a ticket with dbt and see where the discussion goes there. In any case, I'll also just make my own tests, so this is not a big problem.
Thank you again for the effort and the great package.
Hi, and thank you for the great package.
GA4 involves a lot of data and, in my organization, it would be a really good idea to require a partition filter in order to avoid accidental costs from careless queries.
I can add this configuration in my `dbt_project.yml`, but the problem is that the tests won't run anymore.
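For reference, the configuration in question looks roughly like this (a sketch, assuming a date-partitioned incremental model; the project name and partition field are illustrative, not taken from this package):

```yaml
# dbt_project.yml (sketch) - have BigQuery reject any query against
# these partitioned models that lacks a partition filter.
# Project and field names are illustrative.
models:
  my_project:
    +partition_by:
      field: event_date_dt
      data_type: date
      granularity: day
    +require_partition_filter: true
```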
Is this something we can work around? Or, is there any chance we could find a way to integrate bigquery configurations like that in the package?