Velir / dbt-ga4

dbt package for modeling raw data exported by Google Analytics 4. BigQuery support only.
MIT License

Possibility to enable BigQuery configuration such as partition_filter_required #298

Closed: geo909 closed this issue 2 months ago

geo909 commented 4 months ago

Hi and thank you for the great package,

GA4 exports a lot of data, and in my organization it would be very useful to require a partition filter to avoid accidental costs from careless queries.

I can add this configuration in my dbt_project.yml like so:

models:
  ga4:
    base:
      base_ga4__events:
        +require_partition_filter: true

but the problem is that tests won't run anymore:

09:14:38  Completed with 1 error and 0 warnings:
09:14:38  
09:14:38  Database Error in test unique_stg_ga4__events_event_key (models/staging/stg_ga4__events.yml)
09:14:38    Cannot query over table 'bigquery-xxxxx.dbt_qa.base_ga4__events' without a filter over column(s) 'event_date_dt' that can be used for partition elimination

Is this something we can work around? Or, is there any chance we could find a way to integrate bigquery configurations like that in the package?
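For context, the workaround I'm currently considering is dbt's `where` config for tests, which can also be set at the project level in dbt_project.yml. A sketch, assuming the partition column is event_date_dt and a rolling three-day window is acceptable:

tests:
  ga4:
    # Hypothetical: scope all tests from the ga4 package to recent partitions
    # so they satisfy require_partition_filter. The three-day window is an
    # arbitrary choice for illustration.
    +where: "event_date_dt >= date_sub(current_date, interval 3 day)"

This would keep the package's tests intact while limiting how much data they scan, but I'd rather not maintain this by hand if the package can handle it.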

3v-dgudaitis commented 4 months ago

That's really interesting.

I think we could resolve this by overriding the default tests or writing our own custom tests, but the better place to resolve it is in dbt Core, by having their tests add a partition filter whenever require_partition_filter is detected.

The challenge is to detect the partition_by config and create a valid filter based on the data_type, granularity and range.

In this package, we only use date filters with day granularity, so it's simpler. We'd just use static_incremental_days to control the test lookback window, as we do with the models.
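To sketch what I mean (hypothetical, not something the package ships today): a singular test could reuse the static_incremental_days variable to build a partition-pruning filter, for example a uniqueness check over the lookback window:

-- Hypothetical singular test: duplicate event_key values within the
-- static_incremental_days lookback window, so BigQuery can prune partitions.
{% set lookback = var('static_incremental_days', 3) %}

select
    event_key,
    count(*) as occurrences
from {{ ref('stg_ga4__events') }}
where event_date_dt >= date_sub(current_date, interval {{ lookback }} day)
group by 1
having count(*) > 1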

@geo909, are you comfortable bringing this up as an issue in the dbt-core repository? Or would you like me to do it?

I'd like to see what they say before fixing this in the package. It's easier to do here but better if it's done universally.

geo909 commented 4 months ago

@3v-dgudaitis Thank you for the quick reply.

I'm not sure I agree that dbt Core should add a default partition filter to its tests when require_partition_filter is detected. To my understanding, one of the key roles of this setting is that it forces users to take proper care when querying large tables in order to avoid costs. Adding default filters in dbt Core, in tests or otherwise, seems to contradict the very purpose of the setting. Does that make sense to you?

I understand that this is about a custom workaround of mine, but GA4 usually means a lot of data and is a good use case for the require_partition_filter flag, so maybe it's worth having this conversation. At the end of the day, though, I can create custom tests as you suggested.

dgitis commented 4 months ago

First of all, I'm @3v-dgudaitis. I commented from the wrong GitHub account.

The reason that enabling require_partition_filter causes the tests to error is that the tests query all-time data. By making tests respect partitioning, we limit the cost of testing. The problem you are seeing is caused by the interaction between dbt's require_partition_filter setting and dbt's native test suite.

I think it's okay to assume that the older data has already been tested and that we only need to test newly processed data. This has the additional benefit that, should a test failure sneak in that has to be resolved in data collection, you can manually run dbt run on the failed days; then, after those days have been reprocessed and data collection is fixed, the tests go back to passing.
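That recovery workflow might look roughly like this (a sketch only; it assumes static_incremental_days controls the reprocessing window, and the model selection would be adjusted to your project):

# Reprocess the last 5 days of the base model, then re-run the tests.
dbt run --select base_ga4__events --vars '{"static_incremental_days": 5}'
dbt test --select stg_ga4__events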

Another common workflow for me is to discover an error, fix it in data collection, and then write a test to catch it in the future (although it's best to also fix it in the warehouse so your tests will pass). This new test would fail if it ran against all-time data.

I'm curious what @adamribaudo thinks?

geo909 commented 4 months ago

@dgitis Just a thought that came up while reading your comment:

I think it's okay to assume that the older data has been tested and we should only need to test new or processed data.

I believe that testing does not necessarily adhere to this principle. Consider, for instance, a uniqueness test: if some values are unique in yesterday's data, that does not necessarily mean they are unique over a longer period, such as the last year.

That said, the more I think about it, the more I still believe dbt Core should not enforce partition filtering when require_partition_filter is enabled. When this flag is set, the user is forced to make conscious filtering decisions, even when writing tests. But the "user" in this case could be the package developer. In our case, maybe we could add a filter to the tests using the ga4 start_date variable when require_partition_filter is detected.

This is just a thought, I don't know how hard this is to implement and what other consequences it may have in your code.
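Concretely, what I have in mind is something like this (hypothetical; it assumes the ga4 start_date variable is a YYYYMMDD string and the partition column is event_date_dt):

tests:
  ga4:
    # Hypothetical: filter the package's tests from the configured start_date,
    # so they satisfy require_partition_filter without any default in dbt Core.
    +where: "event_date_dt >= parse_date('%Y%m%d', '{{ var('start_date') }}')"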

dgitis commented 4 months ago

The only issue is that if dbt does not enforce partition filtering on tests, then their tests are incompatible with their require_partition_filter. Two of their features don't work together. That seems like their problem.

Admittedly, it's open source software so it's a problem for any user who decides to fix it, but I'm not qualified to fix it because I don't really work much with other warehouses. This is why I want to open an issue in dbt Core where we can pull in experience from non-BQ warehouses and get a wider variety of perspectives on the nature of testing and possible fixes for this incompatibility.

geo909 commented 4 months ago

The only issue is that if dbt does not enforce partition filtering on tests, then their tests are incompatible with their require_partition_filter. Two of their features don't work together. That seems like their problem.

Well, the user is forced to use a "where" clause in their test, and then it will work. When require_partition_filter is true, it forces users (including package developers) to filter, even when designing their tests. Anyway, this is my understanding and I could certainly be on the wrong path. I can open a ticket with dbt and see where the discussion goes. In any case, I'll also just write my own tests, so this is not a big problem.

Thank you again for the effort and the great package.