Velir / dbt-ga4

dbt package for modeling raw data exported by Google Analytics 4. BigQuery support only.
MIT License

The new items format for custom parameters causes base models to error #281

Closed dgitis closed 7 months ago

dgitis commented 8 months ago

Here's the error that I'm seeing.

12:55:50  Completed with 1 error and 0 warnings:
12:55:50  Database Error in model base_ga4__events (models/staging/ga4/base/base_ga4__events.sql)
12:55:50    Value has type ARRAY<STRUCT<...>> which cannot be inserted into column items, which has type ARRAY<STRUCT<...>> at [171:1194]

And here's the new schema.

[screenshot of the new items schema]

I first noticed this on a client site that hasn't been touched in a few weeks, so the schema change is almost certainly the cause of the error.

cambalzer commented 8 months ago

I also started getting this error at 4AM CT today. Any workarounds?

adamribaudo-velir commented 8 months ago

I haven't tried it myself, but you can override the base_select_source macro by creating a macro in your project called default__base_select_source. In that macro, you could drop the items column.
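
A minimal sketch of that override, assuming the macro takes no arguments and a source named ga4.events (the package's real macro enumerates columns explicitly):

    -- macros/default__base_select_source.sql in your project
    {% macro default__base_select_source() %}
        -- BigQuery's EXCEPT modifier drops the problematic items column
        select * except (items)
        from {{ source('ga4', 'events') }}
    {% endmacro %}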

We're looking into a solution now.

dgitis commented 8 months ago

@cambalzer try running dbt run --select base_ga4__events --full-refresh. That seems to be fixing things for most people now.

I've encountered some edge cases, and solutions, that I'll be documenting momentarily.

cambalzer commented 8 months ago

That worked. Thank you for the quick response!

dgitis commented 8 months ago

Multi-site with +full_refresh: false on your base model

With full refreshes disabled on a multi-site installation (recommended, because multi-site relies on long-running clone jobs that will time out if you refresh them), running dbt run --select base_ga4__events --full-refresh seems to fix the issue even though full refreshes are suppressed.
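
For reference, that setting looks something like this in dbt_project.yml (the path under models: follows the package's model tree; adjust to your version):

    models:
      ga4:
        staging:
          base:
            base_ga4__events:
              +full_refresh: false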

[screenshot of the settings]

Here are the results of running with those settings on the one multi-site installation that I have access to.

[screenshot of the run results]

And here's my re-run of the daily job.

[screenshot of the daily job re-run]

The warning is about null page_location fields.

I don't know why this worked, so if you try this and it doesn't work for you, please enable full refreshes on your base model, pray that you don't time out, and report back here.

You may have to manually edit your models to prevent the clone jobs in the pre-hook from running so that you can reprocess the base model; the clones themselves should have run fine.

Single site with +full_refresh: false on the base model

I also have some high-event-volume sites where I've suppressed full refreshes on the base model.

These sites return the same error as in the initial issue report when running dbt run --select base_ga4__events --full-refresh, but enabling full refreshes and then rerunning that command fixes the error.

dgitis commented 8 months ago

Update: the multi-site installation with +full_refresh: false on the base model errored out again.

To fix, I had to disable the model in the package in the dbt_project.yml file:

    # parent keys shown for context; path follows the package's model tree
    models:
      ga4:
        staging:
          base:
            base_ga4__events:
              +full_refresh: false
              +enabled: false

Then I copied the base model from the package to the project and deleted the pre-hook on line 8.

        pre_hook="{{ ga4.combine_property_data() }}" if var('property_ids', false) else "",

This let me rebuild the base model with a full refresh without cloning all of the old partitions, which would have taken forever.

I then undid those changes and reprocessed yesterday's data to ensure that the clones created in the pre-hook are up to date.

elyobo commented 8 months ago

If you wanted to future-proof this, selecting the specific fields that you want from the items array would ensure that upstream schema changes don't automatically cause breaking changes in incremental builds (at the cost of missing any new fields that turn up, though those could be added in future breaking-change releases).
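
A minimal sketch of that approach in the base model's select (the field list is illustrative, not the package's actual column set):

    select
        event_timestamp,
        -- rebuild items with an explicit, pinned field list so new
        -- upstream sub-fields don't change the column's type
        array(
            select as struct
                item.item_id,
                item.item_name,
                item.price,
                item.quantity
            from unnest(items) as item
        ) as items
    from {{ source('ga4', 'events') }}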

birdieorbust commented 8 months ago

A full refresh seems like a brute-force approach, especially for big accounts where it's costly. Could you utilise dbt's on_schema_change configuration for incremental models? I'm not sure it accounts for nested fields.
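
For reference, that option goes in the model's config block (standard dbt syntax; as dgitis notes below, it doesn't reach into nested structures):

    -- in the model file; append_new_columns adds top-level columns only
    {{ config(
        materialized='incremental',
        on_schema_change='append_new_columns'
    ) }}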

dgitis commented 8 months ago

The full refresh is the fix that people can do right away. As long as you only refresh the base table, the data processing is not too bad.

dbt's on_schema_change does not work with nested structures, so that's no help.

I think we want to do as @elyobo recommends and explicitly select fields in the various arrays. Then, whenever Google adds a new field, users will be expected to add that field using the bq update command once we update the package to support the new columns, possibly reprocessing whichever days they want via static_incremental_days. This seems to me to be the best solution.
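
That bq workflow might look roughly like this (project, dataset, and table names are placeholders):

    # dump the current table schema, hand-edit it to append the new
    # items sub-field, then apply the updated schema in place
    bq show --schema --format=prettyjson my_project:my_dataset.base_ga4__events > schema.json
    bq update my_project:my_dataset.base_ga4__events schema.json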