fivetran / dbt_zendesk

Fivetran's Zendesk Support dbt package
https://fivetran.github.io/dbt_zendesk/#!/overview
Apache License 2.0

[Improvement] too many partitions in int_zendesk__field_history_pivot on bigquery #96

Closed: jens-koster closed this issue 2 months ago

jens-koster commented 1 year ago

Is there an existing issue for this?

Describe the issue

I am getting partly the same error as reported in issue https://github.com/fivetran/dbt_zendesk/issues/39, which was closed and fixed by https://github.com/fivetran/dbt_zendesk/pull/47

I think the problem was fixed for the int_zendesk__field_calendar_spine model but not for int_zendesk__field_history_pivot. The fix introduced a config variable, ticket_field_history_timeframe_years, to limit the number of partitions, but that variable is not used in int_zendesk__field_history_pivot. I believe all other partitioned models use the date spine and have their date range limited by it.

I hacked the model and added the config variable to the where condition in the first CTE, and it seems to do the trick. However, I am not 100% sure what side effects this might have on the data.

with field_history as (

    select
        ticket_id,
        field_name,
        valid_ending_at,
        valid_starting_at

        --Only runs if the user passes updater fields through the final ticket field history model
        {% if var('ticket_field_history_updater_columns') %}
        ,
        {{ var('ticket_field_history_updater_columns') | join (", ")}}

        {% endif %}

        -- doing this to figure out what values are actually null and what needs to be backfilled in zendesk__ticket_field_history
        ,case when value is null then 'is_null' else value end as value

    from {{ ref('int_zendesk__field_history_enriched') }}
    where
        cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) >=
            {{ dbt.dateadd('year', - var('ticket_field_history_timeframe_years', 50), dbt.current_timestamp_backcompat() ) }}

    {% if is_incremental() %}
        and cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) >= (select max(date_day) from {{ this }})
    {% endif %}
)

This is the bit I added, patched together from your code in the date spine:

cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) >=
            {{ dbt.dateadd('year', - var('ticket_field_history_timeframe_years', 50), dbt.current_timestamp_backcompat() ) }}

This limits the amount of data we can analyse; to fix this properly, I think the tables should be partitioned by month (but 10 years is good enough for us...).
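For context, some back-of-the-envelope numbers (my own, based on the 4,000-partition cap that shows up in the error log below, not anything from the package):

-- BigQuery caps a table at 4,000 partitions, so:
-- daily partitioning covers roughly 11 years of history,
-- monthly partitioning covers a few centuries.
select
    4000 / 365 as approx_years_with_daily_partitions,   -- ~11 years
    4000 / 12  as approx_years_with_monthly_partitions  -- ~333 years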

Relevant error log or model output

03:30:56  Completed with 1 error and 0 warnings:
03:30:56  
03:30:56  Database Error in model int_zendesk__field_history_pivot (models/ticket_history/int_zendesk__field_history_pivot.sql)
03:30:56    Resources exceeded during query execution: Table datawarehouse-328110:dbt_dwh_zendesk_intermediate.int_zendesk__field_history_pivot will have 4001 partitions when the job finishes, exceeding limit 4000. If partitions were recently expired, it may take some time to be reflected unless explicitly deleted.
03:30:56    compiled Code at target/run/zendesk/models/ticket_history/int_zendesk__field_history_pivot.sql

Expected behavior

I expected the config variable to limit the date range in int_zendesk__field_history_pivot.

dbt Project configurations

At the time of running:

vars:
  zendesk:
    ticket_field_history_timeframe_years: 1

models:
  zendesk:
    +schema: zendesk
    intermediate:
      +schema: zendesk_intermediate
    sla_policy:
      +schema: zendesk_intermediate
    ticket_history:
      +schema: zendesk_intermediate
    utils:
      materialized: table

Materializing the utils models as tables was added during my hacking; without it, BigQuery fails because the query uses too much compute for the small amount of data scanned. The same thing happened in the original issue.

Package versions

What database are you using dbt with?

bigquery

dbt Version

dbt Cloud does not allow running dbt --version, but it's 1.3.
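(Side note, in case it helps anyone else on dbt Cloud: a tiny run-operation macro can print the version, since dbt_version is available in the Jinja context. The macro below is a hypothetical helper, not part of the package.)

-- macros/print_dbt_version.sql (hypothetical helper)
{% macro print_dbt_version() %}
    {{ log("dbt version: " ~ dbt_version, info=true) }}
{% endmacro %}

Invoke it with: dbt run-operation print_dbt_version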

Additional Context

No response

Are you willing to open a PR to help address this issue?

fivetran-joemarkiewicz commented 1 year ago

Hi @jens-koster thanks for opening this issue!

I agree that all the downstream models should pull from the calendar spine, which would reduce the amount of data. I will have to take a deeper look into the models to see if the solution you provided is the right path forward, or if there is another option we should consider to help reduce the amount of data in the downstream models.

That being said, you can adjust the partitioning of these incremental models without needing an update to the package. If you want to change the partitioning, you can take an approach similar to the one shown below:

## Your dbt_project.yml

models:
  zendesk:
    zendesk__ticket_field_history:
      +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
    ticket_history:
      int_zendesk__field_calendar_spine:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
      int_zendesk__field_history_pivot:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}

This may help with the too many partitions error you are encountering. That being said, I would still like to do a deep dive on the package to understand if there is some place we could optimize the models.
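For reference, on BigQuery a month-granularity date partition like the one above compiles to DDL roughly like this (a simplified sketch with a placeholder table name and dummy data, not the package's actual compiled SQL):

create or replace table `my-project.my_dataset.partition_demo`
partition by date_trunc(date_day, month)
as
select date '2023-01-01' as date_day, 1 as ticket_id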

jens-koster commented 1 year ago

Of course! That's exactly what I'll do :-) I think that is the best way to deal with 4,000+ days of data. As for the fix I made, it was a safe way to get it done without affecting anything else. Now that I know the package a bit better, I'd say joining in the calendar spine would be the better solution; then everything is controlled by whatever is in the spine, rather than reproducing the configuration logic in a second place.
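To sketch what I mean (an illustration only, not the package's actual code; model names are taken from this thread), the pivot could take its date range from the spine with something like:

with field_history as (

    select
        ticket_id,
        field_name,
        value,
        valid_starting_at,
        valid_ending_at
    from {{ ref('int_zendesk__field_history_enriched') }}

), calendar as (

    -- the spine is already limited by ticket_field_history_timeframe_years
    select distinct date_day
    from {{ ref('int_zendesk__field_calendar_spine') }}
)

select field_history.*
from field_history
join calendar
    on cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) = calendar.date_day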

Moving to monthly partitioning, I am also adding clustering on the date_day column: +cluster_by: 'date_day'.
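In dbt_project.yml terms that is just the override shared above plus one extra line, for example:

models:
  zendesk:
    ticket_history:
      int_zendesk__field_history_pivot:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
        +cluster_by: 'date_day'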

For anyone using this solution: you might run into the BigQuery limitation of using too much processing for too little data scanned. I found it helped to materialize the calendar spine as a table (and, for good measure, partition it). This problem was also encountered in the original bug report. Here is what I used:

    utils:
      int_zendesk__calendar_spine:
        +materialized: table
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
        +cluster_by: 'date_day'

Thank you so much! Jens

fivetran-joemarkiewicz commented 1 year ago

Thanks for sharing that the adjustment in your dbt_project.yml fixed the immediate issue!

I also agree that this is something we should look to improve within the package so that others don't need to make this specific override. I also think the solution you proposed, leveraging the already existing calendar spine, should help resolve the issue. That being said, I will want to do a bit more investigating to ensure this doesn't have any unforeseen consequences.

For the time being I will mark this as a feature request for my team to look into improving the model. If anyone else comes across a similar "too many partitions" error, you can use the same approach above to resolve it. I will post back here when we are able to validate the suggestion of joining in the calendar spine as a more permanent solution.

fivetran-joemarkiewicz commented 2 months ago

Hi All,

I wanted to post back here and share that we have seen this issue appear a few more times and we're going to explore a more permanent solution. I will be adding this to our upcoming sprint, and will share the plan as we develop it in the coming weeks.

fivetran-joemarkiewicz commented 2 months ago

This update has since been rolled out to the package by default within PR #169! As such, I will close out this improvement.