Hi @jens-koster thanks for opening this issue!
I agree that all the downstream models should pull from the calendar spine, which would reduce the amount of data. I will have to take a deeper look into the models to see whether the solution you provided is the right path forward, or whether there is another option we should consider to help reduce the amount of data in the downstream models.
That being said, you can adjust the partitioning of these incremental models without needing an update to the package. If you want to change the partitioning, you can take an approach similar to the one shown below:
## Your dbt_project.yml
models:
  zendesk:
    zendesk__ticket_field_history:
      +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
    ticket_history:
      int_zendesk__field_calendar_spine:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
      int_zendesk__field_history_pivot:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
This may help with the "too many partitions" error you are encountering. That said, I would still like to do a deep dive on the package to understand whether there is somewhere we could optimize the models.
Of course! That's exactly what I'll do :-) I think that is the best way to deal with 4000+ days of data. Regarding the fix I did, it was a safe way to get it done without affecting anything else. Now that I know the package a bit better, I'd say joining the calendar spine would be the better solution; then everything is controlled by whatever is in the spine, rather than reproducing the configuration logic in a second place.
Since I'm going to monthly partitioning, I am also adding clustering on the date_day column:
+cluster_by: 'date_day'
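For reference, a sketch of where that clustering key could sit in dbt_project.yml, alongside the monthly partitioning config from earlier in the thread (model paths are the ones used above; adjust them to your project):

```yaml
models:
  zendesk:
    ticket_history:
      int_zendesk__field_calendar_spine:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
        +cluster_by: 'date_day'
      int_zendesk__field_history_pivot:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
        +cluster_by: 'date_day'
```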
For anyone using this solution: you might run into the BigQuery limitation of using too much processing for too little data scanned. I found it helped to materialize the calendar spine as a table (and, for good measure, partition it). This problem was also encountered in the original bug report.
utils:
  int_zendesk__calendar_spine:
    +materialized: table
    +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
    +cluster_by: 'date_day'
Thank you so much! Jens
Thanks for sharing that the adjustment in your dbt_project.yml fixed the immediate issue!
I also agree that this is something we should look to improve within the package so that others don't need to make this specific override. I think the solution you proposed, leveraging the already existing calendar spine, should help resolve the issue. That being said, I will want to do a bit more investigating to ensure this doesn't have any unforeseen consequences.
For the time being I will mark this as a feature request for my team to look into improving the model. If anyone else comes across a similar "too many partitions" error, you can use the same approach above to resolve it. I will post back here when we are able to validate the suggestion of joining in the calendar spine for a more permanent solution.
Hi All,
I wanted to post back here and share that we have seen this issue appear a few more times and we're going to explore a more permanent solution. I will be adding this to our upcoming sprint, and will share the plan as we develop it in the coming weeks.
This update has since been rolled out to the package by default within PR #169! As such, I will close out this improvement.
Is there an existing issue for this?
Describe the issue
I am getting partly the same error as reported in issue https://github.com/fivetran/dbt_zendesk/issues/39, which was closed with the fix merged in https://github.com/fivetran/dbt_zendesk/pull/47.
I think the problem was fixed for the int_zendesk__field_calendar_spine model but not for int_zendesk__field_history_pivot. The fix introduced a config variable, ticket_field_history_timeframe_years, to limit the number of partitions, but that variable is not used in int_zendesk__field_history_pivot. I believe all other partitioned models use the date spine and get their date range limited by that.
I hacked the model and added the config variable to the where condition in the first CTE; it seems to do the trick. However, I am not 100% sure what side effects this might have on the data.
This is where I added the bit, patched together from your code in the date spine.
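The original snippet isn't reproduced above, but a rough sketch of that kind of filter, assuming the ticket_field_history_timeframe_years variable from the earlier fix and hypothetical model, CTE, and column names, might look like this:

```sql
-- Hypothetical sketch only, not the exact patch from this issue.
-- Idea: limit the first CTE of int_zendesk__field_history_pivot to the same
-- timeframe the calendar spine uses, via the ticket_field_history_timeframe_years var.
with field_history as (

    select *
    from {{ ref('stg_zendesk__ticket_field_history') }}  -- assumed upstream model name
    -- BigQuery syntax; drops rows older than the configured number of years (50 assumed as default)
    where cast(valid_starting_at as date)  -- assumed column name
        >= date_sub(current_date(), interval {{ var('ticket_field_history_timeframe_years', 50) }} year)

)

select *
from field_history
```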
This limits the amount of data we can analyse; to fix this properly I think the tables should be partitioned by month (but 10 years is good enough for us...).
Relevant error log or model output
Expected behavior
I expected the config variable to limit the date range in int_zendesk__field_history_pivot.
dbt Project configurations
At the time of running:
Materializing utils as a table was added during my hacking; BigQuery fails on too much computing for too little data unless you do this. The same thing happened in the original issue.
Package versions
What database are you using dbt with?
bigquery
dbt Version
dbt Cloud does not allow running dbt --version, but it's 1.3.
Additional Context
No response
Are you willing to open a PR to help address this issue?