fivetran / dbt_recurly

Fivetran data transformations for Recurly built using dbt.
https://fivetran.github.io/dbt_recurly/
Apache License 2.0

[Bug] Resources exceeded during query execution #8

Closed by benigls 1 year ago

benigls commented 1 year ago

### Is there an existing issue for this?

### Describe the issue

After running all the Recurly models, I got a memory exceeded error on the last model, recurly__account_daily_overview.

Database Error in model recurly__account_daily_overview (models/recurly__account_daily_overview.sql)
07:55:06    Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 134% of limit.
07:55:06    Top memory consumer(s):
07:55:06      sort operations used for analytic OVER() clauses: 99%
07:55:06      other/unattributed: 1%

### Relevant error log or model output

>>> dbt run --select +recurly                       

07:52:41  Running with dbt=1.0.4
07:52:41  Unable to do partial parsing because a project config has changed
07:52:46  Found 203 models, 182 tests, 2 snapshots, 0 analyses, 830 macros, 0 operations, 0 seed files, 185 sources, 0 exposures, 0 metrics
07:52:46  
07:52:48  Concurrency: 2 threads (target='dev')
07:52:48  
07:52:48  1 of 18 START view model dbt_ben.stg_recurly__account_history_tmp............... [RUN]
07:52:48  2 of 18 START view model dbt_ben.stg_recurly__invoice_history_tmp............... [RUN]
07:52:50  2 of 18 OK created view model dbt_ben.stg_recurly__invoice_history_tmp.......... [OK in 1.63s]
07:52:50  3 of 18 START view model dbt_ben.stg_recurly__line_item_history_tmp............. [RUN]
07:52:50  1 of 18 OK created view model dbt_ben.stg_recurly__account_history_tmp.......... [OK in 1.69s]
07:52:50  4 of 18 START view model dbt_ben.stg_recurly__plan_history_tmp.................. [RUN]
07:52:52  4 of 18 OK created view model dbt_ben.stg_recurly__plan_history_tmp............. [OK in 1.80s]
07:52:52  3 of 18 OK created view model dbt_ben.stg_recurly__line_item_history_tmp........ [OK in 1.86s]
07:52:52  5 of 18 START view model dbt_ben.stg_recurly__subscription_history_tmp.......... [RUN]
07:52:52  6 of 18 START view model dbt_ben.stg_recurly__transaction_tmp................... [RUN]
07:52:53  5 of 18 OK created view model dbt_ben.stg_recurly__subscription_history_tmp..... [OK in 1.80s]
07:52:53  7 of 18 START table model dbt_ben.stg_recurly__invoice_history.................. [RUN]
07:52:54  6 of 18 OK created view model dbt_ben.stg_recurly__transaction_tmp.............. [OK in 1.82s]
07:52:54  8 of 18 START table model dbt_ben.stg_recurly__account_history.................. [RUN]
07:52:59  7 of 18 OK created table model dbt_ben.stg_recurly__invoice_history............. [CREATE TABLE (15.0k rows, 2.4 MB processed) in 5.12s]
07:52:59  9 of 18 START table model dbt_ben.stg_recurly__plan_history..................... [RUN]
07:53:01  8 of 18 OK created table model dbt_ben.stg_recurly__account_history............. [CREATE TABLE (123.7k rows, 14.6 MB processed) in 7.00s]
07:53:01  10 of 18 START table model dbt_ben.stg_recurly__line_item_history............... [RUN]
07:53:04  9 of 18 OK created table model dbt_ben.stg_recurly__plan_history................ [CREATE TABLE (210.0 rows, 31.0 KB processed) in 4.91s]
07:53:04  11 of 18 START table model dbt_ben.stg_recurly__subscription_history............ [RUN]
07:53:06  10 of 18 OK created table model dbt_ben.stg_recurly__line_item_history.......... [CREATE TABLE (40.0k rows, 12.1 MB processed) in 5.77s]
07:53:06  12 of 18 START table model dbt_ben.stg_recurly__transaction..................... [RUN]
07:53:08  11 of 18 OK created table model dbt_ben.stg_recurly__subscription_history....... [CREATE TABLE (2.9k rows, 656.4 KB processed) in 4.79s]
07:53:12  12 of 18 OK created table model dbt_ben.stg_recurly__transaction................ [CREATE TABLE (13.4k rows, 6.6 MB processed) in 5.22s]
07:53:12  13 of 18 START table model dbt_ben.recurly__balance_transactions................ [RUN]
07:53:17  13 of 18 OK created table model dbt_ben.recurly__balance_transactions........... [CREATE TABLE (43.2k rows, 11.2 MB processed) in 5.63s]
07:53:17  14 of 18 START table model dbt_ben.recurly__monthly_recurring_revenue........... [RUN]
07:53:17  15 of 18 START table model dbt_ben.recurly__account_overview.................... [RUN]
07:53:23  14 of 18 OK created table model dbt_ben.recurly__monthly_recurring_revenue...... [CREATE TABLE (12.8k rows, 14.1 MB processed) in 5.62s]
07:53:25  15 of 18 OK created table model dbt_ben.recurly__account_overview............... [CREATE TABLE (112.1k rows, 17.2 MB processed) in 7.46s]
07:53:25  16 of 18 START table model dbt_ben.recurly__subscription_overview............... [RUN]
07:53:30  16 of 18 OK created table model dbt_ben.recurly__subscription_overview.......... [CREATE TABLE (1.1k rows, 7.8 MB processed) in 4.80s]
07:53:30  17 of 18 START table model dbt_ben.recurly__churn_analysis...................... [RUN]
07:53:33  17 of 18 OK created table model dbt_ben.recurly__churn_analysis................. [CREATE TABLE (1.1k rows, 166.9 KB processed) in 3.93s]
07:53:34  18 of 18 START table model dbt_ben.recurly__account_daily_overview.............. [RUN]
07:55:06  18 of 18 ERROR creating table model dbt_ben.recurly__account_daily_overview..... [ERROR in 91.67s]
07:55:06  
07:55:06  Finished running 6 view models, 12 table models in 140.07s.
07:55:06  
07:55:06  Completed with 1 error and 0 warnings:
07:55:06  
07:55:06  Database Error in model recurly__account_daily_overview (models/recurly__account_daily_overview.sql)
07:55:06    Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 134% of limit.
07:55:06    Top memory consumer(s):
07:55:06      sort operations used for analytic OVER() clauses: 99%
07:55:06      other/unattributed: 1%
07:55:06    
07:55:06    compiled SQL at target/run/recurly/models/recurly__account_daily_overview.sql
07:55:06  
07:55:06  Done. PASS=17 WARN=0 ERROR=1 SKIP=0 TOTAL=18


### Expected behavior

All models would run successfully

### dbt Project configurations

recurly_database: raw-data
recurly_schema: recurly

### Package versions

- package: fivetran/recurly
  version: 0.1.0

### What database are you using dbt with?

bigquery

### dbt Version

installed version: 1.0.4
   latest version: 1.3.0

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 1.0.0 - Update available!
  Your version of dbt-bigquery is out of date! You can find instructions for upgrading here:
  https://docs.getdbt.com/dbt-cli/install/overview

### Additional Context

_No response_

### Are you willing to open a PR to help address this issue?

- [ ] Yes.
- [ ] Yes, but I will need assistance and will schedule time during our [office hours](https://calendly.com/fivetran-solutions-team/fivetran-solutions-team-office-hours) for guidance
- [ ] No.
dimoschi commented 1 year ago

This is indeed a resource-intensive model: it is built on int_recurly__account_rolling_totals, which in turn is built on int_recurly__transactions_date_spine, and both tables have around 350 million records. In Snowflake the query took a long time to complete, 2,820.44 seconds (roughly 47 minutes).

I'm now comparing it to a similar model in the Stripe package, stripe__daily_overview, to try to find differences and similarities and maybe come up with a more efficient approach.

fivetran-joemarkiewicz commented 1 year ago

Thanks for opening this issue @benigls, and thanks for taking a look to dig in @dimoschi!

This is great for us to know, and I imagine there is a way we can optimize this query to be more performant. I will be sure to dig into this as well and explore a possible solution, which we hope to roll out in our upcoming sprint.

fivetran-avinash commented 1 year ago

Thanks for bringing this issue to our attention @benigls and examining this problem more closely @dimoschi!

It looks like the main culprit might be the multiple analytic OVER() functions in the daily overview model.

I'm wondering whether, if we take those resource-intensive window functions out of the final model, separate them into one or more additional intermediate models, and then join those models back together in the end model, we can reduce the load and get the query to run.

My original thought is that we could bring both of the OVER() CTEs into one intermediate model and see if it can run on its own. If that still has trouble running, we can split them up into multiple CTEs across two intermediate models.

Then the only operations left in the final model are the join to account_overview and adding the passthrough columns.
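
For illustration, a rough sketch of that split, assuming hypothetical column names (account_id, date_day, transaction_amount) and a hypothetical intermediate model name; the real window expressions live in the package's recurly__account_daily_overview model:

```sql
-- models/intermediate/int_recurly__account_daily_rollups.sql (hypothetical)
-- Isolate the analytic OVER() work here so the final model only has to join.
with date_spine as (

    select *
    from {{ ref('int_recurly__transactions_date_spine') }}

), rolling_totals as (

    select
        account_id,
        date_day,
        -- one example running total; the package computes several of these
        sum(transaction_amount) over (
            partition by account_id
            order by date_day
            rows between unbounded preceding and current row
        ) as rolling_transaction_amount
    from date_spine

)

select *
from rolling_totals
```

The final model would then only need to join this intermediate model to recurly__account_overview and add the passthrough columns, keeping the heaviest sorts out of the last step.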

Would either of you be able to take a look and try breaking up that final model logic? Otherwise, we can attempt to tackle this in an upcoming sprint!

dimoschi commented 1 year ago

I'll do my best to look into it by the end of this week, but I can't promise anything.

fivetran-avinash commented 1 year ago

Hello @benigls and @dimoschi!

I've attempted the above solution to try to resolve the query execution issue. Before I send over the proposed PR, I just want to double-check what your specific dbt environment setup looks like. Some of the code we're deploying and testing depends on your dbt version, so we might need to make some modifications depending on which version you're running.

Could you provide the following details:

Thanks for all your help and patience in resolving this issue!

benigls commented 1 year ago

Thank you for taking care of this @fivetran-avinash!

Let me know if you need further info.

fivetran-avinash commented 1 year ago

Thanks @benigls!

I think your versions are up-to-date with what can run on our branch.

Here is the link to the branch with the updated changes: https://github.com/fivetran/dbt_recurly/tree/bug/account-daily-overview-query-resources-exceeded

You can clone the repo, fetch the specific branch above, and check whether these code changes make a difference in resource usage when run against your data.
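
As an alternative to cloning, the branch can also be installed directly through packages.yml by pinning the git revision (standard dbt package syntax) and then running dbt deps:

```yaml
# packages.yml -- temporarily replace the fivetran/recurly hub entry with:
packages:
  - git: "https://github.com/fivetran/dbt_recurly.git"
    revision: bug/account-daily-overview-query-resources-exceeded
```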

Let me know how it goes and if you have any questions!

benigls commented 1 year ago

Hi @fivetran-avinash. I just tested the branch and all the models are running in my environment and data. 🚀 Thank you!

Side note: I noticed that it took 3.5 minutes to run all the models. I wonder, is there a way to add a condition in the base layer to only process accounts with an existing subscription? The reason I ask is that in our case we only care about accounts with a subscription, which are roughly a quarter of the entire account population. I think this is a common pattern for companies with a free-tier plan.

fivetran-joemarkiewicz commented 1 year ago

Hi @benigls, thanks for testing out the branch; it's great to hear that the models are now running with your data!

The majority of our team is on PTO until the new year, but we will be sure to pick this up and plan to integrate it into the next release once we are back.

In the meantime, I think your suggestion of filtering out accounts without active subscriptions is completely reasonable and would add value for all package users. Realistically, we would probably add a variable that lets customers configure whether they want the filter applied. To make sure we can get this fix out, I am going to move the filter request into a new feature request (FR #11), where we will be able to scope it and integrate it into a future sprint. Be sure to follow the linked FR to see when we are able to integrate the filter, and test out the branch once it is available.
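
For illustration only, a variable-gated filter in the base layer might look roughly like the sketch below; the variable name and the account_id column are placeholders, and the actual design will be scoped in the linked feature request:

```sql
-- Hypothetical sketch only: limit downstream models to accounts that have
-- at least one subscription, controlled by an opt-in project variable.
select account_history.*
from {{ ref('stg_recurly__account_history') }} as account_history

{% if var('recurly__subscribed_accounts_only', false) %}
where account_history.account_id in (
    select distinct account_id
    from {{ ref('stg_recurly__subscription_history') }}
)
{% endif %}
```

Enabling it would then just be a matter of setting the variable to true under vars in dbt_project.yml.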

Thanks again for all your help!

fivetran-avinash commented 1 year ago

Closing this issue since the branch has been merged.