fivetran / dbt_mixpanel

Fivetran's Mixpanel dbt package
https://fivetran.github.io/dbt_mixpanel/
Apache License 2.0
6 stars 10 forks source link

mixpanel__sessions throws LISTAGG limit error #26

Closed marclipoff-curative closed 1 year ago

marclipoff-curative commented 2 years ago

Is there an existing issue for this?

Describe the issue

When running mixpanel__sessions model, redshift throws the following error:

21:45:36  Database Error in model mixpanel__sessions (models/mixpanel__sessions.sql)
21:45:36    Result size exceeds LISTAGG limit
21:45:36    DETAIL:  
21:45:36      -----------------------------------------------
21:45:36      error:  Result size exceeds LISTAGG limit
21:45:36      code:      8001
21:45:36      context:   LISTAGG limit: 65535
21:45:36      query:     933154
21:45:36      location:  string_ops.cpp:116
21:45:36      process:   query3_246_933154 [pid=3755]
21:45:36      -----------------------------------------------

Relevant error log or model output

No response

Expected behavior

The model should run without error

dbt Project configurations

# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "dwh"
version: "1.0.0"
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: "curative-dwh"

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

log-path:

target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"

on-run-start:
  # TODO: create script that runs on staging to make sure that the _0cp and _prod schemas are in sync with any new landing tables

  # when a new schema is created, fix permissions
  # NEED TO FIGURE OUT PERMISSIONS FOR THIS -->  "{{ update_permissions_for_lnd_schemas() }}"

on-run-end:
  # drop any models not in dbt project. only on staging and dev (not prod and ci; too risky)
  - "{{ drop_orphaned_relations(['base', 'curated', 'aggregated'], ['mixpanel', 'fivetran_log'], False if target.name in ['staging'] else True) }}"
  - "{{ re_data.save_test_history(results) if 'prod' == target.name }}"

vars:
  force_row_limit: -1 # if want to enforce a row limit, like we do for CI env

  mixpanel:
    mixpanel_database: dwh
    mixpanel_schema: lnd_mixpanel
    event_table: "{{ source('mixpanel', 'event') }}{{ limit_clause_by_env(10000) }}"
    date_range_start: "2022-03-30"

  greenhouse_database: dwh
  greenhouse_schema: lnd_greenhouse
  greenhouse_using_prospects: true
  greenhouse_using_eeoc: true
  greenhouse_using_app_history: false
  greenhouse_using_job_office: true
  greenhouse_using_job_department: true

  fivetran_log:
    fivetran_log_database: dwh
    fivetran_log_schema: lnd_fivetran
    fivetran_log_using_transformations: false # this will disable all transformation + trigger_table logic
    fivetran_log_using_triggers: false # this will disable only trigger_table logic
    log: "{{ source('fivetran_log', 'log') }}{{ limit_clause_by_env() }}"

models:
  +bind: false # Materialize all views as late-binding
  +post-hook:
    - "{{ redshift.compress_table(this.schema, this.table, drop_backup=False) if this.type == 'table' }}"
    - "{{ test_late_binding_view() }}"

  re_data:
    +enabled: "{{ 'prod' == target.name }}"
    +schema: base
    internal:
      +schema: base

  redshift:
    +bind: true
    +schema: base

  greenhouse_source:
    +schema: base

  mixpanel:
    +schema: base
    staging:
      +schema: base
      +docs:
        show: False

  fivetran_log:
    +schema: base
    staging:
      +schema: base
      +docs:
        show: False

  dwh:
    +materialized: view
    +re_data_monitored: true

    base:
      +schema: base

      app_public:
        staging:
          +re_data_monitored: false
          +materialized: ephemeral
          +docs:
            show: False

      revcycle:
        staging:
          +materialized: ephemeral
          +re_data_monitored: false

      ups:
        +enabled: False # busted for now
      safegraph:
        +enabled: False # need to setup
      rd:
        +enabled: False # need to setup

      twilio:
        twilio_call:
          +enabled: False
        twilio_role:
          +enabled: False
        twilio_role_permission:
          +enabled: False
        twilio_service:
          +enabled: False

    curated:
      +schema: curated
      core:
        cur_core_safegraph_visits_by_home_cbg:
          +enabled: False # need to setup

    aggregated:
      +schema: aggregated

sources:
  greenhouse_source:
    greenhouse:
      application_history:
        +enabled: False

  #dwh:
  #  landing:
  #    src_quadax:
  #      +re_data_monitored: true
  #    src_app_public:
  #      +re_data_monitored: false

seeds:
  +schema: base

tests:
  fivetran_log:
    +enabled: False

Package versions

  - package: dbt-labs/redshift
    version: 0.6.0

  - package: dbt-labs/dbt_utils
    version: 0.8.0

  - package: dbt-labs/dbt_external_tables
    version: 0.8.0

  - package: dbt-labs/codegen
    version: 0.5.0

  - package: fivetran/fivetran_utils
    version: 0.3.4

  - package: fivetran/fivetran_log
    version: 0.5.2

What database are you using dbt with?

redshift

dbt Version

1.0

Additional Context

No response

Are you willing to open a PR to help address this issue?

fivetran-joemarkiewicz commented 2 years ago

@marclipoff-curative thanks so much for opening this issue. I am just starting to look into the error and noticed the LISTAGG issue is occurring within our mixpanel__sessions model at the following line.

https://github.com/fivetran/dbt_mixpanel/blob/7b05388f184b4b124d757d8c68ec5507db5d5e6e/models/mixpanel__sessions.sql#L115

The output of this event frequencies field will create a JSON object of the frequency of each event_type during this user session. As Redshift has a limit of 65535, it seems your sessions are having more frequency than the allotted amount. Before digging into the solution I wanted to ask a few quick questions:

For arguments sake, I actually ended up quickly removing this field if you wanted to see the package models succeed and you can evaluate the outcome and consider the importance of this field. You can leverage this version of the package by adding the below package configuration to your packages.yml

packages:
  - git: https://github.com/fivetran/dbt_mixpanel.git
    revision: hotfix/frequency-variable
    warn-unpinned: false

Let me know your thoughts!

marclipoff-curative commented 2 years ago

I dont need it at the moment. So I'd be fine excluding it. When running that hotfix, I get:


16:09:04  Running with dbt=1.0.1
16:09:04  [WARNING]: Deprecated functionality
The `source-paths` config has been renamed to `model-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
16:09:04  [WARNING]: Deprecated functionality
The `data-paths` config has been renamed to `seed-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
16:09:05  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- sources.dwh.landing.src_helpscount

16:09:05  Found 443 models, 505 tests, 0 snapshots, 4 analyses, 832 macros, 2 operations, 4 seed files, 633 sources, 0 exposures, 0 metrics
16:09:05
16:09:09  Concurrency: 10 threads (target='prod')
16:09:09
16:09:09  1 of 3 START view model base.stg_mixpanel__event_tmp............................ [RUN]
16:09:12  1 of 3 OK created view model base.stg_mixpanel__event_tmp....................... [CREATE VIEW in 3.27s]
16:09:13  2 of 3 START incremental model base.mixpanel__event............................. [RUN]
16:12:06  2 of 3 OK created incremental model base.mixpanel__event........................ [INSERT 0 2800820 in 172.16s]
16:12:06  3 of 3 START incremental model base.mixpanel__sessions.......................... [RUN]
16:12:06  3 of 3 ERROR creating incremental model base.mixpanel__sessions................. [ERROR in 0.81s]
16:12:07
16:12:07  Running 2 on-run-end hooks
16:12:07  Beginning drop of orphaned relations. dry_run=True
16:12:07  This command would be run: drop view if exists base.redshift_admin_running_queries cascade;
16:12:07  1 of 2 START hook: dwh.on-run-end.0............................................. [RUN]
16:12:07  1 of 2 OK hook: dwh.on-run-end.0................................................ [OK in 0.00s]
16:12:07  2 of 2 START hook: dwh.on-run-end.1............................................. [RUN]
16:12:07  2 of 2 OK hook: dwh.on-run-end.1................................................ [OK in 0.00s]
16:12:07
16:12:07
16:12:07  Finished running 1 view model, 2 incremental models, 2 hooks in 182.13s.
16:12:08
16:12:08  Completed with 1 error and 0 warnings:
16:12:08
16:12:08  Database Error in model mixpanel__sessions (models/mixpanel__sessions.sql)
16:12:08    syntax error at or near "("
16:12:08    LINE 128:     listagg((event_type || ': ' || number_of_events), ', ')
16:12:08                         ^
16:12:08
16:12:08  Done. PASS=2 WARN=0 ERROR=1 SKIP=0 TOTAL=3```
fivetran-joemarkiewicz commented 2 years ago

Sounds good, thanks @marclipoff-curative!

Were you able to run dbt clean before installing the new package? Also, would you be able to run dbt run --full-refresh to make sure we skip over any holdover incremental logic from past runs?

marclipoff-curative commented 2 years ago

I think the problem is the stray ( . right?

fivetran-joemarkiewicz commented 2 years ago

Where do you see the stray (? The error is a bit strange since I completely removed the listagg function from the branch.

marclipoff-curative commented 2 years ago

It compiles to this:

agg_event_types as (

    select
        session_id
        -- turn into json
        -- '{' ||
    listagg((event_type || ': ' || number_of_events), ', ')

 || '}' as event_frequencies

    from (

        select
            session_id,
            event_type,
            count(unique_event_id) as number_of_events

        from session_ids
        group by session_id, event_type

    ) as sub group by session_id
),

I think the fivetran_utils.string_agg macro creates a new line. So when commenting out using a --, the new line doesn't get commented out.

fivetran-joemarkiewicz commented 2 years ago

Can you attempt and re-run dbt deps as I think that may be a previous commit since I can see it is commented out. Below is the current version of that CTE in the hotfix branch:

https://github.com/fivetran/dbt_mixpanel/blob/b28987116981b364196caa109520281f65b075e5/models/mixpanel__sessions.sql#L112

fivetran-joemarkiewicz commented 2 years ago

Actually, I was just messing around and think I found a way for users to leverage this field even if there are a few fields that exceed the limit. See my code below

https://github.com/fivetran/dbt_mixpanel/blob/3b7fcc10ead74c88cccddc2edb890a38c4aee15c/models/mixpanel__sessions.sql#L115-L122

Would you be able to test the new version of the branch as well and let me know if it succeeds for you? If it doesn't, you can try and adjust the new mixpanel__event_frequency_limit variable to be smaller than the default 50,000 I set (50,000 may be too high of a default haha).

vars:
   mixpanel__event_frequency_limit: 1000
fivetran-joemarkiewicz commented 2 years ago

HI @marclipoff-curative I just wanted to reach back out and see if you were able to give the above branch a test?

We are planning to release this early next week and I wanted to make sure it worked on your end before pushing the changes to our next release.

Let me know if you have any questions!

marclipoff-curative commented 2 years ago

I think it worked. My cluster is having a tough time churning on the huge amount of data we have... so cant be 100% yet

fivetran-joemarkiewicz commented 2 years ago

Thanks! I will also limit the default to maybe 1000 to hopefully help with the string agg explosion as well

fivetran-joemarkiewicz commented 1 year ago

Closing this issue as PR #27 addressed the initial issue identified within this bug report.