fivetran / dbt_netsuite

Data models for Fivetran's Netsuite connector, built using dbt.
https://fivetran.github.io/dbt_netsuite/
Apache License 2.0

[Bug] Performance degradation by clustering/sorting #127

Open rwang-lyra opened 2 months ago

rwang-lyra commented 2 months ago

Is there an existing issue for this?

Describe the issue

We have observed much longer runtimes with `cluster_by = ['transaction_id']`, introduced in the v0.13.0 release; the code reference is here: https://github.com/fivetran/dbt_netsuite/pull/117/files#diff-144d46b313d4b1851f6b2a20d16c25a6f41758af76cb008d874c1f61530383f3R7, and it applies to 3 major models.

It seems the sorting takes about 40% of the runtime, and Snowflake warns that the high cardinality of transaction_id is causing the long runtime.

We have compared locally: on the balance_sheet model, 25 minutes with the current code (clustering enabled) vs. 14 minutes with clustering removed.
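
For context, the change in question adds a clustering key to the model configs along these lines (a rough sketch, not the package's exact code; only `cluster_by = ['transaction_id']` comes from the linked PR, and the unique key, column list, and ref name are illustrative placeholders):

```sql
-- Rough sketch of the kind of config added in v0.13.0; only
-- cluster_by=['transaction_id'] is taken from the linked PR,
-- the other settings and columns are placeholders.
{{
    config(
        materialized='incremental',
        unique_key='balance_sheet_id',
        cluster_by=['transaction_id']
    )
}}

select
    transaction_id,
    accounting_period_id,
    account_id,
    converted_amount
from {{ ref('netsuite2__transaction_details') }}  -- hypothetical upstream model
```

On Snowflake, dbt implements `cluster_by` by sorting the data on those columns when it builds the table and then setting them as the table's clustering key, which is where the extra sort time shows up.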

Relevant error log or model output

No response

Expected behavior

shorter execution time

dbt Project configurations

dbt-core==1.7.14 dbt-snowflake==1.7.5

Package versions

What database are you using dbt with?

snowflake

dbt Version

dbt-core==1.7.14 dbt-snowflake==1.7.5

Additional Context

No response

Are you willing to open a PR to help address this issue?

fivetran-catfritz commented 2 months ago

Hi @rwang-lyra, thank you for bringing this to our attention! I reviewed it, and I am wondering if it would be better to use a different column for clustering, such as the date column _fivetran_synced_date used in the incremental logic. Would you be able to try out my test branch below to see if it helps your runtimes? You can install it in place of your normal netsuite package in your packages.yml. You'll also likely need to run a full refresh before your next incremental run, since I've changed the cluster-by column.

- git: https://github.com/fivetran/dbt_netsuite.git
  revision: bug/cluster-by-performance
  warn-unpinned: false

The conclusion could very well be that we should remove clustering, but I'd be curious to try this out first.
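
For reference, the test branch swaps the clustering key roughly as follows (a sketch of the shape of the change, not the actual diff; the real code lives in the bug/cluster-by-performance branch above):

```sql
-- Sketch of the proposed change: cluster on the lower-cardinality date column
-- used in the incremental logic instead of transaction_id.
-- Illustrative only; see the bug/cluster-by-performance branch for the real diff.
{{
    config(
        materialized='incremental',
        cluster_by=['_fivetran_synced_date']
    )
}}
```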

rwang-lyra commented 2 months ago

Thanks @fivetran-catfritz. It doesn't seem these models fall under Snowflake's recommendations for when to cluster a table (https://docs.snowflake.com/en/user-guide/tables-clustering-keys#considerations-for-choosing-clustering-for-a-table), and the high cardinality of the column could also make clustering expensive:

> In general, if a column (or expression) has higher cardinality, then maintaining clustering on that column is more expensive.

> The cost of clustering on a unique key might be more than the benefit of clustering on that key, especially if point lookups are not the primary use case for that table.

> If you want to use a column with very high cardinality as a clustering key, Snowflake recommends defining the key as an expression on the column, rather than on the column directly, to reduce the number of distinct values. The expression should preserve the original ordering of the column so that the minimum and maximum values in each partition still enable pruning.

> For example, if a fact table has a TIMESTAMP column c_timestamp containing many discrete values (many more than the number of micro-partitions in the table), then a clustering key could be defined on the column by casting the values to dates instead of timestamps (e.g. to_date(c_timestamp)). This would reduce the cardinality to the total number of days, which typically produces much better pruning results.
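
To make the quoted recommendation concrete, the expression-based clustering key the docs describe would look something like this in raw Snowflake SQL (the table and column names here are hypothetical, purely illustrative):

```sql
-- Hypothetical example of an expression-based clustering key: casting the
-- timestamp to a date drops the cardinality to the number of distinct days,
-- while per-partition min/max values still support pruning.
alter table analytics.facts.fact_events
    cluster by (to_date(c_timestamp));
```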

For learning and validation, I've just tested with the branch above; the runtime is even longer (40 minutes) with clustering on the timestamp field.

I wonder what the reasoning behind the clustering is; perhaps it is not needed for Snowflake?

fivetran-catfritz commented 2 months ago

@rwang-lyra Thanks for the additional information! I was reading through the doc you linked, and it suggested that clustering on a date column might be a good strategy.

> For example, if a fact table has a TIMESTAMP column c_timestamp containing many discrete values (many more than the number of micro-partitions in the table), then a clustering key could be defined on the column by casting the values to dates instead of timestamps (e.g. to_date(c_timestamp)). This would reduce the cardinality to the total number of days, which typically produces much better pruning results.

In my test branch, fivetran_synced_date is a date and not a timestamp (here is the line where that is set up; a sketch of that shape appears after the list below), so I'm wondering why it is causing worse performance than transaction_id, which would have higher cardinality. Some questions I'm thinking of:

  1. When you are doing your dbt runs, are you using a --full-refresh each time or setting the materialization as a table, or do you also allow the incremental strategy we include?
  2. Are the performance issues occurring during a full-refresh/table run and/or during an incremental run? While the performance may be slower on the first run, the hope is that the incremental runs are improved since that clustering only has to be established once.
  3. We've had other customers request clustering for Snowflake in other packages, for example in [this issue](https://github.com/fivetran/dbt_mixpanel/issues/37) for our mixpanel package, which is why we included it in this release. But in the case of mixpanel, which is events rather than transactions, I realize it could be a different volume of data we're dealing with, so I'm curious: approximately how large is your transactions table?
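
(For readers: the date setup referenced above is roughly of the following shape. This is a sketch based on the standard Fivetran `_fivetran_synced` timestamp, not the package's exact code, and the staging model name is hypothetical.)

```sql
-- Sketch: deriving a date-grain sync column from the Fivetran-provided
-- _fivetran_synced timestamp, so clustering operates at day granularity.
select
    *,
    cast(_fivetran_synced as date) as _fivetran_synced_date
from {{ ref('stg_netsuite2__transaction_lines') }}  -- hypothetical staging model
```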

Let me know what your thoughts are, but it could be beneficial for us to go over this on a live call. If you'd like to have a meeting, you can schedule some time with us via this link! Thanks again for looking into this issue with us.

fivetran-catfritz commented 1 month ago

Hi @rwang-lyra Just a friendly bump on this—whenever you have a chance to take a look, I’d love to hear your thoughts!

fivetran-catfritz commented 1 month ago

Hi @rwang-lyra I hope you’re doing well. I’m going to mark this ticket as stale for now since it's been a while since your last response. If you are able to revisit this issue, feel free to reply here to re-open the conversation, and we’ll be happy to help!