rwang-lyra opened 2 months ago
Hi @rwang-lyra thank you for bringing this to our attention! I reviewed it, and I am wondering if it would be better to use a different column for clustering, such as the date column _fivetran_synced_date
used in the incremental logic. Would you be able to try out my test branch below if that helps your runtimes? You can install it in place of your normal netsuite package in your packages.yml. You'll also probably need to run a full refresh first before an incremental run since I've changed the cluster by column.
  - git: https://github.com/fivetran/dbt_netsuite.git
    revision: bug/cluster-by-performance
    warn-unpinned: false
The conclusion could very well be that we should remove clustering, but I'd be curious to try this out first.
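For context, the change under test amounts to swapping the column in the model's cluster_by config, roughly like this (a sketch only; the model and column names are taken from the discussion above, and the real config lives in the package source):

```sql
{{
    config(
        materialized='incremental',
        unique_key='transaction_id',
        -- previously cluster_by=['transaction_id']; the test branch
        -- clusters on the lower-cardinality synced-date column instead
        cluster_by=['_fivetran_synced_date']
    )
}}
```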
Thanks @fivetran-catfritz. It doesn't seem these models fall under Snowflake's recommendations for clustering (https://docs.snowflake.com/en/user-guide/tables-clustering-keys#considerations-for-choosing-clustering-for-a-table), and the high cardinality of the column could also make clustering expensive:
In general, if a column (or expression) has higher cardinality, then maintaining clustering on that column is more expensive.
The cost of clustering on a unique key might be more than the benefit of clustering on that key, especially if point lookups are not the primary use case for that table.
If you want to use a column with very high cardinality as a clustering key, Snowflake recommends defining the key as an expression on the column, rather than on the column directly, to reduce the number of distinct values. The expression should preserve the original ordering of the column so that the minimum and maximum values in each partition still enable pruning.
For example, if a fact table has a TIMESTAMP column c_timestamp containing many discrete values (many more than the number of micro-partitions in the table), then a clustering key could be defined on the column by casting the values to dates instead of timestamps (e.g. to_date(c_timestamp)). This would reduce the cardinality to the total number of days, which typically produces much better pruning results.
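The doc's suggestion can be sketched in plain SQL: re-key the table on a date expression rather than the raw timestamp (table and column names here are illustrative):

```sql
-- Cluster on the day, not the exact timestamp: cardinality drops from
-- roughly one value per row to one value per day, while the expression
-- preserves ordering, so min/max partition pruning still works.
alter table fact_events cluster by (to_date(c_timestamp));
```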
For learning and validation, I've just tested with the branch above; the runtime is even longer (40 min) with clustering on the timestamp field.
I wonder what the reasoning behind the clustering was; perhaps it is not needed for Snowflake?
@rwang-lyra Thanks for the additional information! I was reading through the doc you linked, and it suggested that clustering on a date column might be a good strategy.
For example, if a fact table has a TIMESTAMP column c_timestamp containing many discrete values (many more than the number of micro-partitions in the table), then a clustering key could be defined on the column by casting the values to dates instead of timestamps (e.g. to_date(c_timestamp)). This would reduce the cardinality to the total number of days, which typically produces much better pruning results.
In my test branch, fivetran_synced_date is a date and not a timestamp (here is the line where that is set up), so I'm wondering why it is causing worse performance than transaction_id, which would have higher cardinality. Some questions I'm thinking of:
For your dbt runs, are you using a --full-refresh each time, setting the materialization as a table, or do you also allow the incremental strategy we include? Let me know what your thoughts are, but it could be beneficial for us to go over this on a live call. If you'd like to have a meeting, you can schedule some time with us via this link! Thanks again for looking into this issue with us.
Hi @rwang-lyra Just a friendly bump on this—whenever you have a chance to take a look, I’d love to hear your thoughts!
Hi @rwang-lyra I hope you’re doing well. I’m going to mark this ticket as stale for now since it's been a while since your last response. If you are able to revisit this issue, feel free to reply here to re-open the conversation, and we’ll be happy to help!
Is there an existing issue for this?
Describe the issue
We have observed much longer runtimes with "cluster_by = ['transaction_id']" introduced in the v0.13.0 release; the code reference is here: https://github.com/fivetran/dbt_netsuite/pull/117/files#diff-144d46b313d4b1851f6b2a20d16c25a6f41758af76cb008d874c1f61530383f3R7 and it affects 3 major models.
The sorting seems to take about 40% of the runtime, and Snowflake warns that the high cardinality of transaction_id is causing the long runtime.
We have compared locally: on the balance_sheet model, 25 minutes with the current code (clustering enabled) vs 14 minutes with clustering removed.
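One way to compare candidate keys before committing to one is Snowflake's built-in clustering inspection function; a sketch, with illustrative table/column names:

```sql
-- Reports clustering depth and partition overlap for a candidate key
-- without actually re-clustering the table.
select system$clustering_information('analytics.netsuite.balance_sheet', '(transaction_id)');
select system$clustering_information('analytics.netsuite.balance_sheet', '(to_date(_fivetran_synced))');
```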
Relevant error log or model output
No response
Expected behavior
shorter execution time
dbt Project configurations
dbt-core==1.7.14 dbt-snowflake==1.7.5
Package versions
What database are you using dbt with?
snowflake
dbt Version
dbt-core==1.7.14 dbt-snowflake==1.7.5
Additional Context
No response
Are you willing to open a PR to help address this issue?