Closed anouar-zh closed 1 year ago
Hi, thanks for your report!
It's odd that datediff has a different parameter order in Databricks; that's perhaps something that needs to be reported to them. We'll look into that (we can fix it on our end by using named parameters, though).
Generally, millisecond is ill-advised as an iteration period for most use cases, as there will potentially be thousands and thousands of iterations for dbt to perform.
This will perform badly. How many iterations would there be for dbt to do if this were to work?
Either way, you're right that this shouldn't produce an error like that, of course, so we'd like to have a solution for it.
In terms of iterations using milliseconds, I think a maximum of 20-50 iterations. Is this issue a candidate for a minor fix/release?
Yes, of course. This will be worked on after the holiday. Thank you for your report!
Hi. We've done some investigation on this, and essentially it's due to the range() Jinja function having a maximum of 100,000 iterations. Though this is a technical limit, it does highlight the issue I previously described around millisecond periods being ill-advised.
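For context, here is a minimal sketch of why a millisecond period hits that limit. It is not the actual dbtvault macro, just an illustration of a one-insert-per-period loop over roughly one hour of data:

```sql
{# Minimal sketch, not the actual dbtvault macro: the materialisation runs one
   insert per period, so even one hour of data at millisecond granularity means
   3,600,000 periods, far beyond the Jinja sandbox's MAX_RANGE of 100,000. #}
{% set period_count = 60 * 60 * 1000 %}  {# one hour expressed in milliseconds #}
{% for period in range(period_count) %}  {# fails with "Range too big. The sandbox blocks ranges larger than MAX_RANGE (100000)." #}
    {# ... insert the slice of rows belonging to this period ... #}
{% endfor %}
```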
Due to this, it's something we won't be fixing; instead, we will provide some error handling and a friendly message, such as:
"Max iterations is 100,000. Consider using a different time period (e.g. day). vault_insert_by materialisations are not intended for this purpose, please see -link to docs-"
We do need some more guidance on this in the docs, but essentially the vault_insert_by
materialisations are not intended for loading large amounts of data, such as a first-time load (base load), and should not be used for highly granular, millisecond-difference data. There are other approaches for loading, including those in the DV 2.0 standards.
Our aim for a while now has been to implement waterlevel macros for correct loading, and to release more guidance and documentation on the matter, as we understand that how to load is a pain point for many dbtvault users.
Hi! We've added a warning and some better handling around this issue in 0.9.5.
In a future update (likely 0.9.6), we will raise an exception when an attempt is made to use milliseconds.
This report prompted us to re-visit our materialisation logic, and we've got some refinements and bug fixes coming up. Thank you!
Describe the bug
I want to load data which I receive at millisecond precision (example load_dts: 2022-11-09 05:01:23.351). When I use the vault_insert_by_period materialisation with period='MILLISECOND', it results in an unhandled error: Range too big.
14:01:48 Unhandled error while executing model.datavault.hs_store_exploitation Range too big. The sandbox blocks ranges larger than MAX_RANGE (100000).
Environment
dbt version: 1.3.1
dbtvault version: 0.9.1
Database/Platform: Databricks
To Reproduce
Steps to reproduce the behavior:
1. Configure the model with period='MILLISECOND' (see the config sketch below)
2. Run dbt run
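For reference, a minimal sketch of the kind of model this refers to; the column name comes from the example above, while the model names and start date are illustrative assumptions rather than the actual project code:

```sql
-- hs_store_exploitation.sql (name taken from the log output; contents are illustrative)
{{ config(
    materialized='vault_insert_by_period',
    timestamp_field='load_dts',     -- column from the example in the report
    period='MILLISECOND',           -- the setting that triggers "Range too big"
    start_date='2022-11-09'         -- assumed start date, for illustration only
) }}

SELECT *
FROM {{ ref('stg_store_exploitation') }}   -- hypothetical staging model
WHERE __PERIOD_FILTER__                     -- placeholder the materialisation replaces for each period
```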
Expected behavior
I expect data to be loaded just as it would have been if period='DAY' were used.
Log files
08:04:53.174466 [error] [MainThread]: Range too big. The sandbox blocks ranges larger than MAX_RANGE (100000).
08:04:52.211738 [debug] [Thread-4 ]: Databricks adapter: NotImplemented: rollback
Additional context
The Databricks datediff(unit, start, end) function receives the unit of measure (period) as its first parameter, as opposed to the third parameter used in https://github.com/Datavault-UK/dbtvault/blob/4dc4cab0daa9ca17d703a42d50ad651fd2ee707e/macros/materialisations/period_mat_helpers/get_period_boundaries.sql
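A rough sketch of the argument-order mismatch described above, based purely on this report; the macro actually goes through dbt's cross-database datediff, so the generated SQL is simplified here:

```sql
-- Argument order the macro effectively uses (unit of measure as the last argument):
--   datediff(start_timestamp, stop_timestamp, millisecond)

-- Argument order Databricks expects for the three-argument datediff, per this report (unit first):
SELECT datediff(MILLISECOND,
                TIMESTAMP '2022-11-09 05:01:23.351',
                TIMESTAMP '2022-11-09 05:01:23.400') AS num_periods;
```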