Datavault-UK / automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0-compliant Data Warehouses (powered by dbt, an open-source data engineering tool and registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0

[BUG] Using 'vault_insert_by_period' materialisation with period='MILLISECOND' results in Unhandled error: Range too big. #175

Closed: anouar-zh closed this issue 1 year ago

anouar-zh commented 1 year ago

Describe the bug
I want to load data which I receive with millisecond precision (example load_dts: 2022-11-09 05:01:23.351). When I use the 'vault_insert_by_period' materialisation with period='MILLISECOND', it results in Unhandled error: Range too big.

14:01:48 Unhandled error while executing model.datavault.hs_store_exploitation Range too big. The sandbox blocks ranges larger than MAX_RANGE (100000).

Environment

dbt version: 1.3.1
dbtvault version: 0.9.1
Database/Platform: Databricks

To Reproduce
Steps to reproduce the behavior:

  1. Use a dataset with data received in milliseconds
  2. Use the vault_insert_by_period materialisation with period='MILLISECOND' (see the config sketch below these steps)
  3. Execute dbt run
  4. See error
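
For reference, a minimal model configuration along the lines of these steps might look like the sketch below. The timestamp column, start date, and staging model name are placeholders rather than values from the actual project; the config keys shown (timestamp_field, period, start_date) follow the dbtvault documentation for this materialisation.

```sql
{{ config(
    materialized='vault_insert_by_period',
    timestamp_field='LOAD_DTS',   -- placeholder: the millisecond-precision load timestamp
    period='MILLISECOND',         -- the period that triggers the error in this report
    start_date='2022-11-09'       -- placeholder start date
) }}

SELECT *
FROM {{ ref('stg_store_exploitation') }}  -- placeholder staging model name
```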

Expected behavior
I expect the data to be loaded just as it would have been when period='DAY' was used.

Screenshots

[Screenshot attached: 2022-12-23 at 09:06:35]

Log files
08:04:53.174466 [error] [MainThread]: Range too big. The sandbox blocks ranges larger than MAX_RANGE (100000).
08:04:52.211738 [debug] [Thread-4 ]: Databricks adapter: NotImplemented: rollback

Additional context
The Databricks datediff(unit, start, end) function takes the unit of measure as its first parameter, whereas the macro at https://github.com/Datavault-UK/dbtvault/blob/4dc4cab0daa9ca17d703a42d50ad651fd2ee707e/macros/materialisations/period_mat_helpers/get_period_boundaries.sql passes it as the third parameter.
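
To make the parameter-order difference concrete, here is a hedged illustration. The first statement uses the Databricks SQL signature (unit first) with made-up timestamps; the comment underneath shows the order the reporter describes for the linked macro, not the macro's exact generated SQL.

```sql
-- Databricks SQL: the unit of measure is the first argument
SELECT datediff(MILLISECOND,
                TIMESTAMP'2022-11-09 05:01:23.351',
                TIMESTAMP'2022-11-09 05:01:23.951');  -- returns 600

-- Order described for the linked get_period_boundaries.sql macro (illustrative only):
-- DATEDIFF(start_timestamp, stop_timestamp, 'MILLISECOND')
```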

[Screenshot attached: 2022-12-23 at 09:14:00]

DVAlexHiggs commented 1 year ago

Hi, thanks for your report!

It's odd that datediff has a different parameter order in Databricks; that's perhaps something that needs to be reported to them. We'll look into that (we can fix it on our end by using named parameters, though).
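
As an illustration of the named-parameter idea (a sketch only, not necessarily the fix that was eventually shipped): dbt's cross-database datediff macro can be called with Jinja keyword arguments, so the call no longer depends on positional order and each adapter renders its own argument layout.

```sql
-- Sketch, assuming dbt-core >= 1.2 where dbt.datediff(first_date, second_date, datepart)
-- is available; the table and column names are placeholders.
SELECT
    {{ dbt.datediff(first_date='start_timestamp',
                    second_date='stop_timestamp',
                    datepart='day') }} AS num_periods
FROM period_boundaries
```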

Generally, millisecond is ill-advised as an iteration period for most use cases, as there will potentially be thousands and thousands of iterations for dbt to perform.

This will perform badly. How many iterations would there be for dbt to do if this were to work?

Either way, you're right that this shouldn't produce an error like this of course, so we'd like to have a solution for this.

anouar-zh commented 1 year ago

In terms of iterations when using milliseconds, I think there would be a maximum of 20-50. Is this issue a candidate for a minor fix/release?

DVAlexHiggs commented 1 year ago

> In terms of iterations when using milliseconds, I think there would be a maximum of 20-50. Is this issue a candidate for a minor fix/release?

Yes of course. Will be worked on after the holiday. Thank you for your report

DVAlexHiggs commented 1 year ago

Hi. We've done some investigation into this, and essentially it's due to the range() Jinja function having a maximum of 100,000 iterations in dbt's sandboxed environment. Though this is a technical limit, it does highlight the issue I previously described around millisecond periods being ill-advised.
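
For context, the limit comes from Jinja's sandboxed range(); any dbt-rendered Jinja that asks for more than 100,000 elements fails the same way. The contrived model below (unrelated to AutomateDV's macros) is a sketch that reproduces the error message seen in the log above.

```sql
-- Illustrative only: asking the sandboxed range() for more than 100,000 elements
-- fails with "Range too big. The sandbox blocks ranges larger than MAX_RANGE (100000)."
{% set too_many_periods = 100001 %}

SELECT 1 AS dummy_col
{% for i in range(too_many_periods) %}
-- period {{ i }}
{% endfor %}
```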

Because of this, it's something we won't be fixing; instead we will provide some error handling and a friendly message, such as:

"Max iterations is 100,000. Consider using a different time period (e.g. day). vault_insert_by materialisations are not intended for this purpose, please see -link to docs-"

We do need some more guidance on this in the docs, but essentially the vault_insert_by materialisations are not for loading large amounts of data, such as what you would have in a first-time load (base load), and should not be used for highly granular, millisecond-difference data. There are other approaches for loading, including those in the DV 2.0 standards.

Our aim for a while now has been to implement waterlevel macros for loading correctly, and to release more guidance and documentation on the matter, as we understand that how to load is a pain point for many dbtvault users.

DVAlexHiggs commented 1 year ago

Hi! We've added a warning and some better handling around this issue in 0.9.5.

In a future update (likely 0.9.6), we will raise an exception when an attempt is made to use milliseconds.
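
A rough sketch (not AutomateDV's actual implementation) of what such a guard can look like inside a dbt macro or materialisation, using dbt's built-in exceptions helper; the period variable and the message wording are placeholders:

```sql
{#- Sketch only: reject sub-second periods before any period iteration happens. -#}
{% if period | lower == 'millisecond' %}
    {% do exceptions.raise_compiler_error(
        "Max iterations is 100,000. Consider using a different time period (e.g. day)."
    ) %}
{% endif %}
```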

This report prompted us to re-visit our materialisation logic, and we've got some refinements and bug fixes coming up. Thank you!