Datavault-UK / automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open-source data engineering tool and registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0

[BUG] Adding column to a satellite causes error on bigquery #227

Closed: tenajima closed this issue 3 weeks ago

tenajima commented 4 months ago

Hello,

**Is your feature request related to a problem? Please describe.**
I tried to add a new column to a satellite, expecting that the existing records would also be re-imported into the satellite once the new column was included in the hashdiff. However, an error occurred when automate-dv compiled the model. I am using BigQuery.
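For context, the satellite model is defined roughly like the sketch below (dummy names matching the compiled query later in this thread; the new column is the entry added to `src_payload`, with the hashdiff recalculated upstream in the stage model):

```sql
-- Sketch of the satellite model (dummy names); the new column is appended
-- to src_payload, and the hashdiff is recalculated in the upstream stage model
{{ config(materialized='incremental') }}

{{ automate_dv.sat(src_pk='primary_key',
                   src_hashdiff='hashdiff',
                   src_payload=['attribute_1', 'attribute_2', 'new_column'],
                   src_ldts='load_timestamp',
                   src_source='record_source',
                   source_model='stg_table_name') }}
```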

**Describe the solution you'd like**
How about reverting the change from `window_cols` to `source_cols` that was made in the `latest_records` CTE section of `sat.sql` in this commit?

**Describe alternatives you've considered**
When adding a new column, create a new satellite instead of adding the column to the existing satellite.

**Additional context**
I wanted to know the background of the above commit, but the link to the development repository in CONTRIBUTING.md was broken, so I couldn't view it. Could you please update it? If you accept the proposed solution, I will create a pull request.

Thank you for your attention to this matter.

AB#5350

DVAlexHiggs commented 4 months ago

Hello and thank you for this request. I've changed it to a bug report because this isn't something that should be happening.

Generally speaking, when modifying the columns of a raw vault structure in production, you shouldn't modify the table directly and in place; instead, create a new version of the model and transition downstream dependent tables to that new version, using dbt model versions or a suitable manual alternative.

With that said, during development, ad-hoc changes are necessary to iterate quickly, and regardless this should not be causing an error.

Please can you provide further details of the error (error messages etc.) to help us find the root cause of the issue?

Regarding the commit you referenced: it was one commit in a larger satellite re-work to support intra-day loading and introduce more robust satellite logic for handling problem data and incorrect data loading approaches. This was released in 0.10.0, so there is further context in those release notes.

Our development repository is currently private and we cannot accept contributions via the public repository. This is an ongoing issue which we are seeking to resolve and improve for the community.

tenajima commented 4 months ago

Hello, thank you for your response. Here is the additional information you requested. The error message is as follows:

`Name {new_column} not found inside current_records at [30:165]`

The compiled query is as follows (note that the column names and other identifiers have been replaced with dummy names):

```sql
/* {"app": "dbt", "dbt_version": "1.7.7", "profile_name": "user", "target_name": "prod", "node_id": "model.project_name.sat_table_name"} */
create or replace table `project_name`.`dbt_raw_vault`.`sat_table_name__dbt_tmp`
OPTIONS(
    description="""""",
    expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour)
)
as (
    -- Generated by AutomateDV (formerly known as dbtvault)
    WITH source_data AS (
        SELECT a.primary_key, a.hashdiff, a.attribute_1, a.attribute_2, ..., a.load_timestamp, a.record_source
        FROM `project_name`.`dbt_stage`.`stg_table_name` AS a
        WHERE a.primary_key IS NOT NULL
    ),
    latest_records AS (
        SELECT current_records.primary_key, current_records.hashdiff, current_records.attribute_1, current_records.attribute_2, ..., current_records.load_timestamp, current_records.record_source,
               RANK() OVER (
                   PARTITION BY current_records.primary_key
                   ORDER BY current_records.load_timestamp DESC
               ) AS rank_num
        FROM `project_name`.`dbt_raw_vault`.`sat_table_name` AS current_records
        JOIN (
            SELECT DISTINCT source_data.primary_key
            FROM source_data
        ) AS source_records
            ON source_records.primary_key = current_records.primary_key
        QUALIFY rank_num = 1
    ),
    first_record_in_set AS (
        SELECT sd.primary_key, sd.hashdiff, sd.attribute_1, sd.attribute_2, ..., sd.load_timestamp, sd.record_source,
               RANK() OVER (
                   PARTITION BY sd.primary_key
                   ORDER BY sd.load_timestamp ASC
               ) AS asc_rank
        FROM source_data AS sd
        QUALIFY asc_rank = 1
    ),
    unique_source_records AS (
        SELECT DISTINCT sd.primary_key, sd.hashdiff, sd.attribute_1, sd.attribute_2, ..., sd.load_timestamp, sd.record_source
        FROM source_data AS sd
        QUALIFY sd.hashdiff != LAG(sd.hashdiff) OVER (
            PARTITION BY sd.primary_key
            ORDER BY sd.load_timestamp ASC)
    ),
    records_to_insert AS (
        SELECT frin.primary_key, frin.hashdiff AS hashdiff, frin.attribute_1, frin.attribute_2, ..., frin.load_timestamp, frin.record_source
        FROM first_record_in_set AS frin
        LEFT JOIN latest_records lr
            ON lr.primary_key = frin.primary_key
            AND lr.hashdiff = frin.hashdiff
        WHERE lr.hashdiff IS NULL
        UNION DISTINCT
        SELECT usr.primary_key, usr.hashdiff, usr.attribute_1, usr.attribute_2, ..., usr.load_timestamp, usr.record_source
        FROM unique_source_records AS usr
    )
    SELECT * FROM records_to_insert
);
```

I hope this information is helpful for troubleshooting. Looking forward to your feedback and a solution. Best regards.

DVAlexHiggs commented 4 months ago

Ok, that's clarified things, thank you. The issue arises because the satellite logic requires a 'look-ahead' to the records currently in the satellite. Since the new column is not in the current version of the satellite (it didn't exist before), the query errors because it cannot find that column.

Generally, in development we would tend to just use dbt's `--full-refresh` flag (e.g. `dbt run --full-refresh --select sat_table_name`), which re-creates the table with the new column, allowing future loads to reference it correctly.

It looks like your table is being deployed to production, so not only would I suggest avoiding a `--full-refresh`, but even if this error were not occurring, I would not advise adding columns in an ad-hoc manner like this. Instead, I would suggest implementing appropriate version control, as mentioned in my first reply.

That said, we should probably 'plug' this issue: something like dbt's built-in `on_schema_change` config could be used here to handle newly-discovered columns, perhaps back-filling the values with NULLs.
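For reference, on an ordinary dbt incremental model that behaviour is declared in the config block, roughly as in the sketch below (AutomateDV does not currently wire this up for satellites, which is the gap being discussed):

```sql
-- Hypothetical: dbt's on_schema_change config on a plain incremental model.
-- Accepted values are 'ignore' (the default), 'fail', 'append_new_columns'
-- and 'sync_all_columns'.
{{ config(
    materialized='incremental',
    on_schema_change='append_new_columns'
) }}
```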

This wouldn't be a trivial solution to implement, however, and we must weigh the time needed to develop it against the likelihood of the issue arising, and consider whether it makes sense for us to handle it in the AutomateDV code or whether it should be the user's responsibility.

I would be interested in your thoughts on this!

tenajima commented 4 months ago

Hi,

Thanks for clarifying the problem and providing detailed feedback. As you pointed out, our team already uses `on_schema_change: sync_all_columns`, anticipating that newly added columns would be picked up when they were introduced. An error led us to investigate this issue further.

This time, the column was added because a column defined in raw_stg had been omitted from the satellite definition. While the ideal scenario might call for adding a new satellite, frequently creating small satellites in operation can lead to messy management.

At the same time, it's difficult to maintain strict checks ensuring that every column defined in raw_stg is correctly added to the satellite. In such omission cases, intuitively adding the column to the satellite, computing a new hashdiff, and initiating a re-import makes sense to us (see the sketch below). We agree that NULL should fill the past values in these cases.
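For concreteness, the hashdiff is declared in the stage model's `hashed_columns`, so fixing such an omission touches both the stage and the satellite models. A sketch with hypothetical names (`raw_stg`, `some_business_key`, `new_column`) rather than our real metadata:

```sql
-- Hypothetical stage model (stg_table_name): the hashdiff is computed here,
-- so the omitted column must be added to this list as well as to the satellite
{%- set yaml_metadata -%}
source_model: raw_stg
derived_columns:
  record_source: "!raw_stg"
hashed_columns:
  primary_key: some_business_key
  hashdiff:
    is_hashdiff: true
    columns:
      - attribute_1
      - attribute_2
      - new_column
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ automate_dv.stage(include_source_columns=true,
                     source_model=metadata_dict['source_model'],
                     derived_columns=metadata_dict['derived_columns'],
                     hashed_columns=metadata_dict['hashed_columns']) }}
```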

Thank you again for your ideas!

mrcool4 commented 2 months ago

Hi @DVAlexHiggs, we are facing this exact issue. Any chance of incorporating this feature in a future release?

DVAlexHiggs commented 3 weeks ago

Fixed and released in v0.11.0. Thank you for your patience on this one!

If the issue persists, please re-open this issue.