Closed tenajima closed 5 months ago
Hello and thank you for this request. I've changed it to a bug report because this isn't something that should be happening.
Generally speaking, when modifying the columns of a raw vault structure in production, you shouldn't modify the table directly and in place. Instead, create a new version of the model and transition downstream dependent tables to that new version, using dbt model versions or a suitable manual alternative.
That said, ad-hoc changes are necessary during development to iterate quickly, and in any case this should not cause an error.
Please can you provide further details of the error (error messages etc.) to help us find the root cause of the issue?
Regarding the commit you referenced, this was one commit in a larger satellite re-work to support intra-day loading and introduce more robust satellite logic to handle problem data and incorrect data loading approaches. This was released in 0.10.0, so there's further context in those release notes.
Our development repository is currently private and we cannot accept contributions via the public repository. This is an ongoing issue which we are seeking to resolve and improve for the community.
Hello,
Thank you for your response. Here is the additional information that you requested.
The error message is as follows:
Name {new_column} not found inside current_records at [30:165]
Compiled query is as follows:
I hope this information is helpful for troubleshooting. Looking forward to your feedback and solution. Best Regards
Ok that's clarified things, thank you. The issue is arising because the satellite logic requires a 'look-ahead' to the records currently in the Satellite. As the new column is not in the current version of the satellite (because it didn't exist before) it is giving an error because it cannot find the column.
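The failure mode described above can be reproduced in miniature. The following is only an illustrative sketch: it uses Python's built-in sqlite3 rather than BigQuery, and the table name sat_customer, its columns, and the look_ahead helper are hypothetical, not AutomateDV code. It shows that a query which reads a newly added column from the existing table fails because the deployed table does not have that column yet.

```python
import sqlite3

# Hypothetical satellite with only the columns that existed before the change.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sat_customer (customer_hk TEXT, hashdiff TEXT, name TEXT)"
)
conn.execute("INSERT INTO sat_customer VALUES ('hk1', 'hd1', 'Alice')")


def look_ahead(connection, columns):
    """Read the given columns from the records currently in the satellite."""
    query = f"SELECT {', '.join(columns)} FROM sat_customer"
    return connection.execute(query).fetchall()


# Works: every requested column already exists in the table.
current_records = look_ahead(conn, ["customer_hk", "hashdiff", "name"])

# Fails: 'email' was added to the model but not yet to the deployed table.
error_message = None
try:
    look_ahead(conn, ["customer_hk", "hashdiff", "name", "email"])
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(current_records)
print(error_message)
```

The wording of the error differs between databases, but the cause is the same as in the BigQuery message quoted earlier: the column is referenced before it exists in the target table.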
Generally, in development we would tend to just use the dbt --full-refresh flag, which would re-create the table with the new column, allowing future loads to reference the new column correctly.
It looks like your table is being deployed to production, so I would suggest avoiding a --full-refresh. Even if this error were not occurring, I would not advise adding columns in an ad-hoc manner like this; instead, I would suggest implementing appropriate version control, as mentioned in my first reply.
That said, we should probably 'plug' this issue. Something like dbt's built-in on_schema_change could be used here to handle newly discovered columns, perhaps back-filling the values with NULLs.
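As a rough illustration of the back-filling idea, the sketch below adds any missing columns to an existing table before it is read, so that older rows carry NULL in the new column. It is a minimal sketch under stated assumptions: it uses sqlite3 in place of BigQuery, and add_missing_columns, sat_customer, and the column names are hypothetical. It is not how dbt or AutomateDV implement schema changes.

```python
import sqlite3


def add_missing_columns(connection, table, expected_columns):
    """Add any expected column the table lacks; return the names added."""
    existing = {row[1] for row in connection.execute(f"PRAGMA table_info({table})")}
    added = [col for col in expected_columns if col not in existing]
    for col in added:
        # Rows that already exist get NULL for the newly added column.
        connection.execute(f"ALTER TABLE {table} ADD COLUMN {col} TEXT")
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sat_customer (customer_hk TEXT, hashdiff TEXT, name TEXT)")
conn.execute("INSERT INTO sat_customer VALUES ('hk1', 'hd1', 'Alice')")

added = add_missing_columns(
    conn, "sat_customer", ["customer_hk", "hashdiff", "name", "email"]
)
rows = conn.execute(
    "SELECT customer_hk, hashdiff, name, email FROM sat_customer"
).fetchall()
print(added)
print(rows)
```

Once the column exists, a query that reads it from the current records no longer fails, and historical rows simply report NULL for it.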
This wouldn't be a trivial solution to implement, however. We must weigh the time taken to develop the solution against the likelihood of the issue, and decide whether it makes sense for us to handle it in the AutomateDV code or whether it should be the user's responsibility.
I would be interested in your thoughts on this!
Hi,
Thanks for clarifying the problem and providing detailed feedback. As you pointed out, our team uses sync_all_columns with on_schema_change, so that new columns are added to the table as they are introduced. An error led us to investigate this issue further.
This time, the column was added because a column defined in raw_stg had been omitted from the satellite definition. While the ideal scenario might call for the addition of a new satellite, frequently creating small satellites in operation can lead to messy management.
At the same time, it's difficult to maintain strict checks to ensure all columns defined in raw_stg are correctly added to the satellite. In such cases, adding the omitted column to the satellite, computing a new hashdiff, and initiating a re-import makes intuitive sense to us. We agree that NULL should fill the past values in these cases.
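To make the hashdiff point concrete, here is a small sketch of why including a new column changes the hashdiff for every record, which is what would trigger the re-import. The hashing scheme shown (MD5 over trimmed, upper-cased values joined with a delimiter, with a placeholder for NULL) and the hashdiff function are assumptions for illustration only; the exact hashing AutomateDV applies depends on its configuration.

```python
import hashlib


def hashdiff(record, columns, delimiter="||", null_placeholder="^^"):
    """Hash the chosen payload columns of one record (illustrative scheme)."""
    parts = []
    for col in columns:
        value = record.get(col)
        parts.append(null_placeholder if value is None else str(value).strip().upper())
    return hashlib.md5(delimiter.join(parts).encode("utf-8")).hexdigest().upper()


# Hypothetical record whose new 'email' column has no value yet.
record = {"name": "Alice", "email": None}

old_hashdiff = hashdiff(record, ["name"])           # satellite before the change
new_hashdiff = hashdiff(record, ["name", "email"])  # satellite after adding 'email'

print(old_hashdiff != new_hashdiff)
```

Under this assumed scheme the two values differ even when the new column is NULL, so every existing key would be seen as changed on the next load.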
Thank you again for your ideas!
Hi @DVAlexHiggs, we are facing this exact issue. Any chance of incorporating this feature in a future release?
Fixed and released in v0.11.0. Thank you for your patience on this one!
If the issue persists, please re-open this issue.
Hello,
Is your feature request related to a problem? Please describe.
I tried to add a new column to the satellite. I added it, expecting that the existing records would also be re-imported into the satellite by including the new column in the hashdiff. However, an error occurred in the part where automate-dv compiles the code. I am using BigQuery.
Describe the solution you'd like
How about reverting the change from window_cols to source_cols that was made in the latest_records CTE section of sat.sql in this commit?
Describe alternatives you've considered
When adding a new column, create a new satellite instead of adding it to the existing satellite.
Additional context
I wanted to know the background of the above commit, but the link to the development repository in CONTRIBUTING.md was broken and I couldn't view it. Could you please update it? If you accept the proposed solution, I will create a Pull Request.
Thank you for your attention to this matter.
AB#5350