Thanks for the report. Can you tell me what materialisation you are using, please?
Hi, the materialisation is `view`.
Thanks. Raw vault structures should never be materialised as views. This will break their behaviour, because they require data to be stored over time. Please try it using `incremental` instead and let us know if you see the same behaviour.
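For reference, a minimal sketch of an eff_sat model materialised incrementally (the column names, hash keys, and the `v_stg_orders` source model are illustrative assumptions, not taken from your project):

```sql
{{ config(materialized='incremental') }}

{{ dbtvault.eff_sat(src_pk='ORDER_CUSTOMER_PK', src_dfk='ORDER_FK', src_sfk='CUSTOMER_FK',
                    src_start_date='START_DATE', src_end_date='END_DATE',
                    src_eff='EFFECTIVE_FROM', src_ldts='LOAD_DATE', src_source='SOURCE',
                    source_model='v_stg_orders') }}
```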
Sorry, I checked the wrong value by mistake. It seems we're actually using `incremental`.
Ok, that makes sense. I think this is similar to Datavault-UK/dbtvault#50, in that the advised (though not well-documented) materialisation is `vault_insert_by_period`, and using `incremental` can cause some unpredictable behaviour. We need to document this better and also put some checks/safety nets in place to prevent things breaking quite so easily.
I think they're two different scenarios. In Datavault-UK/dbtvault#50, the two records have different `loaddate` values, while in my case, one and the same record (every column is identical) is inserted multiple times when the record is closed (`end_date` doesn't equal '9999-12-31') in both the staging model and the `eff_sat` model. I don't think using `vault_insert_by_period` will work.
I think the issue here is that the closed record is inserted into `eff_sat` twice after `dbt run`. When we build

```sql
new_reopened_records AS (
    SELECT xxx FROM source_data AS g
    INNER JOIN latest_closed AS lc
)
```

the records from `source_data` are not all open, which means some of the records landing in `new_reopened_records` are actually closed, and these closed records are then inserted into `eff_sat` again.
Understood. We'll double-check and do some experimentation internally, as I'm sure we have tests for this that are passing.
Just conferred with a colleague. Effectivity Satellites have only ever been tested with `vault_insert_by_period`, which is the recommended materialisation for all raw vault structures that contain history (everything except Hubs and Links). Can you move to this materialisation and let us know if the problem is fixed? If not, then this is definitely a bug.
The reason we advise `vault_insert_by_period` instead of `incremental` is that `vault_insert_by_period` includes additional logic to ensure only the most recent (new, previously unloaded) records are loaded into the raw vault. This is as per Data Vault 2.0 guidance: we don't want to scan or run logic over the entire history on every load, only over the difference/delta since the last load. Duplicates come in when using `incremental` because we aren't reducing the size of the stage when loading into the raw vault: dbtvault will still attempt to load the full history each time.
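To illustrate the delta restriction being described, here is the generic dbt incremental pattern (an illustrative sketch of the concept, not dbtvault's internal logic; `stg_orders` and `LOAD_DATE` are assumed names):

```sql
SELECT *
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
-- Only scan records newer than what is already in the target table,
-- rather than re-processing the full stage history on every run.
WHERE LOAD_DATE > (SELECT MAX(LOAD_DATE) FROM {{ this }})
{% endif %}
```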
Some further internal discussion:
Generally we do want to avoid suggesting `vault_insert_by_period`; however, we have found many users are trying to load the entire stage history with each load. The easiest workaround is to use `vault_insert_by_period`, hence the suggestion.
We have mentioned in many places in the documentation that all loads should only ever load new records, not the whole history. In theory, if the stage data only ever contains new records, the `incremental` materialisation should work as expected, but we have a gap in our testing for this because we went down the route of recommending `vault_insert_by_period` for tables containing history.
Moving forward, we have decided to make `incremental` the base case, with the strict caveat that the stage tables need to contain delta feeds, as per Data Vault 2.0 recommendations and standards. This will be quite a big overhaul of some of the macros and structures in dbtvault, so it's not something we can do overnight. For now, please use `vault_insert_by_period`.
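For anyone landing here later, a minimal sketch of that configuration (the `timestamp_field`, `period`, and `start_date` values are illustrative; check the dbtvault docs for your version):

```sql
{{ config(materialized='vault_insert_by_period',
          timestamp_field='LOAD_DATE',   -- assumed load-date column
          period='day',                  -- load one period of data per iteration
          start_date='2021-01-01') }}    -- illustrative earliest load date

-- ...followed by the same dbtvault.eff_sat() call as in the earlier sketch.
```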
Okay, we'll try `vault_insert_by_period` to see if that works. Thank you.
We will ensure that the methods for loading the Raw Vault are documented more clearly. This is not a bug as such, but a mismatch between the intended implementation of Data Vault 2.0 and dbtvault.
In short: A single load/batch (i.e. dataset present in the stage layer) should contain only a delta set of data; the changes since the last load. This means each batch should only have unique PKs. We can have 10 different PKs, but we cannot have any of those 10 PKs again in the same load.
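As a quick sanity check that a stage batch conforms to this, one could run a duplicate-PK query along these lines (a hypothetical sketch; `stg_orders` and `CUSTOMER_PK` are assumed names):

```sql
-- Any rows returned indicate the batch is not a clean delta feed.
SELECT CUSTOMER_PK, COUNT(*) AS occurrences
FROM stg_orders            -- hypothetical stage table holding a single load/batch
GROUP BY CUSTOMER_PK
HAVING COUNT(*) > 1;
```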
Closing in favour of Datavault-UK/dbtvault-docs#44. Thank you.
**Describe the bug**
For closed records (`end_date` does not equal '9999-12-31'), running `dbt run -m ${model_name}` a second time inserts the same records into the `eff_sat` table again.

**Versions**
dbt: 0.20.0
dbtvault: 0.7.6.1

**To Reproduce**
Steps to reproduce the behavior:
1. Set up a staging table containing a closed record, i.e. one whose `end_date` is not '9999-12-31'.
2. Create an `eff_sat` model whose `source_model` is the staging table set in step 1.
3. Run `dbt run -m ${model_name}` twice.

**Expected behavior**
On the second `dbt run`, it shouldn't insert the same closed record again.

**Additional context**
I think this is a bug in dbtvault's `eff_sat` macro: https://github.com/Datavault-UK/dbtvault/blob/master/macros/tables/eff_sat.sql#L56
A record whose status is closed (`end_date` doesn't equal '9999-12-31') in `eff_sat` appears in `latest_records`. But if the record's status doesn't change in the staging table, it is not filtered out of `new_reopened_records` and gets inserted again. It seems like adding a `where` clause on `end_date = TO_DATE('9999-12-31')` in `new_reopened_records` can solve the problem.
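For concreteness, a sketch of how that proposed filter might look inside the CTE (the column list and join condition are elided exactly as in the snippet quoted earlier; this is the reporter's suggested workaround, not an official patch, and it assumes the intent is to exclude closed records from the stage side):

```sql
new_reopened_records AS (
    SELECT xxx FROM source_data AS g
    INNER JOIN latest_closed AS lc
    -- keep only stage records that are still open, so closed records
    -- whose status hasn't changed are not re-inserted into eff_sat
    WHERE g.end_date = TO_DATE('9999-12-31')
)
```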