dataintoresults / data-brewery

Data Brewery is an ETL (Extract-Transform-Load) program that connects to many data sources (cloud services, databases, ...) and manages data warehouse workflows.
https://databrewery.co
Apache License 2.0

Question on Archive Module #3

Open viki-crypto opened 3 years ago

viki-crypto commented 3 years ago

I am a little confused about the usage of the archive module with the overwrite strategy. Please let me know if you can suggest an approach for my use case.

I have an Orders table with over 1 million records. The primary key is ORDER_ID. The same table also keeps track of the status of each order in an order_status column, which can take values from a fixed set (ordered, accepted, processing, shipped, delivered, returned, and a few more). This status changes over time, sometimes over a month, depending on the order lifecycle.

How should I go about setting up an archive module for such a case? For example, what business keys should I choose: just the order_id, or a combination of columns? Also, how does the archive module work? I understand it for new records, but how does it detect whether a record was updated in the source database? Thanks in advance.

SebVentures commented 3 years ago

There is no system to keep track of changing rows natively.

If you have a creation_date that is static on the order, you can use order_id and order_status together as the business_key. You can then add a column current_date AS last_seen_status (current_date being a PostgreSQL function).

That would give you something like:

order_id, order_status, last_seen_status, creation_date
1, ordered, 2020-06-16, 2020-06-15
1, accepted, 2020-06-18, 2020-06-15
1, processing, 2020-06-21, 2020-06-15
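A minimal sketch of that extraction query, using SQLite in place of PostgreSQL (SQLite also provides current_date); the table and column names follow the example above and are otherwise illustrative:

```python
import sqlite3

# Hypothetical source table mirroring the thread's Orders example.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id      INTEGER,
        order_status  TEXT,
        creation_date TEXT
    )
""")
conn.execute("INSERT INTO orders VALUES (1, 'ordered', '2020-06-15')")

# The (order_id, order_status) pair acts as the business key, and
# current_date stamps when this status was last seen. Each pipeline run
# refreshes last_seen_status for the statuses still present in the source.
rows = conn.execute("""
    SELECT order_id,
           order_status,
           current_date AS last_seen_status,
           creation_date
    FROM orders
""").fetchall()
print(rows)
```

Because the status is part of the business key, a status change produces a new archived row instead of overwriting the old one.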

Using an ordering and the lag() over () window function, you can then compute first_seen and last_seen. This would be a different table (with a rebuild strategy).

order_id, order_status, first_seen, last_seen_status, creation_date
1, ordered, 2020-06-15, 2020-06-16, 2020-06-15
1, accepted, 2020-06-16, 2020-06-18, 2020-06-15
1, processing, 2020-06-18, 2020-06-21, 2020-06-15
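A sketch of that lag() over () computation, again using SQLite (window functions require SQLite >= 3.25) in place of PostgreSQL; first_seen of a status is taken as the previous status's last_seen_status, falling back to creation_date for the first status:

```python
import sqlite3  # window functions need SQLite >= 3.25

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_status_archive (
        order_id         INTEGER,
        order_status     TEXT,
        last_seen_status TEXT,
        creation_date    TEXT
    )
""")
conn.executemany(
    "INSERT INTO order_status_archive VALUES (?, ?, ?, ?)",
    [
        (1, 'ordered',    '2020-06-16', '2020-06-15'),
        (1, 'accepted',   '2020-06-18', '2020-06-15'),
        (1, 'processing', '2020-06-21', '2020-06-15'),
    ],
)

# first_seen = previous row's last_seen_status within the same order,
# ordered by when each status was last seen; coalesce() supplies
# creation_date for the very first status.
rows = conn.execute("""
    SELECT order_id,
           order_status,
           coalesce(
               lag(last_seen_status) OVER (
                   PARTITION BY order_id
                   ORDER BY last_seen_status
               ),
               creation_date
           ) AS first_seen,
           last_seen_status,
           creation_date
    FROM order_status_archive
""").fetchall()
for r in rows:
    print(r)
```

This reproduces the table above and would run as a rebuild-strategy table on top of the archive.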

viki-crypto commented 3 years ago

This sounds interesting; let me try this approach. Thanks!

Also, is there any way to add indexes on the tables getting created from within the pipeline?

SebVentures commented 3 years ago

row_number() over () should do the trick
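If the goal is a sequential id column on the generated table, row_number() over () can be sketched as follows (SQLite again stands in for PostgreSQL; the table and column names are illustrative):

```python
import sqlite3  # window functions need SQLite >= 3.25

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(10, 'ordered'), (11, 'accepted'), (12, 'shipped')],
)

# row_number() over () assigns a sequential number to each row, which
# can serve as a surrogate key on the materialized table.
rows = conn.execute("""
    SELECT row_number() OVER (ORDER BY order_id) AS rn,
           order_id,
           order_status
    FROM orders
""").fetchall()
print(rows)
```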