Problem DescriptionIn order to store data efficiently, we need filter out duplicate rows and mark changes. Every month we receive a dataset from ICBC containing records for that month. The assumption is that the VIN is the unique identifier for each record.
Scenario 1: If an individual has a 12 month insurance for the same vehicle, their record will be repeated 12 times for the same VIN. We would only need one record per year.
Scenario 2: If someone sells their car in September, then in the October, the VIN will be the same but the owner details would be different. We would need to keep both of the records because we need it for summary statistics (e.g. # of ZEVs in each regional district).
Scenario 3: Record is removed. No VIN with that record is detected.
We can create a column "Record changes". For each calendar year, it will flag if a record has been created, modified or removed. If a record has been created and have not been modified and removed, that means it exists for all of the months that have no changes detected.
This way we can keep only the information that has changed.
ICBC August
Bob - Mazda - Vancouver - VIN 100
Tom - Electric Ford - Kelowna - VIN 909
Jess - Honda - Victoria - VIN 888
ICBC September
Bob - Mazda - Vancouver - Vancouver - VIN 100
Mary - Electric Ford - Terrace - VIN 909
Stored in the database
August - Bob - Mazda - Vancouver - VIN 100 - Created
August - Tom - Electric Ford - Kelowna - VIN 909 - Created
August - Jess - Honda - Victoria - VIN 888 - Created
September - Mary - Electric Ford - Terrace - VIN 909 - Modified
September - Jess - Honda - Victoria - VIN 888 - Removed
Solution Needs
Evaluate if this approach is feasible. How can it work if the dataset columns vary year to year? Are VINs the best unique identifiers or is a different approach needed? Do you think queries would be more difficult? Is it better to store everything as is (i.e. duplicate records every month)?
Timebox
1 day
Outcome
Details describing the outcome of the research
Was it successful? What direction should the work go?
Problem Description In order to store data efficiently, we need filter out duplicate rows and mark changes. Every month we receive a dataset from ICBC containing records for that month. The assumption is that the VIN is the unique identifier for each record.
Scenario 1: If an individual has a 12 month insurance for the same vehicle, their record will be repeated 12 times for the same VIN. We would only need one record per year.
Scenario 2: If someone sells their car in September, then in the October, the VIN will be the same but the owner details would be different. We would need to keep both of the records because we need it for summary statistics (e.g. # of ZEVs in each regional district).
Scenario 3: Record is removed. No VIN with that record is detected.
We can create a column "Record changes". For each calendar year, it will flag if a record has been created, modified or removed. If a record has been created and have not been modified and removed, that means it exists for all of the months that have no changes detected.
This way we can keep only the information that has changed.
ICBC August Bob - Mazda - Vancouver - VIN 100 Tom - Electric Ford - Kelowna - VIN 909 Jess - Honda - Victoria - VIN 888
ICBC September Bob - Mazda - Vancouver - Vancouver - VIN 100 Mary - Electric Ford - Terrace - VIN 909
Stored in the database August - Bob - Mazda - Vancouver - VIN 100 - Created August - Tom - Electric Ford - Kelowna - VIN 909 - Created August - Jess - Honda - Victoria - VIN 888 - Created September - Mary - Electric Ford - Terrace - VIN 909 - Modified September - Jess - Honda - Victoria - VIN 888 - Removed
Solution Needs
Timebox
Outcome Details describing the outcome of the research