CTHUB - ICBC data storage research

Problem Description In order to store data efficiently, we need filter out duplicate rows and mark changes. Every month we receive a dataset from ICBC containing records for that month. The assumption is that the VIN is the unique identifier for each record.

Scenario 1: If an individual has a 12 month insurance for the same vehicle, their record will be repeated 12 times for the same VIN. We would only need one record per year.

Scenario 2: If someone sells their car in September, then in the October, the VIN will be the same but the owner details would be different. We would need to keep both of the records because we need it for summary statistics (e.g. # of ZEVs in each regional district).

Scenario 3: Record is removed. No VIN with that record is detected.

We can create a column "Record changes". For each calendar year, it will flag if a record has been created, modified or removed. If a record has been created and have not been modified and removed, that means it exists for all of the months that have no changes detected.

This way we can keep only the information that has changed.

ICBC August Bob - Mazda - Vancouver - VIN 100 Tom - Electric Ford - Kelowna - VIN 909 Jess - Honda - Victoria - VIN 888

ICBC September Bob - Mazda - Vancouver - Vancouver - VIN 100 Mary - Electric Ford - Terrace - VIN 909

Stored in the database August - Bob - Mazda - Vancouver - VIN 100 - Created August - Tom - Electric Ford - Kelowna - VIN 909 - Created August - Jess - Honda - Victoria - VIN 888 - Created September - Mary - Electric Ford - Terrace - VIN 909 - Modified September - Jess - Honda - Victoria - VIN 888 - Removed

Solution Needs

Evaluate if this approach is feasible. How can it work if the dataset columns vary year to year? Are VINs the best unique identifiers or is a different approach needed? Do you think queries would be more difficult? Is it better to store everything as is (i.e. duplicate records every month)?

Timebox

1 day

Outcome Details describing the outcome of the research

Was it successful? What direction should the work go?
Was it unsuccessful? Discuss next steps with team

bcgov / cthub

CTHUB - ICBC data storage research #394