bcgov / cthub

Apache License 2.0
2 stars 5 forks source link

CTHUB - ICBC data storage research #394

Open katerinkus opened 2 months ago

katerinkus commented 2 months ago

Problem Description In order to store data efficiently, we need filter out duplicate rows and mark changes. Every month we receive a dataset from ICBC containing records for that month. The assumption is that the VIN is the unique identifier for each record.

Scenario 1: If an individual has a 12 month insurance for the same vehicle, their record will be repeated 12 times for the same VIN. We would only need one record per year.

Scenario 2: If someone sells their car in September, then in the October, the VIN will be the same but the owner details would be different. We would need to keep both of the records because we need it for summary statistics (e.g. # of ZEVs in each regional district).

Scenario 3: Record is removed. No VIN with that record is detected.

We can create a column "Record changes". For each calendar year, it will flag if a record has been created, modified or removed. If a record has been created and have not been modified and removed, that means it exists for all of the months that have no changes detected.

This way we can keep only the information that has changed.

ICBC August Bob - Mazda - Vancouver - VIN 100 Tom - Electric Ford - Kelowna - VIN 909 Jess - Honda - Victoria - VIN 888

ICBC September Bob - Mazda - Vancouver - Vancouver - VIN 100 Mary - Electric Ford - Terrace - VIN 909

Stored in the database August - Bob - Mazda - Vancouver - VIN 100 - Created August - Tom - Electric Ford - Kelowna - VIN 909 - Created August - Jess - Honda - Victoria - VIN 888 - Created September - Mary - Electric Ford - Terrace - VIN 909 - Modified September - Jess - Honda - Victoria - VIN 888 - Removed

Solution Needs

Timebox

Outcome Details describing the outcome of the research