VMois / miwaitway

Calculate average wait time on bus stops using GTFS real-time vehicle location and display it on a map
https://vmois.dev/miwaitway-average-wait-time-on-stop/
3 stars 0 forks source link

Resolving duplication issues in raw data #18

Closed VMois closed 2 months ago

VMois commented 2 months ago

Even after solving de-duplication issues in the vehicle position ingestor, problems still caused the MERGE request to fail when running a DAG. After some investigation, I found that two CSV files containing multiple chunks of vehicle positions still had an overlap by vehicle_id and timestamp. Because of that, the MERGE step couldn't determine which row to use to insert into the final raw table.

The solution is to load each file one by one. Load a single file to a stage, merge the stage into raw, clean the stage, and repeat. In this way, duplicates do not end up in the same table but will be loaded batch by batch, and with MERGE, they will match and simply update the values.

One minor issue still persists is that, for now, the order of the CSV files in GCS is random. Ideally, duplicated rows (by vehicle_id and timestamp) can have a slight difference where newer rows have, for example, occupancy specified. With current loading, we might lose this data because of random order. However, it is not critical for our desired feature and can be addressed later.