CityofToronto / bdit_data-sources

Data sources used by the Big Data Innovation Team
https://github.com/orgs/CityofToronto/teams/bigdatainnovationteam
GNU General Public License v3.0
39 stars 8 forks source link

Miovision: Should we filter unacceptable_gaps from volumes_15min_mvt via view instead of during insert? #879

Open gabrielwol opened 9 months ago

gabrielwol commented 9 months ago

Pros:

Cons:

Example implementation and testing:

CREATE TABLE gwolofs.unacceptable_gaps_indexed AS (
    SELECT * FROM miovision_api.unacceptable_gaps 
)

CREATE INDEX unacceptable_gaps_intersection_datetime_bin_idx
ON gwolofs.unacceptable_gaps_indexed
USING btree (intersection_uid, datetime_bin);

--DROP VIEW gwolofs.volumes_15min_mvt_filtered;
CREATE VIEW gwolofs.volumes_15min_mvt_filtered AS (
    SELECT
        v1m.volume_15min_mvt_uid,
        v1m.intersection_uid,
        v1m.datetime_bin,
        v1m.classification_uid,
        v1m.leg,
        v1m.movement_uid,
        CASE
            WHEN un.datetime_bin IS NULL THEN v1m.volume
            ELSE NULL
        END AS volume
    FROM miovision_api.volumes_15min_mvt AS v1m --this would change to _unfiltered and have zeros instead of nulls
    LEFT JOIN gwolofs.unacceptable_gaps_indexed AS un USING (intersection_uid, datetime_bin)
)

--test view performance:
--15.345 + 15.324s + 15.317s : 15.32 avg
SELECT intersection_uid, classification_uid, AVG(volume)
FROM gwolofs.volumes_15min_mvt_filtered
WHERE datetime_bin >= '2023-01-01' AND datetime_bin < '2024-01-01'
--new index useful on smaller slices
--WHERE datetime_bin >= '2023-01-01' AND datetime_bin < '2024-01-01' AND intersection_uid = 65
GROUP BY intersection_uid, classification_uid

--baseline:
--12.2s / 11.724s / 12.119s: 12.01s avg
SELECT intersection_uid, classification_uid, AVG(volume)
FROM miovision_api.volumes_15min_mvt
WHERE datetime_bin >= '2023-01-01' AND datetime_bin < '2024-01-01'
GROUP BY intersection_uid, classification_uid
chmnata commented 9 months ago

Is volumes_15min_mvt the only table affected by this issue? Would we have to do this on volumes_15 table as well ?

gabrielwol commented 9 months ago

You're right, we'd need to add the same view/table treatment to volumes_15min which increases the complexity a bit. Also we'd need to edit many views which refer to these:

volumes_15min:

_RETURN ON aduyves.aadt_miovision_avg_daily _RETURN ON data_requests.i0533_intersec_uoft_atr _RETURN ON covid.miovision_hourly _RETURN ON covid.miovision_hourly_new _RETURN ON covid.miovision_hourly_temp _RETURN ON rapidto.miovision_segments_comparison_daily _RETURN ON rapidto.miovision_segments_comparison_hourly _RETURN ON activeto.miovision_volumes_15min_adj

volumes_15min_mvt:

_RETURN ON data_requests.i0533_intersec_uoft_tmc _RETURN ON gwolofs.open_leg_issues _RETURN ON miovision_api.volumes_15min_tmc

plus all the insert and clear functions

gabrielwol commented 9 months ago

@chmnata one more option I thought of is to change the aggregation order. Instead of find_gaps -> volumes_15min* we could do volumes_15min* ->find_gaps and add an update volumes_15min* clause to find_gaps using the new inserts. This would require much fewer changes to database than above.