CityofToronto / bdit_data-sources

Data sources used by the Big Data Innovation Team
https://github.com/orgs/CityofToronto/teams/bigdatainnovationteam
GNU General Public License v3.0
38 stars 8 forks source link

`collision_factors` tables don't join neatly to the favoured `collisions` tables #803

Closed Nate-Wessel closed 3 months ago

Nate-Wessel commented 10 months ago

The collisions tables currently in vogue, collisions.events and collisions.involved, have many smallint type columns for the various coded fields. However the corresponding tables in collision_factors are often(always?) text fields. Sometimes a join will work, or partially work, in a way that can easily fool you into thinking it worked as you expected.

SELECT *
FROM collisions.involved
JOIN collision_factors.invtype ON involved.invtype::text = invtype.invtype
LIMIT 100

This query however doesn't return any of the most important categories which are coded with leading zeros, as shown here in collision_factors.invtype:

Capture

Casting the type in the other direction (from text to int) breaks when it encounters the 1A and 1B values.

How should these tables be joined? I think that coming up with my own logic for this on an ad hoc basis (and for every field!) will be very error prone.

mkewins commented 10 months ago

@Nate-Wessel can you confirm that the non-integer codes (i.e. 1A and 1B) only exist for invtype?

In the MOVE collision ETL pipeline, we remap those to integer values: https://github.com/CityofToronto/bdit_move_etl/blob/a4a673d7567998a5b1b2e3e6d9b952ddb3619870/scripts/airflow/tasks/crash_geocoding/A3_involved_fields_norm.sql#L16

1A => 97 1B => 98

We also replaced these values in the MOVE copy of the collision factors tables.

Annotation 2023-12-20 163334

As such, you won't see any collisions with invtype values 1A and 1B if using the new collisions schema.

Here's my suggested approach:

What do you think?

radumas commented 10 months ago

I'm ok with updating collision_factors.invtype to replace

1A => 97
1B => 98

and then these lookup tables should get updated from the MOVE side at some point. @tahaislam can take care of this.

Nate-Wessel commented 10 months ago

can you confirm that the non-integer codes (i.e. 1A and 1B) only exist for invtype?

Some notes on looking through all the tables in this schema:

radumas commented 10 months ago

Permission granted to imploc. It contains zero-padded numbers and a literal 'NULL'

image

gabrielwol commented 3 months ago

I suggest UPDATE collision_factors.imploc SET imploc = NULL WHERE imploc = 'NULL' and then we can close this issue? FYI imploc is not being replicated from Move.

Nate-Wessel commented 3 months ago

I suggest UPDATE collision_factors.imploc SET imploc = NULL WHERE imploc = 'NULL' and then we can close this issue? FYI imploc is not being replicated from Move.

Has anything else been done to address this issue? There were some suggestions above. Have any of them been implemented?

I see that the numeric fields in these tables are still text, often zero-padded. I still think that joining smallints to a text field like that is asking for trouble.

gabrielwol commented 3 months ago

I think the suggestions made by @mkewins could be transfered to MOVE where these tables are being replicated from?

In the meantime,

FROM collisions.involved
LEFT JOIN collision_factors.invtype ON involved.invtype = invtype.invtype::int

(and similar for other collision_factors tables) works now that 1A, 1B are replaced.

mkewins commented 3 months ago

I suggest UPDATE collision_factors.imploc SET imploc = NULL WHERE imploc = 'NULL' and then we can close this issue? FYI imploc is not being replicated from Move.

We have a backlog item for this!

gabrielwol commented 3 months ago

Closing as in MOVE's backlog.