Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3
stars
0
forks
source link
Experiment with NYC geocoder for missing geoms #36
After the most recent round of fixes for null geometries, we still have about 10% of records with no geometries:
SELECT count(cartodb_id) FROM crashes_all_prod
1,950,062 crashes in the whole universe.
SELECT count(cartodb_id) FROM crashes_all_prod
where
latitude is null
187,456 without geometry. That's 9.6%
Of those, 40,557 have neither on_street nor off_street, so they could never be located. That's 2% of total.
SELECT count(cartodb_id) FROM crashes_all_prod
where
latitude is null and
length(on_street_name)=0 and
length(off_street_name)=0
35,483 have both cross streets and could be candidates for geocoding. That's 1.8% of the total.
SELECT count(cartodb_id) FROM crashes_all_prod
where
latitude is null and
length(on_street_name)>0 and
length(off_street_name)>0
111,416 have either on_street or off_street but not both. That's 5.7% of total.
SELECT count(cartodb_id) FROM crashes_all_prod
where
latitude is null and
(length(on_street_name)=0 OR
length(off_street_name)=0) and not
(length(on_street_name)=0 AND
length(off_street_name)=0)
That's the hardest set. In some cases, one could get to borough and maybe even smaller, but geocoding these would put a point at a location on the map that is almost certainly inaccurate. Most geocoders would pick the geospatial center of the length of a street or avenue. This point would then be included in all area calculations and fall into whatever boundaries happen to overlap with that center. This is a common problem where, for example, the center of the US, somewhere in Kansas, gets assigned all kinds of stuff that's clearly not there.
I recommended to Chrstine we not pursue this, but if she wants the 1.8% increase, next step is a manual try at geocoding those to see how the results look.
After the most recent round of fixes for
null
geometries, we still have about 10% of records with no geometries:1,950,062 crashes in the whole universe.
187,456 without geometry. That's 9.6%
Of those, 40,557 have neither on_street nor off_street, so they could never be located. That's 2% of total.
35,483 have both cross streets and could be candidates for geocoding. That's 1.8% of the total.
111,416 have either
on_street
oroff_street
but not both. That's 5.7% of total.That's the hardest set. In some cases, one could get to borough and maybe even smaller, but geocoding these would put a point at a location on the map that is almost certainly inaccurate. Most geocoders would pick the geospatial center of the length of a street or avenue. This point would then be included in all area calculations and fall into whatever boundaries happen to overlap with that center. This is a common problem where, for example, the center of the US, somewhere in Kansas, gets assigned all kinds of stuff that's clearly not there.
I recommended to Chrstine we not pursue this, but if she wants the 1.8% increase, next step is a manual try at geocoding those to see how the results look.
┆Issue is synchronized with this Asana task