GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

Experiment with NYC geocoder for missing geoms #36

Open danrademacher opened 2 years ago

danrademacher commented 2 years ago

After the most recent round of fixes for null geometries, we still have about 10% of records with no geometries:

SELECT count(cartodb_id) FROM crashes_all_prod 

1,950,062 crashes in the whole universe.

SELECT count(cartodb_id) FROM crashes_all_prod 
where 
latitude is  null

187,456 without geometry. That's 9.6%

Of those, 40,557 have neither on_street nor off_street, so they could never be located. That's 2% of total.

SELECT count(cartodb_id) FROM crashes_all_prod 
where 
latitude is  null and 
length(on_street_name)=0 and 
length(off_street_name)=0

35,483 have both cross streets and could be candidates for geocoding. That's 1.8% of the total.

SELECT count(cartodb_id) FROM crashes_all_prod 
where 
latitude is  null and 
length(on_street_name)>0 and 
length(off_street_name)>0

111,416 have either on_street or off_street but not both. That's 5.7% of total.

SELECT count(cartodb_id) FROM crashes_all_prod 
where 
latitude is  null and 
(length(on_street_name)=0 OR 
length(off_street_name)=0) and not
(length(on_street_name)=0 AND 
length(off_street_name)=0)

That's the hardest set. In some cases, one could get to borough and maybe even smaller, but geocoding these would put a point at a location on the map that is almost certainly inaccurate. Most geocoders would pick the geospatial center of the length of a street or avenue. This point would then be included in all area calculations and fall into whatever boundaries happen to overlap with that center. This is a common problem where, for example, the center of the US, somewhere in Kansas, gets assigned all kinds of stuff that's clearly not there.

I recommended to Chrstine we not pursue this, but if she wants the 1.8% increase, next step is a manual try at geocoding those to see how the results look.

┆Issue is synchronized with this Asana task

danrademacher commented 2 years ago

If we do geocode, take a look at https://labs-geosearch-docs.netlify.app/docs/