GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

Add Business Improvement Districts to ETL process. #30

Closed danrademacher closed 3 years ago

danrademacher commented 3 years ago

As a first step on this new Walk Mapper project, we need to add Business Improvement Districts to CARTO and then assign Business Improvement Districts to our data, similar to https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/f9785e323dd595808f3f89a0be217392c5af536f/main.py#L408-L422

The data is here: https://drive.google.com/open?id=1qoUAdSabPUMcoSmkeV9NWIYPORXc3kvA

This will be part of the new Walk Mapper data system, and I think it is prudent to add it to Crashmapper, though I'm not sure yet how or when it will get added to the front end there, so it might be best use of effort to add the new tagging going forward but not yet tackle trying to apply it backwards to ~2 million crashes, which seems like a rabbit hole.

Bills to CHEKPEDS:Walk Mapper Phase 1

gregallensworth commented 3 years ago

We would need to go back and populate businessdistrict in existing entries, since the ETL script does a "where intersects and is null" to do them brute force. So, if we don't do them today in a zillion small batches, it will crash tomorrow when the script tries to run all of them in one go.

gregallensworth commented 3 years ago

Assigning existing crashes

Doing the bulk update of existing records was not difficult, and not overly time-consuming. I ran this 25 times, where X was 0 through 24, thereby doing 1/25th of the crashes at a time and not hitting the time limit. This wasn't even close to the 30-second limit, though, and could probably have been done in % 10 instead.

UPDATE crashes_all_prod
SET businessdistrict = a.bidistrict
FROM nyc_businessdistrict a
WHERE crashes_all_prod.the_geom IS NOT NULL AND ST_Within(crashes_all_prod.the_geom, a.the_geom)
AND crashes_all_prod.businessdistrict IS NULL
AND crashes_all_prod.cartodb_id % 25 = X

Result: 178,665 crashes had a BID assigned which is about 10% of crashes. Not surprising, since most of NYC is not in a B.I.D. This left 1,683,199 crashes not assigned to a BID.

Surprising, though: I just ran a test of all of it in one single query, and the run time was under 3 seconds.

UPDATE crashes_all_prod
SET businessdistrict = a.bidistrict
FROM nyc_businessdistrict a
WHERE crashes_all_prod.the_geom IS NOT NULL AND ST_Within(crashes_all_prod.the_geom, a.the_geom)
AND crashes_all_prod.businessdistrict IS NULL

This is well under the 30-second limit, so should work just fine as a new ETL addition without any special considerations.

gregallensworth commented 3 years ago

The code modification was super simple, and is now in place and deployed to Heroku. it should run tomorrow.

I expect that 90% of them will be null, but still would expect at least a few to have this new field assigned and not to see an error in the heroku console.

SELECT cartodb_id, businessdistrict
FROM crashes_all_prod
WHERE cartodb_id > 3151452
gregallensworth commented 3 years ago

All good here. Last night's run did work, no errors, and with a few of the new crashes being assigned but most not, as expected. 👍