GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

Scope out import of 1 year of 311 data to WalkMapper #32

Closed danrademacher closed 3 years ago

danrademacher commented 3 years ago

Client question:

What would it take to load the past year of 311 data ?

My previous findings for ALL 311 data from scoping period:

The 311 data from Socrata would be a whole parallel project with its own variant of the current ETL script that would be more similar to the current crash script, but with new effort to figure out all the ways that data can be broken, like we have seen with the crash data.

I see that 311 is an order of magnitude more data — 25 million rows! I did an initial filter in Socrata down to what seemed like the most relevant complaint types, and we get down to 5.8 million with those 12 types. I saved that here so you can take a look: https://data.cityofnewyork.us/Social-Services/Curb-and-Street-311-Calls/inpy-irdd

That is still 2.5X the volume of data that we have in Crashmapper. On the one hand, that shouldn't matter since we have figured out all of the chunk-processing and incremental loading, etc, on the crash data. On the other hand, I would expect lots of new data problems given the amount of data that needs to be pulled in.

At the least, we would need to replicate the "Type Assignment" system like we did for Vehicle Types, since the Complaint Types data is very messy. Speaking of which, I was just looking at the Google Sheet we made for you to maintain the Vehicle Type data and it looks like it's been a while since those were updated. Is that system still useful for you?

Filtering the previous all time view to just last 365 days, we have about 533,000 rows: image

So we could import all those, but we need to crosswalk first with our existing schema for everything but Type and then wire in the cross walk for Type

danrademacher commented 3 years ago

Looks like in Crash Mapper we store the original type as "nyc_type" so we could do the same here, import everything into CARTO, and then do the crosswalk as a separate step

danrademacher commented 3 years ago

Also made a preview of the long tail of all complaint types: https://data.cityofnewyork.us/Social-Services/All-Complaint-Types/itrb-yy23

If we really can just use the core 12, we could probably hard code the crosswalk. Otherwise we'd need to import 2.7 million rows to get all types and then select out of those...

danrademacher commented 3 years ago

though we could reduce the import just by excluding NOISE - RESIDENTIAL: image

danrademacher commented 3 years ago

Discussed with Christine today and determined we're not going to move forward with this -- 311 data is just too coarse and off kilter, which is largely why she's making her own app and building partnerships with folks, including ADA groups and others, who want to use it. Swamping that data with 311 would muddy the message.