GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

New ETL script, similar to main.py, but for Walk Mapper #31

Closed danrademacher closed 3 years ago

danrademacher commented 3 years ago

We need to develop a script to consume data from a new database, and then run it through a similar set of intersections as we do crashes in this code, https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/f9785e323dd595808f3f89a0be217392c5af536f/main.py#L408-L517, plus a new Business Improvement Districts as described in #30.

The new application is called "WalkMapper" and it is a crowdsourcing tool that will allow New Yorkers to log all kinds of sidewalk problems and obstructions, and the it will auto-file complaints for them. Our remit is to get the data into CARTO for later mapping. In this phase, the only end-user map will be built with CARTO's GUI tools.

There is very little data in this system as of now. Unlike Crashmapper drawing from a massive public dataset, this is a new app that will build up data over time.

Steps as I understand them:

  1. Confirm programmatic access to the database using new ENVvars based on the info in Walk Mapper CHEKPEDS app DB credentials. I have tested access via a desktop GUI called Sequel Pro and was able to inspect the data. My hope is we can use config vars and Python scripts to connect directly to the database. There's really no API though. My assumption is we can connect directly and run SQL queries, but that might take some R+D on how do do that in a Heroku-friendly way.
  2. Access the core tables: obstructionDetails and obstructionImageDetails and categoryMaster
  3. Spec out the fields we'll need in CARTO -- this will likely need to be a separate issue once you get the basic data connection figured out.

For the CARTO GUI map, we'd want to be able to show Obstruction location, date reported, Category/type, geometry intersections (burough, city council, etc), and one or more images. We can ignore any assignments of borough or other boundaries in the application data since we'll get better data through intersection. And then we also need a smart way to pull in the file names of related images. If we can do this in an array, then we can include multiple image names in the same field and not have to recreate the whole relational table structure on our side.

For now, take a look around and make sure you can access the database and see the way forward to consume data and write to CARTO.

This project bills to: CHEKPEDS:Walk Mapper Phase 1

gregallensworth commented 3 years ago

Logging in worked A-OK and I see the 3 basic tables, and can query a list of them like this. So that's our basics.

SELECT
    o.obstructionLat, o.obstructionLong,
    o.createdAt,
    CONCAT(o.buildingNumber, ' ', o.streetName) AS address,
    c2.name AS topcategory, c1.name AS subcategory,
    o.isFirstTime, o.isSecondTime, o.isThirdTime,
    o.isCompleted, o.completedDate
FROM
    obstructionDetails o,
    categoryMaster c1, categoryMaster c2
WHERE
    o.categoryId = c1.id AND c1.parentId = c2.id
AND NOT o.isDelete
;

Photos being relational and multiple, we want to collapse them into a single query with fields like photo1 photo2 photo3 and presume some max number that would really come up. This would be done by a whole second query, though of course slows down the running time. Smarter methods such as CTEs and window functions, are not supported by their MySQL 5.6.36 (2017) and are new to MySQL 8, so the second-query method is how we must do it.

SELECT image FROM obstructionImagesDetails WHERE obstructionId=XXX ORDER BY id;
gregallensworth commented 3 years ago

The basics are now in place:

New CARTO table is named walkmapper_obstructions https://chekpeds.carto.com/tables/walkmapper_obstructions/table

image

image

This does not yet include images:

Question about isdelete field, if this is something we we need to acknowledge. Current behavior: this is filtered out from queries when syncing to CARTO (record will not be created nor updated), but has no effect on records already in CARTO.

And of course scheduling a daily run:

Other potential TBD:

gregallensworth commented 3 years ago

Scheduling worked, after I added mysqlclient to requirements.txt The expected records were updated and inserted, no sweat.

danrademacher commented 3 years ago

I checked with client on this, and it sounds like we need to do one extra thing, as you anticipated:

isDelete does mean that the record is deleted, and we should not import it. But also, we should check to see if any records we imported previously now have isDelete set to true and we should delete those from our database.

For the images, I appended the URL prefix to the test map on CARTO and it loads just fine: https://chekpeds.carto.com/viz/a71e66b8-cb4b-4c0a-aee7-3ef79a3f4b1a/public_map I think we can leave the base url out of the data and apply it in whatever frontend we have a sitewide parameter, at least until and unless she starts pulling images from multiple sources.

gregallensworth commented 3 years ago

isDelete support is now working

danrademacher commented 3 years ago

Calling this done!