NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team
14 stars 0 forks source link

GFT - Natural Resources #704

Closed fvankrieken closed 2 months ago

fvankrieken commented 2 months ago

closes #672

Will be easiest to go commit by commit

More maintenance-type stuff

  1. unrelated issue, make it easier to use dcpy from cli to import single dataset
  2. simple tweak to clip_to_geom macro
  3. rename elevated railways to "exposed railways" as per the guiding excel sheet
  4. limit usfws_wetlands to just 4 watersheds that touch NYC. Original dataset was gigabites (all of NY state), so this makes it a little easier to use

Meat of new logic

  1. for all existing natural resources (from #685 ), add to gft up to int_buffers__all. Since these datasets are not buffered, I currently have buffer as null in their staging/buffer tables, then coalesce at int_buffers__all. Do we feel like this is unsafe? Or unintuitive. It probably just makes more sense to do this upstream, with the individual tables, I did it this way just to keep those tables more specific to their source data. also, this commit does not add expectations to _sources.yml, that comes in another commit
  2. add three additional datasets, from library to int_buffers__all
  3. add expectations for all _sources, both original ones and those added in commit 6
  4. get data into the wide table. This adds a custom column which aggregates all natural resources
  5. remove constraint that variables in seed should have survey question - I think these are in flux, plus not set for some of them at the moment
  6. HEXAGONS. This got runtime for this query from 36 minutes down to 2ish. This is all mainly around the single monstrous multipolygon of natural_resource_shadows, which combines all natural resources (and then buffers them, as they're not buffered to start). It might have been simpler to just leave them as individual buffered polygons and see if that was somewhat performant, but this was fun. And also, it's nice to treat all variables the same in this case, and not need to aggregate that column later
fvankrieken commented 2 months ago

Hex runtime comparison:

36 min with no "tiling"

10 minutes with tiling

3 minutes with tiling and no distance calculated for this specific variable

Got it down to 2ish minutes locally by fiddling with hexagon size

fvankrieken commented 2 months ago

Getting pilot projects expecations sorted out, but code is ready for review.

Not quite sure if it makes senes but leaving out export for now just because I've been tinkering up till this point for a while, and would just like to get merged. Export should be straightforward, maybe would make sense to just include here but I'd still prefer to leave as is

fvankrieken commented 2 months ago

Latest build still running

damonmcc commented 2 months ago

agree that adding new sources to the exported FGDB is a pain and worth doing after this

just to note, one insight I've seen come from the export part is unexpected geometry types

damonmcc commented 2 months ago

Since these datasets are not buffered, I currently have buffer as null in their staging/buffer tables, then coalesce at int_buffers__all. Do we feel like this is unsafe?

It doesn't seem unsafe, maybe isn't consistent with something like the airports that aren't actually buffered either. But down to go with this as is and revisit how we handle non-buffered buffers (call them all spatial or something else that's inclusive of buffered and non-buffered source data)

fvankrieken commented 2 months ago

Build is currently failing due to testing expected values, from nondeterministic row_number here

I'll work on a fix - definitely a good thing to not be having random numbers in an output table. Moving towards just unioning these things (which isn't as much of a performance cost thanks to hexes)

fvankrieken commented 2 months ago

Build now passing, added some helpful comments in a final rebase. Good to go

sf-dcp commented 2 months ago

I probably won't get to PR review until tomorrow, and I don't want to hold you back. Feel free to merge without my input

fvankrieken commented 2 months ago

I probably won't get to PR review until tomorrow, and I don't want to hold you back. Feel free to merge without my input

Sounds good - I added a few helpful code comments in your honor!