NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

add remaining Green Fast Track variables #664

Open damonmcc opened 2 months ago

damonmcc commented 2 months ago

see meeting notes here and source data spreadsheet here

extract new source data

adding logic for new variables

exporting new source data


GFT dataset details

more GFT dataset details

Potential approach to confirming existing variables

  1. Ensure records in variables.csv cover all expected GFT variables by comparing them to the source data Excel file
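The comparison above could be sketched as a set difference. This is only a rough illustration: it assumes both files have already been reduced to plain lists of variable names, and the function name and sample names are hypothetical rather than taken from the repo.

```python
def compare_variables(csv_variables, spreadsheet_variables):
    """Return names missing from variables.csv and names absent from the spreadsheet."""
    # Normalize case and whitespace so cosmetic differences are not reported
    in_csv = {v.strip().lower() for v in csv_variables}
    in_sheet = {v.strip().lower() for v in spreadsheet_variables}
    return {
        "missing_from_csv": sorted(in_sheet - in_csv),
        "not_in_spreadsheet": sorted(in_csv - in_sheet),
    }


# Hypothetical variable names, for illustration only
result = compare_variables(
    ["nyc_parks_properties", "nys_parks_properties"],
    ["NYC_Parks_Properties", "nys_parks_properties", "us_parks_properties"],
)
print(result)
```

Anything listed under missing_from_csv would be a candidate for the "adding a new/missing variable" steps.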

Potential approach to adding a new/missing variable (builds/tests fail until the last step)

  1. Add a record to the variables.csv
  2. Create a dcpy/library/templates/ YAML file for the source (if needed)
  3. Add the source to recipe.yml and _sources.yml (if needed)
  4. Create a staging model by listing it in the properties file and adding a script (if needed)
  5. Create a new intermediate model and test variable_id for null and unique
  6. Add the new intermediate model to the list in int_buffers__all
  7. Update and review the pilot project records in test_expected_pilot_projects.csv by copying the table the csv was compared to
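
For step 5, the project presumably relies on dbt's built-in not_null and unique tests declared in a properties file. The Python below is not that configuration; it is only a small sketch of what those two checks assert about variable_id, with hypothetical rows.

```python
from collections import Counter


def check_variable_ids(rows):
    """Return (count of null variable_ids, sorted list of duplicated variable_ids)."""
    ids = [row.get("variable_id") for row in rows]
    # Treat None and empty strings as null
    null_count = sum(1 for v in ids if v is None or v == "")
    # A value fails the unique check when it appears more than once
    duplicates = sorted(v for v, n in Counter(ids).items() if v and n > 1)
    return null_count, duplicates


# Hypothetical intermediate model rows
rows = [
    {"variable_id": "park_a"},
    {"variable_id": "park_a"},
    {"variable_id": None},
    {"variable_id": "park_b"},
]
null_count, duplicates = check_variable_ids(rows)
print(null_count, duplicates)
```

A model passes both checks only when null_count is 0 and duplicates is empty.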
damonmcc commented 2 months ago

(not vital, just a thought)

for geometries that are eventually checked for intersections with all lots during int__spatial_flags, it may be nice to distinguish between buffered and non-buffered polygons before that model uses them

they could all still end up in the same table, but that table is currently called int__all_buffers

damonmcc commented 2 months ago

after new Shadows data has been added to GFT, the DAG with the filter --select intermediate is below.

while adding logic to use the new data, I'll try to add a test we've talked about which would warn or error when new variables like nyc_parks_properties don't appear in the final table
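
A rough sketch of the kind of test described above, assuming the expected variables and the variables found in the final table are both available as lists. The function name and the error switch are hypothetical, not an existing dcpy or dbt feature.

```python
import warnings


def check_variables_present(expected, found, error=False):
    """Warn (or raise, if error=True) when expected variables are absent from the final table."""
    missing = sorted(set(expected) - set(found))
    if missing:
        message = "variables missing from final table: " + ", ".join(missing)
        if error:
            raise ValueError(message)
        warnings.warn(message)
    return missing


# Hypothetical contents of the final table
missing = check_variables_present(
    expected=["nyc_parks_properties", "nys_parks_properties"],
    found=["nys_parks_properties"],
)
print(missing)
```

Starting with a warning and switching to error=True once all variables are wired in would keep builds green while the remaining variables are added.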

Screenshot 2024-03-29 at 7 53 42 AM
damonmcc commented 2 months ago

for some reason during the ST_INTERSECTS part of int_flags__spatial, each lot is "intersecting" twice with polygons from int_buffers__nys_parks_properties

Screenshot 2024-04-02 at 5 19 47 PM

the buffered NY State Parks polygons seem ok though

fvankrieken commented 1 month ago

for some reason during the ST_INTERSECTS part of int_flags__spatial, each lot is "intersecting" twice with polygons from int_buffers__nys_parks_properties

Screenshot 2024-04-02 at 5 19 47 PM

the buffered NY State Parks polygons seem ok though

Was curious if there was an odd data issue so poked around. Duplicated line here!

Damon edit after offline chat: they aren't duplicated

damonmcc commented 1 month ago

looks like int_buffers__us_parks_properties is actually just stg__nys_parks_properties, so the union all in int_buffers__all has duplicates

Screenshot 2024-04-03 at 10 17 11 AM

@sf-dcp looks like we can drop the int_buffers__us_parks_properties model? I don't see any mention of US Parks in GIS's source data excel sheet
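
To illustrate why this produces double intersections: a UNION ALL keeps every row from every input, so two intermediate models built from the same staging table contribute each polygon twice. A minimal sketch with made-up rows (the tuples stand in for staging records and are not real data):

```python
from collections import Counter


def union_all(*models):
    """Concatenate rows from every model without de-duplicating, like SQL UNION ALL."""
    return [row for model in models for row in model]


# Made-up staging rows: (name, geometry placeholder)
stg_nys_parks = [("nys_park_1", "POLYGON A"), ("nys_park_2", "POLYGON B")]

int_nys_parks = list(stg_nys_parks)
int_us_parks = list(stg_nys_parks)  # the bug: built from the NYS staging rows

all_buffers = union_all(int_nys_parks, int_us_parks)
duplicated = [row for row, n in Counter(all_buffers).items() if n > 1]
print(len(all_buffers), duplicated)
```

Every lot that intersects one of these polygons would then match it twice in the spatial join.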

sf-dcp commented 1 month ago

looks like int_buffers__us_parks_properties is actually just stg__nys_parks_properties, so the union all in int_buffers__all has duplicates

Screenshot 2024-04-03 at 10 17 11 AM

@sf-dcp looks like we can drop the int_buffers__us_parks_properties model? I don't see any mention of US Parks in GIS's source data excel sheet

Wow that's a great catch! int_buffers__us_parks_properties doesn't use the correct staging table. It should use stg__us_parks_properties instead. US Parks is listed as Federal Parks property in the GIS spreadsheet

sf-dcp commented 1 month ago

for some reason during the ST_INTERSECTS part of int_flags__spatial, each lot is "intersecting" twice with polygons from int_buffers__nys_parks_properties

Screenshot 2024-04-02 at 5 19 47 PM

the buffered NY State Parks polygons seem ok though

Was curious if there was an odd data issue so poked around. Duplicated line here!

It doesn't seem to be duplicated as NYS and NYC properties are different

damonmcc commented 1 month ago

notes from DE & GIS chat on 4/2

Lot Zoning info

Natural Resources

Historic Resources (Alex)

Rail

Rail Yards

Beaches

sf-dcp commented 1 month ago

Hi @croswell81 & @jackrosacker,

I've been working on processing steps for Shadows/Open Space data, and I have questions/concerns. Please see them below by variable name:

If it's easier to meet to go over these questions, LMK!

croswell81 commented 1 month ago

@sf-dcp

I asked Planning Support if they wanted to use Global ID for most of the open space datasets that have one, since it is the only unique id, and they were only concerned with name.

sf-dcp commented 1 month ago

Update as of 4/9/24: Shadows/Open Space logic has been implemented except for 1 outstanding item (filter WPAA data by status). It appears that the WPAA recipe currently uses an incorrect Esri link to pull the data, which is why the status column is absent.

TODO:

croswell81 commented 1 month ago

@sf-dcp the link to the wpaa rest feature service that will be updated is: https://services5.arcgis.com/GfwWNkhOj9bNBqoJ/arcgis/rest/services/nywpaa/FeatureServer/0
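
For reference, a sketch of how a query against this layer could be built with the standard ArcGIS REST query parameters (where, outFields, f). It only constructs the URL and makes no request. The field name status comes from the comment above, but its exact spelling in the service and the value 'Open' are assumptions, and build_query_url is a hypothetical helper rather than part of dcpy.

```python
from urllib.parse import urlencode

LAYER_URL = (
    "https://services5.arcgis.com/GfwWNkhOj9bNBqoJ/arcgis/rest/services/"
    "nywpaa/FeatureServer/0"
)


def build_query_url(layer_url, where="1=1", out_fields="*", fmt="geojson"):
    """Build a query URL for an ArcGIS feature layer without sending it."""
    params = {"where": where, "outFields": out_fields, "f": fmt}
    return f"{layer_url}/query?{urlencode(params)}"


# 'Open' is a placeholder value; the real status values are not known here
url = build_query_url(LAYER_URL, where="status = 'Open'")
print(url)
```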

croswell81 commented 1 month ago

@sf-dcp BYTES has also been updated with the correct link now.

damonmcc commented 1 month ago

from @croswell81

New Natural Resources dataset. Three check flag fields from DOB, I added each as a separate dataset in the CEQR Type II Data Source Review doc. They are NYCDOB Tidal Wetland, NYCDOB Freshwater Wetland, NYCDOB Coastal Erosion Hazard Area.

One table with bbl and flag (X or null) field for each variable. Join to PLUTO and create a lot based dataset for each of the three variables.
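
The join described above could look roughly like this. It is a sketch only: the flag field names and bbl values are hypothetical, and the real work would presumably happen in SQL models rather than Python.

```python
def lots_per_flag(dob_rows, pluto_bbls, flag_fields):
    """For each flag field, list the PLUTO bbls where that field is marked 'X'."""
    flagged = {field: [] for field in flag_fields}
    for row in dob_rows:
        if row["bbl"] not in pluto_bbls:
            continue  # inner join: skip DOB records with no matching lot
        for field in flag_fields:
            if row.get(field) == "X":
                flagged[field].append(row["bbl"])
    return flagged


# Hypothetical field names for the three DOB variables
FLAG_FIELDS = ["tidal_wetland", "freshwater_wetland", "coastal_erosion_hazard_area"]

dob_rows = [
    {"bbl": "1000010001", "tidal_wetland": "X", "freshwater_wetland": None},
    {"bbl": "1000010002", "tidal_wetland": "X", "freshwater_wetland": "X"},
    {"bbl": "9999999999", "tidal_wetland": "X"},  # not present in PLUTO
]
pluto_bbls = {"1000010001", "1000010002", "1000010003"}

flagged = lots_per_flag(dob_rows, pluto_bbls, FLAG_FIELDS)
print(flagged)
```

Each key in the result corresponds to one lot-based dataset, so the three variables stay separate even though they arrive in one wide table.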

fvankrieken commented 2 days ago

Just adding todos in one place for myself, a bit redundant but just for convenience.

damonmcc commented 2 days ago

@jackrosacker @caseysmithpgh just tagging yall here because you'll have to add aliases for 3 new rows in the source_data_versions table once this is done

fvankrieken commented 2 days ago

@jackrosacker - the data from DOB is just a wide table of flags per bbl. Do you still want this exported with the source data in some way? It's a bit different from say CATS permits, where we look up a bbl but actually use that geometry to create a buffer, rather than just using the source dataset to determine if there's a flag. This is more like E-Des in that way. So if it sounds good, I think it would just make sense to include these flags in the final table without exporting a source layer.

But of course, if it'd be useful to have a feature that's every lot that has these specific flags, we can easily add that. let me know

damonmcc commented 1 day ago

@fvankrieken

@jackrosacker - the data from DOB is just a wide table of flags per bbl. Do you still want this exported with the source data in some way? It's a bit different from say CATS permits, where we look up a bbl but actually use that geometry to create a buffer, rather than just using the source dataset to determine if there's a flag. This is more like E-Des in that way. So if it sounds good, I think it would just make sense to include these flags in the final table without exporting a source layer.

(after chats with @jackrosacker and @fvankrieken)

E-Des is a good comparison for these tabular (rather than spatial) variables. we won't export source data layers for tabular variables

and since the new Exposed Rail Yards will be an input to the existing Exposed Railway question/flag, these are the potential export layer impacts

fvankrieken commented 1 day ago

@damonmcc thanks for the clear write-up.

Railyards are polys. These two have the same flag_id_field_name then, so should they have the same variable_type as well? I would lean toward keeping them distinct here, but if you think it aligns better with other things to have them the same that's fine too.

Re the DOB natural resource flags - these will just get wrapped up in the other natural resources, correct? And if so, what should their variable_type and variable_id be? Since there's one row per bbl in the dataset (so each bbl can have at most one "NYCDOB Tidal Wetland" flag), maybe just have variable type and id be the same for legibility? "nycdob_tidal_wetland"? or something like that?

damonmcc commented 1 day ago

@fvankrieken

for Rail Yards: agree that a distinct variable_type makes more sense

for the DOB natural resource flags: agree the variable type and ID should be the same here (unlike E-Des where we get distinct IDs). but I think the variable ID should be more legible than the type (like Archeologic Areas)

jackrosacker commented 1 day ago

E-Des is a good comparison for these tabular (rather than spatial) variables. we won't export source data layers for tabular variables

In order to symbolize the DOB features on the map, will they be exported as part of the unioned Natural Resources dataset, or am I understanding that since these are per-lot attributes you won't be exporting any data and I should be displaying a view of the lots dataset with each DOB lot flagged?