NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Green Fast Track support #486

Open AmandaDoyle opened 4 months ago

AmandaDoyle commented 4 months ago

Project Description

DE will extract source data, do geospatial calculations, and produce outputs needed for the new Green Fast Track survey GIS is making.

Timeline

Friday 2/23

For a demo of the POC, DE and GIS will focus on a subset of CEQR categories: Zoning, E-designation, Air, Noise.

Wednesday 3/20

Planning Support will host a forum with applicant reps for feedback on the Eligibility Tool. The goal is to confirm it would work for their workflow.

Friday 4/19

Improve GFT tool by incorporating remaining variables and critical feedback.

June 3 public launch

After public launch

Background

The City Environmental Quality Review (CEQR) process identifies and assesses the potential environmental impacts of land use actions that are proposed by public or private applicants.

DCP plans to streamline housing construction by allowing potential applicants for land use actions to determine whether their project is minor enough to qualify as CEQR Type II, and therefore exempt from environmental review. This is now known as the Green Fast Track Eligibility Tool, or the GFT tool.

GIS is creating a survey for potential applicants to use to determine if their project is CEQR Type II.

The determination will be based on the project area. A project area is defined as all relevant tax lots. For each tax lot, all relevant CEQR considerations must be checked.


Potential geospatial logic

Zoning districts

Air quality: add three binary fields indicating whether the tax lot intersects any of the following buffers:

Arterial highways and vent structures

Airport

Natural Resources and Shadows

Historic and Shadows

Open Space/Shadows

E-designations
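As a sketch of how the binary buffer flags above could be assembled, assuming each lot's distance to the nearest feature of each type has already been computed in a GIS step (the buffer distances here are placeholders, not confirmed project values):

```python
# Hypothetical buffer distances in feet; the real values are set by the
# CEQR Type II criteria and are still under review in this thread.
BUFFERS_FT = {
    "arterial_highway": 75,
    "vent_structure": 75,
    "airport": 400,
}

def air_quality_flags(distances_ft: dict) -> dict:
    """Return one boolean flag per feature type: True when the lot's
    nearest-feature distance falls within that feature's buffer."""
    return {
        f"{feature}_flag": distances_ft.get(feature, float("inf")) <= buffer
        for feature, buffer in BUFFERS_FT.items()
    }

# A lot 60 ft from an arterial highway and 1200 ft from an airport;
# no vent structure distance recorded, so that flag stays False.
print(air_quality_flags({"arterial_highway": 60, "airport": 1200}))
```

The same shape generalizes to the other proximity checks (elevated rail, natural resources, etc.): one precomputed distance per feature class, one boolean column per question.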

croswell81 commented 4 months ago

@AmandaDoyle @jackrosacker and I just reviewed this issue. Our comments and questions are below. In general we still need to review the data sources which we intend to start this week.

Zoning districts (Zoning Values within PLUTO)
• Is one zoning district range being assigned to a single column, or are we checking off whether those zoning ranges apply to the project in three Boolean columns?
• Is there a hierarchy for the different zoning district ranges: R1 through R4, R5 through R10, C or M? (i.e. if a BBL falls into multiple categories, is one range assigned over the others?)

Air quality – GIS Team needs to vet the sources

Arterial highways and vent structures – GIS Team needs to review the sources
• Arterial highways: if a project is located next to multiple features, do we need to list the names of all or just one? Closest?
• Arterial highway source – is this just the arterial highways in the DCM_ArterialsMajorStreets open dataset?
• Vent tower: is this a distinct question or part of the arterial highway question? Do we need to list the name of the vent?

Elevated subway or railway
• If a project is located next to multiple features, do we need to list the names of all or just one? Closest?
• Source: GIS Team to compare LION vs data received from PS

Airport – GIS Team needs to vet the source
• Confirm if EWR is excluded

Natural Resources and Shadows – GIS Team needs to review the sources (who determines what source is valid?)
• If a project is located next to multiple features, do we need to list the names of all or just one? Closest?
• Noticed a state and a federal wetland dataset – should we use only one? How do we handle conflicts?
• Beaches: need to identify a source

Historic and Shadows – GIS Team needs to review the sources
• If a project contains or is located next to multiple features, do we need to list the names of all or just one? Closest?

Open Space/Shadows – GIS Team needs to review the sources
• Should we include federal park properties?

E-designations
• Do we need to create a field for each type of e-designation (noise, air quality, hazmat)?
• For lots with multiple e-designations, should they be concatenated into one field?
• Should we include restrictive declarations (i.e. e-numbers that start with “R”)?
• Source: e-designation table (csv)

fvankrieken commented 4 months ago

Just wanted to clarify - who is point person (on our end) for questions for PS?

fvankrieken commented 4 months ago

("our end" meaning GDE, not DE)

AmandaDoyle commented 4 months ago

@croswell81 and @jackrosacker please see my answers below

Zoning districts (Zoning Values within PLUTO)
• Is one zoning district range being assigned to a single column, or are we checking off whether those zoning ranges apply to the project in three Boolean columns?
• Is there a hierarchy for the different zoning district ranges: R1 through R4, R5 through R10, C or M? (i.e. if a BBL falls into multiple categories, is one range assigned over the others?)

If it's possible to calculate the Zoning District for the CEQR II form using the existing four zoning district fields in PLUTO, that may be preferable. To be considered for the R5 through R10 residential zoning district, the four zoning district fields must only include R5 through R10 values and be absent of any other zoning district. Whereas to be considered R1 through R4, an R1 through R4 value must simply appear in one of the four zoning district fields. A BBL is assigned a zoning district if 10% or more of the BBL's area is covered by that zoning district. Zoning district values are assigned from most coverage to least, with Zoning District 1 covering the greatest area and Zoning District 4 the least. For CEQR II we just need to know if a project is R5 through R10 or R1 through R4.
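The 10% coverage rule described above could be sketched as follows; the function name and input shape are illustrative:

```python
def assign_zoning_districts(coverage: dict, threshold: float = 0.10, max_fields: int = 4) -> list:
    """Assign up to four zoning district values to a BBL.

    coverage maps district -> fraction of the lot's area covered. A district
    qualifies only at >= 10% coverage, and qualifying districts are ordered
    from greatest coverage (Zoning District 1) to least (Zoning District 4).
    """
    qualifying = [(d, frac) for d, frac in coverage.items() if frac >= threshold]
    qualifying.sort(key=lambda pair: pair[1], reverse=True)
    return [district for district, _ in qualifying[:max_fields]]

# A lot that is 55% R6, 38% C4-3, and 7% R3-2: the R3-2 sliver is dropped.
print(assign_zoning_districts({"R6": 0.55, "C4-3": 0.38, "R3-2": 0.07}))
# -> ['R6', 'C4-3']
```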

Arterial highways and vent structures – GIS Team needs to review the sources
• Arterial highways: if a project is located next to multiple features, do we need to list the names of all or just one? Closest?

We just need to know if a project is near an arterial highway or vent structure, or not. This would be a boolean value; we do not need to report any name.

• Arterial highway source – is this just the arterial highways in the DCM_ArterialsMajorStreets open dataset

That is a question for Andrew E.

• Vent tower: is this a distinct question or part of the arterial highway question? Do we need to list the name of the vent?

You do not need to list the name of the vent. Check with Planning Support on whether vents need to be reported separately from highways.

Elevated subway or railway
• If a project is located next to multiple features, do we need to list the names of all or just one? Closest?
• Source: GIS Team to compare LION vs data received from PS

We just need to know if a project is near an elevated subway or rail, or not. This would be a boolean value; we do not need to report any name.

Natural Resources and Shadows – GIS Team needs to review the sources (who determines what source is valid?)
• If a project is located next to multiple features, do we need to list the names of all or just one? Closest?
• Noticed a state and a federal wetland dataset – should we use only one? How do we handle conflicts?
• Beaches: need to identify a source

Like the above, we just need to know if a project is near a natural resource, not what it is. Since we only need a proximity flag, we don't need to worry about conflicts between sources. I propose the GIS team recommend which data source should be used.

Historic and Shadows – GIS Team needs to review the sources
• If a project contains or is located next to multiple features, do we need to list the names of all or just one? Closest?

Same as above.

Open Space/Shadows – GIS Team needs to review the sources
• Should we include federal park properties?

I'd ask PS.

E-designations
• Do we need to create a field for each type of e-designation (noise, air quality, hazmat)?
• For lots with multiple e-designations, should they be concatenated into one field?
• Should we include restrictive declarations (i.e. e-numbers that start with “R”)?
• Source: e-designation table (csv)

We need to work through e-designations with PS and EARD.

Regarding the GDE point of contact to PS and EARD - I'd prefer not to play telephone, so whoever is doing the work should reach out directly. But loop me in on all communications, and come to me first with any questions you want to discuss internally before reaching out to PS and EARD.

damonmcc commented 4 months ago

from sprint planning on 1/29

damonmcc commented 3 months ago

from sprint planning on 2/13

DE and GIS will prioritize 3 categories for a demo on Friday 2/23: Zoning, Air, Noise

GIS needs a version of the new data to load into the survey map

Jack has been using a dummy dataset to prototype the map

croswell81 commented 3 months ago

I am cleaning up the CEQR_Type_II_Data_Source_Review.xlsx doc. I will let you know when all edits are completed.

damonmcc commented 3 months ago

We've imported all source data and confirmed output specifications for the POC

Now working on build logic. To help keep GIS unblocked, we'll export mock data by tomorrow morning.

damonmcc commented 3 months ago

notes from Survey demo on 2/23

jackrosacker commented 3 months ago

Noting a couple of data items that popped up last week that I don't want to forget:

croswell81 commented 3 months ago

Since Arterial Highway is such a specific dataset, I would recommend keeping DCM arterial centerline data and doubling the buffer to 150 feet. We should confirm with Planning Support and EARD but it would be a much easier task than trying to recreate the arterial highway list from another source. @jackrosacker

jackrosacker commented 3 months ago

Could we use the centerline to select associated planimetric features within a set distance, and then base the 75' buffers off of those selected features? (Damon's idea from last week I think)

croswell81 commented 3 months ago

It would take someone going in and cleaning up all other roadways within that 75' buffer. The planimetric data doesn't have the attribute data needed to narrow down what could be selected.

croswell81 commented 3 months ago

Also, we should remember this is a flag, not a precise calculation. The Arterial Highway layer is the most accurate dataset, and it lets us use an existing public resource. Adjusting the buffer is the better way to go.

The planimetrics dataset is derived from aerial images and is updated only every 2-4 years, so it's not as reliable.

jackrosacker commented 3 months ago

Okie doke. We can run the increased buffer idea by folks in the pm meeting today

damonmcc commented 3 months ago

@croswell81 @jackrosacker

During one of our chats with Planning Support, there was mention of an Air Quality check I don't think we're doing yet: "are you within 400 feet of a manufacturing land use?"

It seems like the logic for this flag would be something like:

And the source data to pass along for the map would be "all manufacturing lots"

Does that sound right?

jackrosacker commented 3 months ago

That sounds tentatively correct to me, pending any changes from Matt. I added a placeholder question for this in the survey as well, until we have final language. I forwarded you the email from Alex with the initial question, and am placing the email text here:

Hi ITD Team,

We had a discussion with Stephanie today that included the desire for an addition to the air quality section of the tool. We currently do not have any logic to flag the potential for an unpermitted air quality source, and it is just in a note necessitating a manual check. We are hoping it may be possible to query within a 400 foot buffer from the site if there is any lot with a manufacturing land use (using PLUTO). If there is, we would flag that they need to check this study area for unpermitted industrial sources.

The question would be something like: Are there any manufacturing or processing facilities operating within 400 feet of the Development Site that may be unpermitted sources of air quality emissions?

We are happy to use some of our time next week to discuss.  

Definitely an item to work on further with them.

croswell81 commented 2 months ago

@damonmcc @jackrosacker They are looking to identify uses, not zoning, so the query or filter should use Land Use code = 06.

We should confirm with Planning Support and EARD there are no other Land Uses or building codes that allow "processing facilities" besides '06'.
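Assuming PLUTO-style lot records with a land use code and a precomputed distance from the project site, and that Land Use code '06' is the only qualifying code (still to be confirmed, per the above), the check could be sketched as:

```python
MANUFACTURING_LANDUSE = {"06"}  # assumed set of qualifying land use codes
BUFFER_FT = 400

def has_nearby_manufacturing(lots: list) -> bool:
    """True if any lot with a manufacturing land use lies within the buffer."""
    return any(
        lot["landuse"] in MANUFACTURING_LANDUSE and lot["distance_ft"] <= BUFFER_FT
        for lot in lots
    )

# Illustrative lots near a hypothetical project site.
lots = [
    {"bbl": "1000010001", "landuse": "01", "distance_ft": 120},
    {"bbl": "1000010002", "landuse": "06", "distance_ft": 350},
]
print(has_nearby_manufacturing(lots))  # -> True
```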

jackrosacker commented 2 months ago

@damonmcc, Alex shared this pilot area with me: 3285 Fulton Street BBLs available in the ZAP link

And a second pilot area for BBL 4122770001. I'll see if we can add you to the PPT, I'm not sure how much of the draft doc info is ok to live here.

jackrosacker commented 2 months ago

@damonmcc et al, I spotted a couple of possible issues with the zoning classifications within the green fast track bbl dataset.

Item 1: Lots with some R1-R4 present are getting classified as R5-R10

image

In this example, lot 1 (red underline) is classified as R1-R4, while lot 70 (blue underline) is classified as R5-R10. I think both should be R1-R4, since in both cases part of the lot is within that range. I could be wrong about this, so let's discuss before you start re-engineering anything.

Item 2: Lots with no zoning category at all See BBL 4124950002

I haven't done any form of comprehensive error checking yet, just flagging these as they pop up

fvankrieken commented 2 months ago

From my understanding, we (AD and PS) agreed on "pluto classification" as the truth for whether a lot is R1-R4 (or contains any specific district). That threshold is 10% coverage of the lot.

croswell81 commented 2 months ago

@jackrosacker @fvankrieken beat me to it. We don't assign zoning to PLUTO lots if it covers less than 10%, and that will apply to this app and data as well.

fvankrieken commented 2 months ago

I'll look into 4124950002

jackrosacker commented 2 months ago

Ok, I'd lost track of the fact that we're following the 10% rule for GFT as well. I'll add that to the zoning aggregation description in my presentation.

@fvankrieken - seeing ~10k lots without any zoning classification in what I believe is the latest version of the dataset. There might be a legitimate reason for some of these that's eluding me

fvankrieken commented 2 months ago

Current implementation is based on the original zoning logic, which said one of

4124950002 is R6 and C4, meaning it fits none of those three categories. I think you and @damonmcc had come up with a revised little decision tree, which will need to be implemented.

Though still not quite sure what this should be "flagged" as? For this specific case of R6 and C4?

jackrosacker commented 2 months ago

Pretty sure this one would be a C or M lot. My understanding of the language is that it only requires the presence of C or M, not necessarily "wholly."

This is probably a post-demo discussion. Sounds like we should discuss and refine the zoning decision tree, and choose a non-null text value to indicate lots that are in none of the buckets above ("Other", "Ineligible", etc.)

As I'm writing this I'm wondering if we also need to account differently for lots that are a mixture of R1-R4 and C or M.

croswell81 commented 2 months ago

After the demo, maybe GIS and DE can get together again to review the logic of each field, including updating buffer distances.

jackrosacker commented 2 months ago

Logging some thoughts here for how to account for lots with both R1-R4 and C or M present. Each option represents a hypothetical project site in which four lots are selected, each with the following zoning values.

Option 1:

Table has a single zoning category column, with four possible options per BBL.

| BBL | Zoning Category |
| --- | --- |
| 1111111111 | R1-R4 |
| 2222222222 | R5-R10 |
| 3333333333 | C or M |
| 4444444444 | R1-R4 with C or M |

Option 2:

Table has two zoning columns, resi and c or m. There are still four possible zoning options per BBL, but pulled from values combined across both columns. This example uses the actual value within the C or M column, not just a Yes/No option.

| BBL | Residential Zoning | C or M Zoning |
| --- | --- | --- |
| 1111111111 | R1-R4 | |
| 2222222222 | R1-R4 | C or M |
| 3333333333 | R5-R10 | |
| 4444444444 | | C or M |

Option 2b:

Table has two zoning columns, resi and c or m. There are still four possible zoning options per BBL, but pulled from values combined across both columns. This example uses a Yes/No value within the C or M column, not the actual value.

| BBL | Residential Zoning | C or M Zoning |
| --- | --- | --- |
| 1111111111 | R1-R4 | No |
| 2222222222 | R1-R4 | Yes |
| 3333333333 | R5-R10 | No |
| 4444444444 | | Yes |

For our upcoming GFT data meeting @croswell81 @damonmcc @fvankrieken (since I think you're tackling the zoning stuff?)

damonmcc commented 2 months ago

@jackrosacker should Option 2 have more specific values in the C or M Zoning column? it says "This example uses the actual value within the C or M column, not just a Yes/No option."

damonmcc commented 1 month ago

notes from DE & GIS chat on 4/2

Lot Zoning info

Natural Resources

Historic Resources (Alex)

Rail

Rail Yards

Beaches

damonmcc commented 1 month ago

@jackrosacker @croswell81

from our chat about Lot Zoning vs Project Zoning, I tried to illustrate the logic we'd implement in order to do Option 1 above (one Zoning column). I'll put the diagram in this comment, along with a link to the PR it's coming from

Logic for Lot Zoning in GFT

image

croswell81 commented 1 month ago

@damonmcc @jackrosacker This looks correct to me.

jackrosacker commented 1 month ago

Agreed, with one amendment: Any R5 - R10? should be Entirely R5 - R10?

@damonmcc @croswell81

damonmcc commented 1 month ago

thanks @jackrosacker! I think I see what you mean about changing to Entirely, so I revised it to be the diagram below.

I wanted to still use Any because that frame of mind translates really well to the logic/code we'll have to write. Hope this captures it!

image
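A sketch of the revised lot zoning decision tree, under one reading of the diagram above: any R1-R4 district wins first, then entirely R5-R10, then presence of C or M, with a non-null "Other" bucket for everything else. The check ordering and the mixed R1-R4 / C-or-M case (which lands in R1-R4 here) are assumptions still open in this thread:

```python
def base_res_class(district: str):
    """'R3-2' or 'R6A' -> residential class number (3, 6); None for C/M/other."""
    if district.startswith("R") and len(district) > 1 and district[1].isdigit():
        return int(district[1:3]) if district[1:3].isdigit() else int(district[1])
    return None

def lot_zoning_category(districts: list) -> str:
    classes = [base_res_class(d) for d in districts]
    has_c_or_m = any(d[:1] in ("C", "M") for d in districts)
    if any(n is not None and n <= 4 for n in classes):
        return "R1-R4"    # any R1-R4 district present
    if districts and all(n is not None and n >= 5 for n in classes):
        return "R5-R10"   # entirely R5-R10
    if has_c_or_m:
        return "C or M"   # e.g. the R6 + C4 lot discussed above
    return "Other"        # placeholder label, still to be chosen

print(lot_zoning_category(["R6", "C4-3"]))   # -> C or M
print(lot_zoning_category(["R6A", "R7-2"]))  # -> R5-R10
```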

jackrosacker commented 1 month ago

Cool, yeah that seems to cover more bases. I wonder if there's ever a circumstance in which there's a third branching option from the Any R5-R10 in which the lot is partially Other, meaning that the lot could not be classified as "Entirely R5 - R10". Or other possible edge cases? I'll leave subsequent comments in the #741 issue

@damonmcc

jackrosacker commented 1 month ago

@damonmcc I wanted to also suggest that we find and record a BBL for each possible zoning combination to use as benchmarks. If that's easy for DE to gather while writing the queries, could you add a list of BBLs to this issue? I'm happy to poke around and find them as well if helpful.

jackrosacker commented 1 month ago

@damonmcc

Notes from Damon <> Jack on 2024-04-09:

jackrosacker commented 1 month ago

@damonmcc I've been able to successfully publish the source_data_versions table to ArcGIS Online, and pull the dataset name and vers/date values into the survey as static values. This will enable us to print out the data versions at the bottom of each report PDF for easy reference. A few things occurred to me while implementing this:

This doesn't have to happen before the Beta.

damonmcc commented 1 month ago

@jackrosacker

Are the column headers finalized for this table? The naming convention doesn't matter at all, but the report will break if we change convention later

The column headers are finalized. This table is generated by code that we use in all builds to load source data.

Noticing variability in the date format in the 'v' column. I personally prefer the 1900-01-01 format. Requesting that we normalize to one format

Definitely possible. Since these are versions of source data, DE treats them more like strings than dates. Changing the format sort of breaks the 100% certainty that we'll find that exact value in edm-recipes, but no worries! We're talking about formatting for display, and DE can always retrieve the pre-formatted value if we need to.
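A display-normalization pass for the 'v' column could look like the sketch below, keeping the original string when a value isn't a date; the list of candidate input formats is an assumption:

```python
from datetime import datetime

CANDIDATE_FORMATS = ["%Y-%m-%d", "%Y%m%d", "%m/%d/%Y"]  # assumed input formats

def display_version(v: str) -> str:
    """Normalize date-like version strings to YYYY-MM-DD for display only;
    non-date versions pass through untouched."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(v, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return v

print(display_version("20240101"))  # -> 2024-01-01
print(display_version("v2.1"))      # -> v2.1
```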

Are we confident that each of the version/date values are getting updated when each build happens? I don't see anything to indicate otherwise, but want to be sure these values are accurate before printing into the report

Yup, the file green_fast_track/recipe.yml ensures that every build uses a particular version (in this case the latest version), which is then documented in the source data versions table.

Do the dates indicate a specific thing? i.e. date the data was ingested for external, date the data was built for internal, etc.

This varies by dataset, specifically by the source of the dataset. If the created date is programmatically available, the version is the created date. If it isn't available, the version is the ingested date.

Everything we ingest from NYC Open Data or an ArcGIS feature server uses the "last updated" value as the version.

We'd like to make this more clear for all source data and perhaps could reflect the details by one or more new columns in this table, but don't plan to do that soon. We may be able to describe what the version means for each dataset in DE's GFT documentation though!

jackrosacker commented 1 month ago

@damonmcc thanks for this. Glad to hear that the data is reliably up to date, and that the field names aren't expected to change. Based on what you're saying, I think we should:

damonmcc commented 1 month ago

potential source data changes/additions

source data details

GIS

damonmcc commented 1 month ago

from @jackrosacker on 5/1

priorities re DE items for the Beta launch on 5/3

Priority 1

Priority 2

Priority 3

jackrosacker commented 3 weeks ago

@fvankrieken and @alexrichey, following up on our conversations this week with a punch list of Fast Tracker items. I've tried to be as comprehensive as possible, but there are a few items such as dataset aggregation and column/dataset naming review that will probably benefit from a closer look together before DE puts in too much work.

Also - this list alludes to but does not fully cover the remaining datasets to be added. These fall into two buckets: (1) datasets that have been cleared by Matt/Planning Support and handed off to DE but haven't been processed yet, and (2) datasets that are still being digitized/approved.

I'm out of office this afternoon but could talk through and help prioritize the below list tomorrow if either/both of you are available.

  1. Create new Historic Resource buffer (in progress)
  2. Incorporate all outstanding datasets that have been made available by Matt etc. (e.g. rail yards, DOB wetlands) - Matt will have more detailed information on these
  3. Update dataset names for consistency
    • No huge changes needed, just to review conventions across entire population of datasets and ensure rough consistency. Happy to do this review together.
  4. Update arterial hwy buffers - still using the old distance. Review all other buffer distances for other possible updates
  5. Create a data sources page in Github - beta is currently linking straight to root of DE GH
  6. Update green_fast_track_bbl dataset:

    • Finish adding per-question flag columns

      • Existing, in 4/25 GDB:

      • State_Regulated_Freshwater_Wetlands___Checkzone_Flag

      • State_Archaeological_Areas_Flag

      • Natural_Resource_Shadow_Flag

      • Needed, calculated by GIS Team from 4/25 GDB:

      • Natural_Resource_Flag

      • Historic_Districts_Flag

      • Historic_Resource_Flag

      • Historic_Resource_Adjacent_Flag

      • Open_Space_Shadow_Flag

      • Historic_Resource_Shadow_Flag

    • Add Natural Heritage Communities name/id column

    • (optional) Ensure that any outstanding datasets have a corresponding column created, even if all values are null

    • Edit field names for consistency:

      • Name character count <= 25 characters (see existing, some are as long as 52 char)
      • Leading character must be alphabetical (see existing, some with leading '_' char)
      • Maintain naming conventions across the entire dataset: i.e. if multiple admin levels of a dataset are included (i.e. historic resources) make sure that each corresponding field explicitly names that level (i.e. nyc, nys, us)
      • Aliases don't matter for now - I re-alias anyway on my end
    • Take note of any fields that keep the name but change meaning. See Damon's example re Historic Districts: "And there's one flag that will be renamed (and not show in the tool anyway) as to not conflict with a question flag: Historic Districts -> City Historic Districts."

  7. Spelling of LaGuardia in airports dataset -> currently all lowercase (this is very low priority! just getting it in here)
  8. Return all point datasets as filtered to include only those points that have not been successfully joined to a lot (this is already the case for, say, historic resource pts, but not for CATS Permits)
  9. Create single dataset per question and per geometry type for the latter sections: Natural/Historic/Shadows
    • For buffers: merge and dissolve all constituent datasets into a single feature
    • For features: merge all constituent datasets and retain variable_type and variable_id values for each feature/row
    • Examples, using the datasets I manually produced (I used a "gft_" prefix, that doesn't have to remain static):
      • gft_shadow_open_spaces_buffer
      • gft_shadow_open_spaces_lots
      • gft_historic_resources_points
      • gft_historic_resources_lots
      • gft_historic_resources_buffer
      • gft_historic_districts
      • gft_natural_resources
    • (I have a more in depth table of these transformations and examples of the outputs produced, if helpful to review together)
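The field-name constraints in item 6 could be enforced with a quick check like this sketch; the sample names are illustrative:

```python
def field_name_problems(names: list) -> dict:
    """Map each offending field name to its list of rule violations:
    names must be <= 25 characters and start with an alphabetical character."""
    problems = {}
    for name in names:
        issues = []
        if len(name) > 25:
            issues.append("longer than 25 characters")
        if not (name and name[0].isalpha()):
            issues.append("leading character not alphabetical")
        if issues:
            problems[name] = issues
    return problems

print(field_name_problems([
    "Natural_Resource_Flag",                                 # passes both rules
    "_Historic_Districts_Flag",                              # leading '_'
    "State_Regulated_Freshwater_Wetlands___Checkzone_Flag",  # 52 characters
]))
```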

cc @croswell81

jackrosacker commented 3 weeks ago

@damonmcc @croswell81 I added data to our running GFT Data Sources sheet.

  1. Added columns to the Survey Questions tab:
    • App Flag Alias -> these values are what I'm currently using in the app as aliases for the flag columns
    • Possible Flag Field Name -> these are possible green_fast_track_bbl field names, based on my aliases and your existing field names (my alias uses "elevated" because that's in the survey, but DE switched to "exposed" which I like)
  2. Added a Flag to Name Pairing tab to capture the name/id aliases that I'm using, paired with the relevant flag. I haven't translated these into gft bbl field name ideas yet

Take a look when you get a chance and we can regroup as needed.

croswell81 commented 3 weeks ago

@damonmcc @jackrosacker Latest data updates (refer to bullet 2 in Jack's comment from 5/8 above). All data updates are reflected in the CEQR Type II Data Source Review doc.

Updates:

  1. Zoning data: signed off (marked approved)
  2. Exposed Railway (marked as approved)
    a. received updated version from GR
    b. can send via Teams or you can use latest LION and process - let me know what you prefer
  3. Exposed Railyards (marked as approved)
    a. received queries from GR
    b. instructions are provided in the Data Processing and GDE Notes columns
    c. can send via Teams or you can use instructions to process - let me know what you prefer
    d. GIS will provide the Railyard_HudsonYard_erase.shp in Teams
  4. Beaches (marked as approved)
    a. updated the Name column with new values that should be returned in the CSV export table
    b. will provide new data in Teams
  5. NYCDOB Natural Resource Check Flags (marked as approved)
    a. GIS provided data in Teams – let me know any questions
    b. one spreadsheet contains flags for the following:
      i. Tidal Wetland
      ii. Freshwater Wetland
      iii. Coastal Erosion Hazard Area
  6. DPR Park Properties – No changes, keep as is, we never heard back from Parks

Pending:

  1. Recognized Ecological Complexes (RECs) – working out methodology for this. Planning Support understands this will take more effort and is OK that it is not ready by Go Live. Will still try to get this ready for the 6/3 deadline.

jackrosacker commented 2 weeks ago

Noting to @damonmcc and @croswell81 that as I understand our design, the output CSV will have a single name/id column per dataset, meaning that e.g. if a lot both intersects with a tidal wetland (Nat Res question) and is also within 200ft of that tidal wetland (Shadow question), the name/id of that feature will appear once(?) in the single name/id column for that dataset, but will not differentiate in the export which question or flag it is associated to.

I think this makes sense to some degree, but had been vaguely assuming that the tidal wetland id would appear under a column for the natural resources intersection and another for the shadow buffer.

Does this line up with your understanding? Or am I missing something and we're planning to reflect that same tidal wetland ID in two columns, one per question/section?

croswell81 commented 2 weeks ago

@jackrosacker I was thinking that any resource that triggers a buffer intersecting a project lot would only be in one column (i.e. historic dist: contains, within 90 ft, within 200 ft - shadow), but I realize the buffers are different and therefore not all resources will apply to all buffers and questions.

I think we should send to Planning Support and see if they care before we have DE add a bunch of new columns to the export table. cc: @damonmcc

jackrosacker commented 2 weeks ago

Started to plot this out in advance of emailing PS, and ran into a few other wrinkles. Let's take an example project like below:

image

Iteration 1 - Our current design would have a single CSV column per dataset, regardless of how many times that dataset relates to the project, or through which question/spatial relationship:

| BBL | NYC Hist Res ID | NYS Hist Res ID |
| --- | --- | --- |
| 3030530013 | F. J. Berlenbach House | |
| 3030530016 | F. J. Berlenbach House | |
| 3030530019 | F. J. Berlenbach House | |
Iteration 2 - The alternative we discussed above, which ends up adding a column per question and per dataset, so that for each BBL you know the dataset, feature ID, and question/spatial relationship relevant:

| BBL | NYC Hist Res ID | NYC Hist Res ID - Adjacent | NYC Hist Res ID - Shadows | NYS Hist Res ID | NYS Hist Res ID - Adjacent | NYS Hist Res ID - Shadows |
| --- | --- | --- | --- | --- | --- | --- |
| 3030530013 | F. J. Berlenbach House | | | | | |
| 3030530016 | | F. J. Berlenbach House | | | | |
| 3030530019 | | | F. J. Berlenbach House, EXAMPLE RESOURCE FROM SAME SRC DATASET | | | |
Iteration 3 - A third option, in which each question has a corresponding data name/id field in the csv, and names/ids are grouped with a categorical prefix to indicate the relevant data source (imaginary datasets in all caps to demonstrate how multiple datasets would be aggregated into a single column):

| BBL | Hist Res ID | Hist Res ID - Adjacent | Hist Res ID - Shadows |
| --- | --- | --- | --- |
| 3030530013 | NYC: F. J. Berlenbach House | | |
| 3030530016 | | NYC: F. J. Berlenbach House, NYC: AN EXAMPLE RESOURCE, NYS: AN EXAMPLE RESOURCE | |
| 3030530019 | | | NYC: F. J. Berlenbach House, NYC: AN EXAMPLE RESOURCE |

Iteration 1 is the most concise, but makes it harder for an applicant or EARD to review an application and understand how the BBLs, flag datasets, and questions interact with one another. Iteration 2 makes the review easier, but substantially increases the number of columns required. Iteration 3 is a combination of the other two, with the benefit of fewer output fields but data that is harder to parse per CSV cell.

After exploring these directions, Iteration 1 (what we have now) feels the most viable. I don't currently have any other ideas for how to design this, do you two? @croswell81 @damonmcc

Edit: removed numeric values from field names, added an example of a concatenation of multiple features per question/bbl in iteration 2
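The Iteration 3 aggregation could be sketched as below: collapse (bbl, question, source, name) rows into one comma-separated, source-prefixed cell per question. The row structure and column labels are illustrative:

```python
from collections import defaultdict

def iteration3_cells(rows: list) -> dict:
    """Group feature names by (BBL, question) and prefix each with its source."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row["bbl"], row["question"])].append(f"{row['source']}: {row['name']}")
    return {key: ", ".join(names) for key, names in cells.items()}

rows = [
    {"bbl": "3030530016", "question": "Hist Res ID - Adjacent",
     "source": "NYC", "name": "F. J. Berlenbach House"},
    {"bbl": "3030530016", "question": "Hist Res ID - Adjacent",
     "source": "NYS", "name": "AN EXAMPLE RESOURCE"},
]
print(iteration3_cells(rows))
```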

damonmcc commented 2 weeks ago

@jackrosacker love having an example and those tables!

Iteration 1 is how we had it. While making the changes for the flag-column-per-question structure of the final table, there is now also a single id-column-per-question in that final table.

Iteration 3 is my favorite: one column of ID values per question. But I wonder if having column names like Hist Res - Adjacent would be better than Hist Res ID - 90', so that people can easily relate the column to the question and you won't have to maintain buffer values in alias strings.

If EARD review requires "exactly which dataset did that value come from", Iteration 2 seems like something we can add later in addition to Iteration 3.

croswell81 commented 2 weeks ago

@jackrosacker @damonmcc The example is missing how many of these values would just be repeated, since any lot that contains a historic resource will also be in the buffer, and any resource within 90 feet will also be within 200 feet.

My concern is this could potentially add dozens of fields since there are 10+ natural resource fields that go into NR shadow, and another 5-8 historic resources with two buffers, etc.

We should try to meet tomorrow when Jack is available.