NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Green Fast Track support #486

Open AmandaDoyle opened 4 months ago

AmandaDoyle commented 4 months ago

Project Description

DE will extract source data, do geospatial calculations, and produce outputs needed for the new Green Fast Track survey GIS is making.

Timeline

Friday 2/23

For a demo of the POC, DE and GIS will focus on a subset of CEQR categories: Zoning, E-designation, Air, Noise.

Wednesday 3/20

Planning Support will host a forum with applicant reps for feedback on the Eligibility Tool. The goal is to confirm it would work for their workflow.

Friday 4/19

Improve GFT tool by incorporating remaining variables and critical feedback.

June 3 public launch

After public launch

Background

The City Environmental Quality Review (CEQR) process identifies and assesses the potential environmental impacts of land use actions that are proposed by public or private applicants.

DCP plans to streamline housing construction by allowing potential applicants for land use actions to determine whether their project is minor enough to qualify as CEQR Type II, and therefore exempt from environmental review. This is now known as the Green Fast Track Eligibility Tool, or the GFT tool.

GIS is creating a survey for potential applicants to use to determine if their project is CEQR Type II.

The determination will be based on the project area. A project area is defined as all relevant tax lots. For each tax lot, all relevant CEQR considerations must be checked.


Potential geospatial logic

Zoning districts

Air quality: add three binary fields indicating whether the tax lot intersects any of the following buffers:

Arterial highways and vent structures

Airport

Natural Resources and Shadows

Historic and Shadows

Open Space/Shadows

E-designations
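As a sketch of how the binary buffer flags above could be assembled, assuming each lot's distance to the nearest feature of each type has already been computed in a GIS step (the buffer distances here are placeholders, not confirmed project values):

```python
# Hypothetical buffer distances in feet; the real values are set by the
# CEQR Type II criteria and are still under review in this thread.
BUFFERS_FT = {
    "arterial_highway": 75,
    "vent_structure": 75,
    "airport": 400,
}

def air_quality_flags(distances_ft: dict) -> dict:
    """Return one boolean flag per feature type: True when the lot's
    nearest-feature distance falls within that feature's buffer."""
    return {
        f"{feature}_flag": distances_ft.get(feature, float("inf")) <= buffer
        for feature, buffer in BUFFERS_FT.items()
    }

# A lot 60 ft from an arterial highway and 1200 ft from an airport;
# no vent structure distance recorded, so that flag stays False.
print(air_quality_flags({"arterial_highway": 60, "airport": 1200}))
```

The same shape generalizes to the other proximity checks (elevated rail, natural resources, etc.): one precomputed distance per feature class, one boolean column per question.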

croswell81 commented 4 months ago

@AmandaDoyle @jackrosacker and I just reviewed this issue. Our comments and questions are below. In general we still need to review the data sources which we intend to start this week.

Zoning districts (Zoning Values within PLUTO)
• Is one zoning district range being assigned to a single column, or are we checking off whether those zoning ranges apply to the project in three Boolean columns?
• Is there a hierarchy for the different zoning district ranges: R1 through R4, R5 through R10, C or M? (i.e. if a BBL falls into multiple categories, is one range assigned over the others?)

Air quality – GIS Team needs to vet the sources

Arterial highways and vent structures – GIS Team needs to review the sources
• Arterial highways: if a project is located next to multiple features, do we need to list the names of all or just one? Closest?
• Arterial highway source – is this just the arterial highways in the DCM_ArterialsMajorStreets open dataset?
• Vent tower: is this a distinct question or part of the arterial highway question? Do we need to list the name of the vent?

Elevated subway or railway
• If a project is located next to multiple features, do we need to list the names of all or just one? Closest?
• Source: GIS Team to compare LION vs data received from PS

Airport – GIS Team needs to vet the source
• Confirm if EWR is excluded

Natural Resources and Shadows – GIS Team needs to review the sources (who determines what source is valid?)
• If a project is located next to multiple features, do we need to list the names of all or just one? Closest?
• Noticed a state and a federal wetland dataset – should we use only one? How do we handle conflicts?
• Beaches: need to identify a source

Historic and Shadows – GIS Team needs to review the sources
• If a project contains or is located next to multiple features, do we need to list the names of all or just one? Closest?

Open Space/Shadows – GIS Team needs to review the sources
• Should we include federal park properties?

E-designations
• Do we need to create a field for each type of e-designation (noise, air quality, hazmat)?
• For lots with multiple e-designations, should they be concatenated into one field?
• Should we include restrictive declarations (i.e. e-numbers that start with “R”)?
• Source: e-designation table (csv)

fvankrieken commented 4 months ago

Just wanted to clarify - who is point person (on our end) for questions for PS?

fvankrieken commented 4 months ago

("our end" meaning GDE, not DE)

AmandaDoyle commented 4 months ago

@croswell81 and @jackrosacker please see my answers below

Zoning districts (Zoning Values within PLUTO)
• Is one zoning district range being assigned to a single column, or are we checking off whether those zoning ranges apply to the project in three Boolean columns?
• Is there a hierarchy for the different zoning district ranges: R1 through R4, R5 through R10, C or M? (i.e. if a BBL falls into multiple categories, is one range assigned over the others?)

If it's possible to calculate the Zoning District for the CEQR II form using the existing four zoning district fields in PLUTO, that may be preferable. To be considered for the R5 through R10 residential zoning district, the four zoning district fields must only include R5 through R10 values and be absent of any other zoning district. Whereas to be considered R1 through R4, an R1 through R4 value must simply appear in one of the four zoning district fields. A BBL is assigned a zoning district if 10% or more of the BBL's area is covered by that zoning district. Zoning district values are assigned from most coverage to least, with Zoning District 1 covering the greatest area and Zoning District 4 the least. For CEQR II we just need to know if a project is R5 through R10 or R1 through R4.
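The 10% coverage rule described above could be sketched as follows; the function name and input shape are illustrative:

```python
def assign_zoning_districts(coverage: dict, threshold: float = 0.10, max_fields: int = 4) -> list:
    """Assign up to four zoning district values to a BBL.

    coverage maps district -> fraction of the lot's area covered. A district
    qualifies only at >= 10% coverage, and qualifying districts are ordered
    from greatest coverage (Zoning District 1) to least (Zoning District 4).
    """
    qualifying = [(d, frac) for d, frac in coverage.items() if frac >= threshold]
    qualifying.sort(key=lambda pair: pair[1], reverse=True)
    return [district for district, _ in qualifying[:max_fields]]

# A lot that is 55% R6, 38% C4-3, and 7% R3-2: the R3-2 sliver is dropped.
print(assign_zoning_districts({"R6": 0.55, "C4-3": 0.38, "R3-2": 0.07}))
# -> ['R6', 'C4-3']
```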

Arterial highways and vent structures – GIS Team needs to review the sources
• Arterial highways: if a project is located next to multiple features, do we need to list the names of all or just one? Closest?

We just need to know if a project is near an arterial highway or vent structure, or not. This would be a boolean value; we do not need to report any name.

• Arterial highway source – is this just the arterial highways in the DCM_ArterialsMajorStreets open dataset

That is a question for Andrew E.

• Vent tower: is this a distinct question or part of the arterial highway question? Do we need to list the name of the vent?

You do not need to list the name of the vent. Check with Planning Support on whether vents need to be reported separately from highways.

Elevated subway or railway
• If a project is located next to multiple features, do we need to list the names of all or just one? Closest?
• Source: GIS Team to compare LION vs data received from PS

We just need to know if a project is near an elevated subway or rail, or not. This would be a boolean value; we do not need to report any name.

Natural Resources and Shadows – GIS Team needs to review the sources (who determines what source is valid?)
• If a project is located next to multiple features, do we need to list the names of all or just one? Closest?
• Noticed a state and a federal wetland dataset – should we use only one? How do we handle conflicts?
• Beaches: need to identify a source

Like the above, we just need to know if a project is near a natural resource, not what it is. Since we only need a proximity flag, we don't need to worry about conflicts between sources. I propose the GIS team recommend which data source should be used.

Historic and Shadows – GIS Team needs to review the sources
• If a project contains or is located next to multiple features, do we need to list the names of all or just one? Closest?

Same as above.

Open Space/Shadows – GIS Team needs to review the sources
• Should we include federal park properties?

I'd ask PS.

E-designations
• Do we need to create a field for each type of e-designation (noise, air quality, hazmat)?
• For lots with multiple e-designations, should they be concatenated into one field?
• Should we include restrictive declarations (i.e. e-numbers that start with “R”)?
• Source: e-designation table (csv)

We need to work through e-designations with PS and EARD.

Regarding the GDE point of contact to PS and EARD - I'd prefer not to play telephone, so whoever is doing the work should reach out directly. But loop me in on all communications, and come to me first with any questions you want to discuss internally before reaching out to PS and EARD.

damonmcc commented 4 months ago

from sprint planning on 1/29

damonmcc commented 3 months ago

from sprint planning on 2/13

DE and GIS will prioritize 3 categories for a demo on Friday 2/23: Zoning, Air, Noise

GIS needs a version of the new data to load into the survey map

Jack has been using a dummy dataset to prototype the map

croswell81 commented 3 months ago

I am cleaning up the CEQR_Type_II_Data_Source_Review.xlsx doc. I will let you know when all edits are completed.

damonmcc commented 3 months ago

We've imported all source data and confirmed output specifications for the POC

Now working on build logic. To help keep GIS unblocked, we'll export mock data by tomorrow morning.

damonmcc commented 3 months ago

notes from Survey demo on 2/23

jackrosacker commented 3 months ago

Noting a couple of data items that popped up last week that I don't want to forget:

croswell81 commented 3 months ago

Since Arterial Highway is such a specific dataset, I would recommend keeping DCM arterial centerline data and doubling the buffer to 150 feet. We should confirm with Planning Support and EARD but it would be a much easier task than trying to recreate the arterial highway list from another source. @jackrosacker

jackrosacker commented 3 months ago

Could we use the centerline to select associated planimetric features within a set distance, and then base the 75' buffers off of those selected features? (Damon's idea from last week I think)

croswell81 commented 3 months ago

It would take someone going in and cleaning up all other roadways within that 75' buffer. The planimetric data doesn't have the attribute data needed to narrow down what could be selected.

croswell81 commented 3 months ago

Also, we should remember this is a flag, not a precise calculation. The Arterial Highway layer is the most accurate dataset, and it lets us use an existing public resource. Adjusting the buffer is the better way to go.

The planimetrics dataset is derived from aerial images and is updated only every 2-4 years, so it's not as reliable.

jackrosacker commented 3 months ago

Okie doke. We can run the increased buffer idea by folks in the pm meeting today

damonmcc commented 3 months ago

@croswell81 @jackrosacker

During one of our chats with Planning Support, there was mention of an Air Quality check I don't think we're doing yet: "are you within 400 feet of a manufacturing land use?"

It seems like the logic for this flag would be something like:

And the source data to pass along for the map would be "all manufacturing lots"

Does that sound right?

jackrosacker commented 3 months ago

That sounds tentatively correct to me, pending any changes from Matt. I added a placeholder question for this in the survey as well, until we have final language. I forwarded you the email from Alex with the initial question, and am placing the email text here:

Hi ITD Team,

We had a discussion with Stephanie today that included the desire for an addition to the air quality section of the tool. We currently do not have any logic to flag the potential for an unpermitted air quality source, and it is just in a note necessitating a manual check. We are hoping it may be possible to query within a 400 foot buffer from the site if there is any lot with a manufacturing land use (using PLUTO). If there is, we would flag that they need to check this study area for unpermitted industrial sources.

The question would be something like: Are there any manufacturing or processing facilities operating within 400 feet of the Development Site that may be unpermitted sources of air quality emissions?

We are happy to use some of our time next week to discuss.  

Definitely an item to work on further with them.

croswell81 commented 2 months ago

@damonmcc @jackrosacker They are looking to identify uses, not zoning, so the query or filter should use Land Use code = 06.

We should confirm with Planning Support and EARD there are no other Land Uses or building codes that allow "processing facilities" besides '06'.
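Assuming PLUTO-style lot records with a land use code and a precomputed distance from the project site, and that Land Use code '06' is the only qualifying code (still to be confirmed, per the above), the check could be sketched as:

```python
MANUFACTURING_LANDUSE = {"06"}  # assumed set of qualifying land use codes
BUFFER_FT = 400

def has_nearby_manufacturing(lots: list) -> bool:
    """True if any lot with a manufacturing land use lies within the buffer."""
    return any(
        lot["landuse"] in MANUFACTURING_LANDUSE and lot["distance_ft"] <= BUFFER_FT
        for lot in lots
    )

# Illustrative lots near a hypothetical project site.
lots = [
    {"bbl": "1000010001", "landuse": "01", "distance_ft": 120},
    {"bbl": "1000010002", "landuse": "06", "distance_ft": 350},
]
print(has_nearby_manufacturing(lots))  # -> True
```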

jackrosacker commented 2 months ago

@damonmcc, Alex shared this pilot area with me: 3285 Fulton Street BBLs available in the ZAP link

And a second pilot area for BBL 4122770001. I'll see if we can add you to the PPT, I'm not sure how much of the draft doc info is ok to live here.

jackrosacker commented 2 months ago

@damonmcc et al, I spotted a couple of possible issues with the zoning classifications within the green fast track bbl dataset.

Item 1: Lots with some R1-R4 present are getting classified as R5-R10

image

In this example, lot 1 (red underline) is classified as R1-R4, while lot 70 (blue underline) is classified as R5-R10. I think both should be R1-R4, since in both cases part of the lot is within that range. I could be wrong about this, so let's discuss before you start re-engineering anything.

Item 2: Lots with no zoning category at all See BBL 4124950002

I haven't done any form of comprehensive error checking yet, just flagging these as they pop up

fvankrieken commented 2 months ago

From my understanding, we (AD and PS) agreed on "pluto classification" as the truth for whether a lot is R1-R4 (or contains any specific district). That threshold is 10% coverage of the lot.

croswell81 commented 2 months ago

@jackrosacker @fvankrieken beat me to it. We don't assign zoning to PLUTO lots if it covers less than 10%, and that will apply to this app and data as well.

fvankrieken commented 2 months ago

I'll look into 4124950002

jackrosacker commented 2 months ago

Ok, I'd lost track of the fact that we're following the 10% rule for GFT as well. I'll add that to the zoning aggregation description in my presentation.

@fvankrieken - seeing ~10k lots without any zoning classification in what I believe is the latest version of the dataset. There might be a legitimate reason for some of these that's eluding me

fvankrieken commented 2 months ago

Current implementation is based on the original zoning logic, which said one of

4124950002 is R6 and C4, meaning it fits none of those three categories. I think you and @damonmcc had come up with a revised little decision tree, which will need to be implemented.

Though still not quite sure what this should be "flagged" as? For this specific case of R6 and C4?

jackrosacker commented 2 months ago

Pretty sure this one would be a C or M lot. My understanding of the language is that it only requires the presence of C or M, not necessarily "wholly."

This is probably a post-demo discussion. Sounds like we should discuss and refine the zoning decision tree, and choose a non-null text value to indicate lots that are in none of the buckets above ("Other", "Ineligible", etc.)

As I'm writing this I'm wondering if we also need to account differently for lots that are a mixture of R1-R4 and C or M.

croswell81 commented 2 months ago

After the demo, maybe GIS and DE can get together again to review the logic of each field, including updating buffer distances.

jackrosacker commented 2 months ago

Logging some thoughts here for how to account for lots with both R1-R4 and C or M present. Each option represents a hypothetical project site in which four lots are selected, each with the following zoning values.

Option 1:

Table has a single zoning category column, with four possible options per BBL.

| BBL | Zoning Category |
| --- | --- |
| 1111111111 | R1-R4 |
| 2222222222 | R5-R10 |
| 3333333333 | C or M |
| 4444444444 | R1-R4 with C or M |

Option 2:

Table has two zoning columns, resi and c or m. There are still four possible zoning options per BBL, but pulled from values combined across both columns. This example uses the actual value within the C or M column, not just a Yes/No option.

| BBL | Residential Zoning | C or M Zoning |
| --- | --- | --- |
| 1111111111 | R1-R4 | |
| 2222222222 | R1-R4 | C or M |
| 3333333333 | R5-R10 | |
| 4444444444 | | C or M |

Option 2b:

Table has two zoning columns, resi and c or m. There are still four possible zoning options per BBL, but pulled from values combined across both columns. This example uses a Yes/No value within the C or M column, not the actual value.

| BBL | Residential Zoning | C or M Zoning |
| --- | --- | --- |
| 1111111111 | R1-R4 | No |
| 2222222222 | R1-R4 | Yes |
| 3333333333 | R5-R10 | No |
| 4444444444 | | Yes |

For our upcoming GFT data meeting @croswell81 @damonmcc @fvankrieken (since I think you're tackling the zoning stuff?)

damonmcc commented 2 months ago

@jackrosacker should Option 2 have more specific values in the C or M Zoning column? it says "This example uses the actual value within the C or M column, not just a Yes/No option."

damonmcc commented 1 month ago

notes from DE & GIS chat on 4/2

Lot Zoning info

Natural Resources

Historic Resources (Alex)

Rail

Rail Yards

Beaches

damonmcc commented 1 month ago

@jackrosacker @croswell81

from our chat about Lot Zoning vs Project Zoning, I tried to illustrate the logic we'd implement in order to do Option 1 above (one Zoning column). I'll put the diagram in this comment, along with a link to the PR it's coming from

Logic for Lot Zoning in GFT

image

croswell81 commented 1 month ago

@damonmcc @jackrosacker This looks correct to me.

jackrosacker commented 1 month ago

Agreed, with one amendment: Any R5 - R10? should be Entirely R5 - R10?

@damonmcc @croswell81

damonmcc commented 1 month ago

thanks @jackrosacker! I think I see what you mean about changing to Entirely, so I revised it to be the diagram below.

I wanted to still use Any because that frame of mind translates really well to the logic/code we'll have to write. Hope this captures it!

image
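A sketch of the revised lot zoning decision tree, under one reading of the diagram above: any R1-R4 district wins first, then entirely R5-R10, then presence of C or M, with a non-null "Other" bucket for everything else. The check ordering and the mixed R1-R4 / C-or-M case (which lands in R1-R4 here) are assumptions still open in this thread:

```python
def base_res_class(district: str):
    """'R3-2' or 'R6A' -> residential class number (3, 6); None for C/M/other."""
    if district.startswith("R") and len(district) > 1 and district[1].isdigit():
        return int(district[1:3]) if district[1:3].isdigit() else int(district[1])
    return None

def lot_zoning_category(districts: list) -> str:
    classes = [base_res_class(d) for d in districts]
    has_c_or_m = any(d[:1] in ("C", "M") for d in districts)
    if any(n is not None and n <= 4 for n in classes):
        return "R1-R4"    # any R1-R4 district present
    if districts and all(n is not None and n >= 5 for n in classes):
        return "R5-R10"   # entirely R5-R10
    if has_c_or_m:
        return "C or M"   # e.g. the R6 + C4 lot discussed above
    return "Other"        # placeholder label, still to be chosen

print(lot_zoning_category(["R6", "C4-3"]))   # -> C or M
print(lot_zoning_category(["R6A", "R7-2"]))  # -> R5-R10
```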

jackrosacker commented 1 month ago

Cool, yeah that seems to cover more bases. I wonder if there's ever a circumstance in which there's a third branching option from the Any R5-R10 in which the lot is partially Other, meaning that the lot could not be classified as "Entirely R5 - R10". Or other possible edge cases? I'll leave subsequent comments in the #741 issue

@damonmcc

jackrosacker commented 1 month ago

@damonmcc I wanted to also suggest that we find and record a BBL for each possible zoning combination to use as benchmarks. If that's easy for DE to gather while writing the queries, could you add a list of BBLs to this issue? I'm happy to poke around and find them as well if helpful.

jackrosacker commented 1 month ago

@damonmcc

Notes from Damon <> Jack on 2024-04-09:

jackrosacker commented 1 month ago

@damonmcc I've been able to successfully publish the source_data_versions table to ArcGIS Online, and pull the dataset name and vers/date values into the survey as static values. This will enable us to print out the data versions at the bottom of each report PDF for easy reference. A few things occurred to me while implementing this:

This doesn't have to happen before the Beta.

damonmcc commented 1 month ago

@jackrosacker

Are the column headers finalized for this table? The naming convention doesn't matter at all, but the report will break if we change convention later

The column headers are finalized. This table is generated by code that we use in all builds to load source data.

Noticing variability in the date format in the 'v' column. I personally prefer the 1900-01-01 format. Requesting that we normalize to one format

Definitely possible. Since these are versions of source data, DE treats them more like strings than dates. Changing the format sort of breaks the 100% certainty that we'll find that exact value in edm-recipes, but no worries! We're talking about formatting for display, and DE can always retrieve the pre-formatted value if we need to.
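A display-normalization pass for the 'v' column could look like the sketch below, keeping the original string when a value isn't a date; the list of candidate input formats is an assumption:

```python
from datetime import datetime

CANDIDATE_FORMATS = ["%Y-%m-%d", "%Y%m%d", "%m/%d/%Y"]  # assumed input formats

def display_version(v: str) -> str:
    """Normalize date-like version strings to YYYY-MM-DD for display only;
    non-date versions pass through untouched."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(v, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return v

print(display_version("20240101"))  # -> 2024-01-01
print(display_version("v2.1"))      # -> v2.1
```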

Are we confident that each of the version/date values are getting updated when each build happens? I don't see anything to indicate otherwise, but want to be sure these values are accurate before printing into the report

Yup, the file green_fast_track/recipe.yml ensures that every build uses a particular version (in this case the latest version), which is then documented in the source data versions table.

Do the dates indicate a specific thing? i.e. date the data was ingested for external, date the data was built for internal, etc.

This varies by dataset, specifically by the source of the dataset. If the created date is programmatically available, the version is the created date. If it isn't available, the version is the ingested date.

Everything we ingest from NYC Open Data or an ArcGIS feature server uses the "last updated" value as the version.

We'd like to make this more clear for all source data and perhaps could reflect the details by one or more new columns in this table, but don't plan to do that soon. We may be able to describe what the version means for each dataset in DE's GFT documentation though!

jackrosacker commented 1 month ago

@damonmcc thanks for this. Glad to hear that the data is reliably up to date, and that the field names aren't expected to change. Based on what you're saying, I think we should:

damonmcc commented 1 month ago

potential source data changes/additions

source data details

GIS

damonmcc commented 1 month ago

from @jackrosacker on 5/1

priorities re DE items for the Beta launch on 5/3

Priority 1

Priority 2

Priority 3

jackrosacker commented 3 weeks ago

@fvankrieken and @alexrichey, following up on our conversations this week with a punch list of Fast Tracker items. I've tried to be as comprehensive as possible, but there are a few items such as dataset aggregation and column/dataset naming review that will probably benefit from a closer look together before DE puts in too much work.

Also - this list alludes to but does not fully cover the remaining datasets to be added. These fall into two buckets: (1) datasets that have been cleared by Matt/Planning Support and handed off to DE but haven't been processed yet, and (2) datasets that are still being digitized/approved.

I'm out of office this afternoon but could talk through and help prioritize the below list tomorrow if either/both of you are available.

  1. Create new Historic Resource buffer (in progress)
  2. Incorporate all outstanding datasets that have been made available by Matt etc. (e.g. rail yards, DOB wetlands) - Matt will have more detailed information on these
  3. Update dataset names for consistency
    • No huge changes needed, just to review conventions across entire population of datasets and ensure rough consistency. Happy to do this review together.
  4. Update arterial hwy buffers - still using the old distance. Review all other buffer distances for other possible updates
  5. Create a data sources page in Github - beta is currently linking straight to root of DE GH
  6. Update green_fast_track_bbl dataset:

    • Finish adding per-question flag columns

      • Existing, in 4/25 GDB:

      • State_Regulated_Freshwater_Wetlands___Checkzone_Flag

      • State_Archaeological_Areas_Flag

      • Natural_Resource_Shadow_Flag

      • Needed, calculated by GIS Team from 4/25 GDB:

      • Natural_Resource_Flag

      • Historic_Districts_Flag

      • Historic_Resource_Flag

      • Historic_Resource_Adjacent_Flag

      • Open_Space_Shadow_Flag

      • Historic_Resource_Shadow_Flag

    • Add Natural Heritage Communities name/id column

    • (optional) Ensure that any outstanding datasets have a corresponding column created, even if all values are null

    • Edit field names for consistency:

      • Name character count <= 25 characters (see existing, some are as long as 52 char)
      • Leading character must be alphabetical (see existing, some with leading '_' char)
      • Maintain naming conventions across the entire dataset: i.e. if multiple admin levels of a dataset are included (i.e. historic resources) make sure that each corresponding field explicitly names that level (i.e. nyc, nys, us)
      • Aliases don't matter for now - I re-alias anyway on my end
    • Take note of any fields that keep the name but change meaning. See Damon's example re Historic Districts: "And there's one flag that will be renamed (and not show in the tool anyway) as to not conflict with a question flag: Historic Districts -> City Historic Districts."

  7. Spelling of LaGuardia in airports dataset -> currently all lowercase (this is very low priority! just getting it in here)
  8. Return all point datasets as filtered to include only those points that have not been successfully joined to a lot (this is already the case for, say, historic resource pts, but not for CATS Permits)
  9. Create single dataset per question and per geometry type for the latter sections: Natural/Historic/Shadows
    • For buffers: merge and dissolve all constituent datasets into a single feature
    • For features: merge all constituent datasets and retain variable_type and variable_id values for each feature/row
    • Examples, using the datasets I manually produced (I used a "gft_" prefix, that doesn't have to remain static):
      • gft_shadow_open_spaces_buffer
      • gft_shadow_open_spaces_lots
      • gft_historic_resources_points
      • gft_historic_resources_lots
      • gft_historic_resources_buffer
      • gft_historic_districts
      • gft_natural_resources
    • (I have a more in depth table of these transformations and examples of the outputs produced, if helpful to review together)
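The field-name constraints in item 6 could be enforced with a quick check like this sketch; the sample names are illustrative:

```python
def field_name_problems(names: list) -> dict:
    """Map each offending field name to its list of rule violations:
    names must be <= 25 characters and start with an alphabetical character."""
    problems = {}
    for name in names:
        issues = []
        if len(name) > 25:
            issues.append("longer than 25 characters")
        if not (name and name[0].isalpha()):
            issues.append("leading character not alphabetical")
        if issues:
            problems[name] = issues
    return problems

print(field_name_problems([
    "Natural_Resource_Flag",                                 # passes both rules
    "_Historic_Districts_Flag",                              # leading '_'
    "State_Regulated_Freshwater_Wetlands___Checkzone_Flag",  # 52 characters
]))
```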

cc @croswell81

jackrosacker commented 3 weeks ago

@damonmcc @croswell81 I added data to our running GFT Data Sources sheet.

  1. Added columns to the Survey Questions tab:
    • App Flag Alias -> these values are what I'm currently using in the app as aliases for the flag columns
    • Possible Flag Field Name -> these are possible green_fast_track_bbl field names, based on my aliases and your existing field names (my alias uses "elevated" because that's in the survey, but DE switched to "exposed" which I like)
  2. Added a Flag to Name Pairing tab to capture the name/id aliases that I'm using, paired with the relevant flag. I haven't translated these into gft bbl field name ideas yet

Take a look when you get a chance and we can regroup as needed.

croswell81 commented 3 weeks ago

@damonmcc @jackrosacker Latest data updates (refer to bullet 2 in Jack's comment from 5/8 above). All data updates are reflected in the CEQR Type II Data Source Review doc.

Updates:

  1. Zoning data: signed off (marked approved)
  2. Exposed Railway (marked as approved)
    a. received updated version from GR
    b. can send via Teams or you can use latest LION and process - let me know what you prefer
  3. Exposed Railyards (marked as approved)
    a. received queries from GR
    b. instructions are provided in the Data Processing and GDE Notes columns
    c. can send via Teams or you can use instructions to process - let me know what you prefer
    d. GIS will provide the Railyard_HudsonYard_erase.shp in Teams
  4. Beaches (marked as approved)
    a. updated the Name column with new values that should be returned in the CSV export table
    b. will provide new data in Teams
  5. NYCDOB Natural Resource Check Flags (marked as approved)
    a. GIS provided data in Teams – let me know any questions
    b. one spreadsheet contains flags for the following:
      i. Tidal Wetland
      ii. Freshwater Wetland
      iii. Coastal Erosion Hazard Area
  6. DPR Park Properties – No changes, keep as is, we never heard back from Parks

Pending:

  1. Recognized Ecological Complexes (RECs) – working out methodology for this. Planning Support understands this will take more effort and is OK that it is not ready by Go Live. Will still try to get this ready for the 6/3 deadline.

jackrosacker commented 2 weeks ago

Noting to @damonmcc and @croswell81 that as I understand our design, the output CSV will have a single name/id column per dataset, meaning that e.g. if a lot both intersects with a tidal wetland (Nat Res question) and is also within 200ft of that tidal wetland (Shadow question), the name/id of that feature will appear once(?) in the single name/id column for that dataset, but will not differentiate in the export which question or flag it is associated to.

I think this makes sense to some degree, but had been vaguely assuming that the tidal wetland id would appear under a column for the natural resources intersection and another for the shadow buffer.

Does this line up with your understanding? Or am I missing something and we're planning to reflect that same tidal wetland ID in two columns, one per question/section?

croswell81 commented 2 weeks ago

@jackrosacker I was thinking that any resource that triggers a buffer intersecting a project lot would only be in one column (i.e. historic dist: contains, within 90 ft, within 200 ft - shadow), but I realize the buffers are different and therefore not all resources will apply to all buffers and questions.

I think we should send to Planning Support and see if they care before we have DE add a bunch of new columns to the export table. cc: @damonmcc

jackrosacker commented 2 weeks ago

Started to plot this out in advance of emailing PS, and ran into a few other wrinkles. Let's take an example project like below:

image

Iteration 1 - Our current design would have a single CSV column per dataset, regardless of how many times that dataset relates to the project, or through which question/spatial relationship:

| BBL | NYC Hist Res ID | NYS Hist Res ID |
| --- | --- | --- |
| 3030530013 | F. J. Berlenbach House | |
| 3030530016 | F. J. Berlenbach House | |
| 3030530019 | F. J. Berlenbach House | |
Iteration 2 - The alternative we discussed above, which ends up adding a column per question and per dataset, so that for each BBL you know the dataset, feature ID, and question/spatial relationship relevant:

| BBL | NYC Hist Res ID | NYC Hist Res ID - Adjacent | NYC Hist Res ID - Shadows | NYS Hist Res ID | NYS Hist Res ID - Adjacent | NYS Hist Res ID - Shadows |
| --- | --- | --- | --- | --- | --- | --- |
| 3030530013 | F. J. Berlenbach House | | | | | |
| 3030530016 | | F. J. Berlenbach House | | | | |
| 3030530019 | | | F. J. Berlenbach House, EXAMPLE RESOURCE FROM SAME SRC DATASET | | | |
Iteration 3 - A third option, in which each question has a corresponding data name/id field in the csv, and names/ids are grouped with a categorical prefix to indicate the relevant data source (imaginary datasets in all caps to demonstrate how multiple datasets would be aggregated into a single column):

| BBL | Hist Res ID | Hist Res ID - Adjacent | Hist Res ID - Shadows |
| --- | --- | --- | --- |
| 3030530013 | NYC: F. J. Berlenbach House | | |
| 3030530016 | | NYC: F. J. Berlenbach House, NYC: AN EXAMPLE RESOURCE, NYS: AN EXAMPLE RESOURCE | |
| 3030530019 | | | NYC: F. J. Berlenbach House, NYC: AN EXAMPLE RESOURCE |

Iteration 1 is the most concise, but makes it harder for an applicant or EARD to review an application and understand how the BBLs, flag datasets, and questions interact with one another. Iteration 2 makes the review easier, but substantially increases the number of columns required. Iteration 3 is a combination of the other two, with the benefit of fewer output fields but data that is harder to parse per CSV cell.

After exploring these directions, Iteration 1 (what we have now) feels the most viable. I don't currently have any other ideas for how to design this, do you two? @croswell81 @damonmcc

Edit: removed numeric values from field names, added an example of a concatenation of multiple features per question/bbl in iteration 2
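The Iteration 3 aggregation could be sketched as below: collapse (bbl, question, source, name) rows into one comma-separated, source-prefixed cell per question. The row structure and column labels are illustrative:

```python
from collections import defaultdict

def iteration3_cells(rows: list) -> dict:
    """Group feature names by (BBL, question) and prefix each with its source."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row["bbl"], row["question"])].append(f"{row['source']}: {row['name']}")
    return {key: ", ".join(names) for key, names in cells.items()}

rows = [
    {"bbl": "3030530016", "question": "Hist Res ID - Adjacent",
     "source": "NYC", "name": "F. J. Berlenbach House"},
    {"bbl": "3030530016", "question": "Hist Res ID - Adjacent",
     "source": "NYS", "name": "AN EXAMPLE RESOURCE"},
]
print(iteration3_cells(rows))
```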

damonmcc commented 2 weeks ago

@jackrosacker love having an example and those tables!

Iteration 1 is how we had it. While making the changes for the flag-column-per-question structure of the final table, there is now also a single id-column-per-question in that final table.

Iteration 3 is my favorite: one column of ID values per question. But I wonder if having column names like Hist Res - Adjacent would be better than Hist Res ID - 90', so that people can easily relate the column to the question and you won't have to maintain buffer values in alias strings.

If EARD review requires "exactly which dataset did that value come from", Iteration 2 seems like something we can add later in addition to Iteration 3.

croswell81 commented 2 weeks ago

@jackrosacker @damonmcc The example is missing how many of these values would just be repeated, since any lot that contains a historic resource will also be in the buffer, and any resource within 90 feet will also be within 200 feet.

My concern is this could potentially add dozens of fields since there are 10+ natural resource fields that go into NR shadow, and another 5-8 historic resources with two buffers, etc.

We should try to meet tomorrow when Jack is available.