cagov / data-infrastructure

CalData infrastructure
https://cagov.github.io/data-infrastructure
MIT License
7 stars 0 forks

DOF - Discovery on Fennis' process #154

Closed: britt-allen closed this issue 11 months ago

britt-allen commented 1 year ago

Recapping what we learned during our discovery calls with @fennisreed. Fennis, if I have misunderstood or missed something, can you clarify for me, please? cc: @ian-r-rose

Current process for flagging data and using in analysis

How does Cadastral-based Expert Dasymetric System (CEDS) fit into all of this? Would this be a technique we use to add three columns to the final dataset for parcel-level population, city, and county estimates?

britt-allen commented 1 year ago

Current Process for flagging specific footprints as bad

Are we missing other columns that should exist in a table of BF joined with county data?

britt-allen commented 1 year ago

Current process for doing data updates to footprints / Brainstorm how to preserve previous data updates so they’re not overwritten when new data become available

cc: @ian-r-rose @fennisreed

fennisreed commented 1 year ago

> Recapping what we learned during our discovery calls with @fennisreed. Fennis, if I have misunderstood or missed something, can you clarify for me, please? cc: @ian-r-rose
>
> Current process for flagging data and using in analysis
>
>   • Fennis intersects Parcels with BF data (BF = Building Footprint)
>   • Fennis uses a 90% threshold to determine whether a BF is retained in a parcel (meaning if 90% or more of a BF is in a parcel we make the intersection, if not we abandon the intersection?) This is currently a manual/visual process - no calculation used
>   • If nothing has the bulk, e.g. 20%, 20%, 20%, 40%, we want to split it (meaning if no BF takes up the majority of a parcel, 90%+, we split the parcel up with each corresponding BF?)
>   • The final BF joined with Parcel table would have X dimensions (columns). These are:
>
>     • Parcel index (Is this something we'd create - e.g. uuid or seq4 - or do parcels already get indexed?)
>     • Building footprint area (This is just the geometry, correct?)
>     • Inverse euclidean distance raster (Distance between other building footprints - this number gives a view into housing unit density)
>     • Residential parcels (this column would be a flag for whether a parcel is residential)
>
> How does Cadastral-based Expert Dasymetric System (CEDS) fit into all of this? Would this be a technique we use to add three columns to the final dataset for parcel-level population, city, and county estimates?

Hi Britt! A few clarifications:

  1. The 90% threshold of a BF within a parcel would result in the intersect being abandoned and a single Parcel Index being assigned to the footprint. This is done programmatically by calculating the proportion of intersecting geometries, and is not a manual process.
  2. In the 20%, 20%, 20%, 40% example, the resulting building footprint would be split into 4 segments with different Parcel Indexes. It will be worth confirming for incredibly large BF (such as shopping malls with parcels for each store front) that situations where all segments are <10% of total area should still be split into individual parcels.
  3. Parcels are already indexed! An updated footprint area will be required. Two indexes for the building footprint would be ideal - One representing the origin footprint available from the public dataset, with a secondary key for those footprints that have been further segmented.
  4. The inverse Euclidean distance is calculated later - I would say it is a bonus if we get to it, but out of scope for this project. Residential coding is updated annually from the county assessors and varies wildly between counties. I'd rather leave this one out of scope too so I can have more control over how it is incorporated during the CEDS component of an estimate.
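The keep-or-split rule clarified above can be sketched in plain Python. This is only an illustration of the decision logic (the function name and the precomputed fractions are hypothetical); in a real pipeline the fractions would come from a spatial intersection of each footprint with the parcels it touches, e.g. a GeoPandas overlay:

```python
# Sketch of the 90% retention rule: if one parcel holds >= 90% of a
# building footprint's area, the whole footprint gets that single
# parcel index; otherwise the footprint is split, one segment per
# intersecting parcel (the 20/20/20/40 case).
THRESHOLD = 0.90

def assign_parcels(fractions):
    """fractions: {parcel_index: share of the footprint's area in that parcel}."""
    dominant = max(fractions, key=fractions.get)
    if fractions[dominant] >= THRESHOLD:
        return {"split": False, "parcels": [dominant]}
    # No dominant parcel: split into one segment per intersecting parcel.
    return {"split": True, "parcels": sorted(fractions)}

# A footprint 95% inside parcel "A" keeps a single index:
assign_parcels({"A": 0.95, "B": 0.05})
# -> {'split': False, 'parcels': ['A']}

# The 20%, 20%, 20%, 40% example is split into four segments:
assign_parcels({"A": 0.2, "B": 0.2, "C": 0.2, "D": 0.4})
# -> {'split': True, 'parcels': ['A', 'B', 'C', 'D']}
```

The very-large-footprint caveat (shopping malls where every segment is under 10%) falls into the split branch here; whether that is the desired behavior is exactly the open question above.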
fennisreed commented 1 year ago

> Current Process for flagging specific footprints as bad
>
>   • First step is usually to scope to county
>   • Area_geo is not always filled; this is something @fennisreed manually adds (What is area geo? How do you arrive at this number? Is it the intersection of a BF with a County?)
>   • Fennis has no preference between SI and imperial units - just make sure we document
>   • @ian-r-rose Ian made mention of a Make Valid function to address bad geometry
>   • Data cleaning:
>
>     • Fennis recommends using any geo repair functions before doing area calcs
>     • Sorts by largest areas and manually looks through the top 100 to visually see what is a building, deleting what is not. Does this county by county, going finer grain for Los Angeles, Riverside, and San Bernardino counties
>   • Explore geohashing for adding a primary key/id option
>   • Likes that geodatabases are reduced storage compared to shapefiles and are consistent between the tools they are using - geodatabase for the DOF deliverable, but another open format for public use
>   • In the case when a footprint touches two counties, we'll start with assigning it to the one where the majority of the footprint is. Do we want to capture additional information about overlapping footprints?
>
>     • Error column
>     • Geometric correction, overlapping xx, … (Still unsure what this column represents and the logic to get there - not the code, just the thinking that would inform the code)
>     • County (A field for the county is preferred. Each footprint should be coded to a single county.)
>     • Intersection flag (In addition to the county field, a single status to denote a footprint intersects multiple counties would be useful.)
>     • If there is additional metadata we get from the geometric correction tool, this would be a good place to store it. (What tool? ESRI?)
>
> Are we missing other columns that should exist in a table of BF joined with county data?

  1. Area_Geo is an artefact from an earlier version of the footprints. Ignore this field, and proceed with our own calculated area for each footprint.
  2. For geometric correction the tool I've used in the past has been from ESRI. The tool outputs a series of statuses for the corrected features to identify self intersections, donuts, redundant vertices, and other minor corrections that may occur. While there is no direct need to record these statuses, they may be helpful for future users to query. If correction metadata is not readily available from something like Make Valid, I wouldn't stress too much about adding this.
  3. No additional fields to add at the county level.
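The county-coding scheme discussed above (a single majority county per footprint, plus a flag for footprints that cross county lines) can be sketched as follows. The function name is hypothetical, and the precomputed overlap areas stand in for the actual geometric intersection of each footprint with the county boundaries:

```python
def code_to_county(overlap_areas):
    """overlap_areas: {county_name: area of the footprint inside that county}.

    Each footprint is coded to the single county holding the majority of
    its area; a boolean flag records whether it crossed a county line.
    """
    county = max(overlap_areas, key=overlap_areas.get)
    crosses = sum(1 for a in overlap_areas.values() if a > 0) > 1
    return {"county": county, "intersects_multiple": crosses}

code_to_county({"Los Angeles": 120.0, "Orange": 4.5})
# -> {'county': 'Los Angeles', 'intersects_multiple': True}
```

Any metadata emitted by the geometry-repair step (self-intersections, donuts, redundant vertices) could be carried alongside these two fields in the same row, as suggested above.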
ian-r-rose commented 1 year ago

> Area_Geo is an artefact from an earlier version of the footprints. Ignore this field, and proceed with our own calculated area for each footprint.

Thanks, @fennisreed. To clarify: is it helpful to have a separate pre-computed column for area? If I were doing this with geopandas, I would not include an area column in the table, and just compute it on the fly from the geometries as needed. But there may be workflow-related reasons to have it that I'm not aware of.

ian-r-rose commented 1 year ago

> Parcels are already indexed! An updated footprint area will be required. Two indexes for the building footprint would be ideal - One representing the origin footprint available from the public dataset, with a secondary key for those footprints that have been further segmented.

I think you talked about this already, but how are parcels indexed? Is it by assessor ID of some sort, or an opaque UUID of some sort? Are the IDs guaranteed to be stable between different versions of the dataset?

fennisreed commented 1 year ago

>> Area_Geo is an artefact from an earlier version of the footprints. Ignore this field, and proceed with our own calculated area for each footprint.
>
> Thanks, @fennisreed. To clarify: is it helpful to have a separate pre-computed column for area? If I were doing this with geopandas, I would not include an area column in the table, and just compute it on the fly from the geometries as needed. But there may be workflow-related reasons to have it that I'm not aware of.

The main need for footprint area is for the Parcel Summary tables for internal use, where a sum of area per parcel is much easier to process ahead of time, rather than attaching it to CEDS or the random forest. It can take a fair amount of time to calculate, so just doing the whole state once per update has been a manageable way to handle it in the past.

I think we could forgo an area calculation for the public product.
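The workflow described above (compute each footprint's area once per statewide update, then roll it up per parcel for the internal Parcel Summary tables) is a simple group-by. A plain-Python sketch with a hypothetical function name; in practice the per-footprint areas would come from the geometries:

```python
from collections import defaultdict

def parcel_area_summary(footprints):
    """footprints: iterable of (parcel_index, area) pairs, with area
    precomputed once per statewide update rather than on the fly.
    Returns total footprint area per parcel."""
    totals = defaultdict(float)
    for parcel_index, area in footprints:
        totals[parcel_index] += area
    return dict(totals)

parcel_area_summary([("P1", 150.0), ("P1", 90.0), ("P2", 40.0)])
# -> {'P1': 240.0, 'P2': 40.0}
```

Paying the area-calculation cost once per update and reusing the column is the workflow-related reason for a pre-computed area that the question above was probing for.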

>> Parcels are already indexed! An updated footprint area will be required. Two indexes for the building footprint would be ideal - One representing the origin footprint available from the public dataset, with a secondary key for those footprints that have been further segmented.
>
> I think you talked about this already, but how are parcels indexed? Is it by assessor ID of some sort, or an opaque UUID of some sort? Are the IDs guaranteed to be stable between different versions of the dataset?

All parcels have received a DOF specific index from the FME GOID tool, resulting in unique identifiers like 473F433D6CC29746265F525D92394292. We rely on APN and County to join back to ParcelQuest data, as the same APN can be present in multiple counties. Otherwise, ParcelQuest does not distribute a primary key.

The DOF applied IDs have been the same since 2021, and have been used in a good number of deliverables and scripts so I'd prefer to retain the same IDs if possible.
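The FME GOID tool itself is proprietary, but the shape of the identifiers above (32 hex characters, stable across dataset versions) can be illustrated with a deterministic hash of the join keys Fennis mentions, APN plus county. This is only a sketch of the stable-key idea, not the actual GOID algorithm:

```python
import hashlib

def parcel_key(county, apn):
    """Deterministic 32-hex-character ID from the (county, APN) pair.

    County is included because the same APN can appear in multiple
    counties. The same input always yields the same ID, so keys stay
    stable across dataset versions as long as county and APN do not
    change. (Illustration only; not the FME GOID algorithm.)
    """
    return hashlib.md5(f"{county}|{apn}".encode("utf-8")).hexdigest().upper()

key = parcel_key("Sacramento", "123-456-789")
len(key)                                      # 32 hex characters
parcel_key("Sacramento", "123-456-789") == key  # stable on re-run: True
```

A scheme like this would not preserve the existing 2021-era DOF IDs, so retaining those as-is (as preferred above) and only minting new keys for newly segmented footprints would be the safer design.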