Closed britt-allen closed 11 months ago
Thank you for chatting today, @britt-allen!
When intersecting with building footprints, there are two criteria under which part of a footprint should be dropped:
Stacked parcels may occur in the provided dataset, and are flagged with a boolean field, P_Stacked. These are areas where duplicate parcel geometry represents distinct units within the same structure or complex. If memory serves, stacked cases with >10 parcels have been collapsed into a single cumulative parcel, favoring residential codes and the total number of units. Stacked cases with <=10 parcels have been retained for unit-specific attribution.
For the purposes of this project, no parcels should be dropped from the dataset. In stacked situations, if the lower unit has 500 sqft of footprint area, the upper (visually duplicate) unit should also receive 500 sqft.
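To make that expectation concrete, here is a minimal pandas sketch of propagating footprint area across stacked duplicates. The `stack_id` grouping key and the sample values are made up for illustration; the real dataset flags stacked cases with the boolean P_Stacked field.

```python
import pandas as pd

# Made-up sample: two stacked units sharing stack_id 1, plus a normal parcel.
# The upper stacked unit initially received no footprint area.
parcels = pd.DataFrame({
    "stack_id": [1, 1, 2],
    "footprint_area_sqft": [500.0, 0.0, 800.0],
})

# Give every unit in a stack the largest area observed in that stack,
# so the upper duplicate receives the same 500 sqft as the lower unit.
parcels["footprint_area_sqft"] = (
    parcels.groupby("stack_id")["footprint_area_sqft"].transform("max")
)
```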
Summing intersecting footprint sqft over stacked parcels may result in > 100% of original footprint area. This complicates comparison using the 10% threshold. In these situations, I recommend calculating percent membership using the sum of intersected footprint areas, rather than the originally calculated footprint area.
Because this model is intended for use in human population estimation, favor should be given where USETYPE <= 112, which denotes residential use. Because our estimates are regularly updated with jurisdiction-specific permitting data, it is still helpful to calculate footprint area for all parcels, including non-residential features, which may yet receive residential correction down the line.
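For illustration, flagging (rather than dropping) the residential parcels might look like the following sketch. Only the USETYPE <= 112 rule comes from the dataset notes; the sample rows are made up.

```python
import pandas as pd

# Made-up sample parcels; USETYPE <= 112 denotes residential use.
parcels = pd.DataFrame({"parcel_id": ["a", "b"], "USETYPE": [101, 300]})

# Flag residential parcels instead of filtering, since non-residential
# features may still receive residential correction later.
parcels["is_residential"] = parcels["USETYPE"] <= 112
```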
Please let me know if I can provide further clarification!
Handing off to @ian-r-rose as the programming work is a bit more advanced than time allows me to learn and implement.
Hi @fennisreed, I'm working on refining some of the criteria you listed above, and wanted to run a couple of things by you:
> <10% of a footprint is contained in the area UNLESS the footprint is being split into >=5 parcels (for dense urban areas, where sometimes footprint areas <10% are accurate because the remotely sensed footprint spans a whole city block).
Makes sense! In some of my testing I'm finding that 20% seems to work a little better (at least according to my intuition about whether/how things should be split). In any event, I intend to make this number configurable. It also depends on what metric you use for the relative comparison, about which more below.
> Summing intersecting footprint sqft over stacked parcels may result in > 100% of original footprint area. This complicates comparison using the 10% threshold. In these situations, I recommend calculating percent membership using the sum of intersected footprint areas, rather than the originally calculated footprint area.
I'm going to suggest a tweaked version of this that may be a little simpler to reason about: what if instead we calculate the percent as a fraction of the maximum intersected area? So the largest intersection of a footprint and parcel would always be 1.0, and everything smaller would be a fraction of that. We would then discard all footprint fragments that are less than X% of the largest intersection.
What I like about this is that it is a bit more robust to things like fully overlapping parcels (whether they are errors of inclusion or things like condos). But it's quite easy to test both approaches: the difference between them is a one line change.
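For concreteness, here is a sketch of the two metrics side by side; the column names and intersection areas are made up for illustration, and the real implementation lives in the gist.

```python
import pandas as pd

# One footprint split across three parcels (made-up areas).
pieces = pd.DataFrame({
    "footprint_id": [1, 1, 1],
    "parcel_id": ["a", "b", "c"],
    "intersection_area": [900.0, 90.0, 10.0],
})
grouped = pieces.groupby("footprint_id")["intersection_area"]

# Metric 1: fraction of the summed intersected area (robust when stacked
# parcels make the intersections sum to more than the footprint area).
pieces["pct_of_sum"] = pieces["intersection_area"] / grouped.transform("sum")

# Metric 2: fraction of the largest single intersection (largest is 1.0).
pieces["pct_of_max"] = pieces["intersection_area"] / grouped.transform("max")

# Switching the cutoff comparison between the metrics is a one-line change.
keep = pieces[pieces["pct_of_max"] >= 0.1]
```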
@fennisreed I have some sample code with an initial implementation of the algorithms here: https://gist.github.com/ian-r-rose/d94547ba282be59398b4474327839bef
The intention is that these functions be put in a Python module that could be imported into an ArcGIS notebook. If that is not possible, I suppose they could be copy/pasted around. But the actual usage of the algorithm is fairly brief, and the cutoff parameters, etc. are customizable:
```python
import geopandas

# Read the data!
footprints = geopandas.read_parquet(
    "s3://dof-demographics-dev-us-west-2-public/global_ml_building_footprints/parquet/county_fips_003.parquet"
)
parcels = geopandas.read_file(
    "Artificial Parcel Network 2021.gdb", layer="T_003_Parcels"
)

# Do the spatial join and handle footprints that intersect multiple parcels.
# assign_footprints_to_parcels is defined in the gist linked above.
footprints_with_parcels = assign_footprints_to_parcels(
    footprints,
    parcels,
)
```
It's probably best discussed in a higher bandwidth setting, but a few things I'm noticing about it:
@fennisreed you also expressed interest in being able to output a CSV with the parcel index and the total building footprint area for that parcel. Here is a snippet that can be tacked onto the above `footprints_with_parcels` calculation to do that:
```python
footprint_area_per_parcel = (
    footprints_with_parcels
    # Project to EPSG:2226 (a foot-based CRS) so areas come out in square feet.
    .assign(footprint_area_sqft=footprints_with_parcels.to_crs(epsg=2226).geometry.area)
    .groupby("Parcel_Ind")
    .agg({"footprint_area_sqft": "sum"})
)
footprint_area_per_parcel.to_csv("footprint_area_per_parcel.csv")
```
This computes the sum of the footprint areas for each parcel (note the choice of EPSG:2226 to get Alpine County areas in feet).
TODOs from meeting on 11/8
- [x] @fennisreed to validate the CSV output snippet above
CSV export snippet works great - thank you for this, Ian!
Reviewing the larger parcel x footprint reconciliation script, everything seems to work as planned. The majority of circumstances were well represented with appropriate parameter tuning, save for dense urban areas like San Francisco.
We discussed instances where multifamily footprints are losing portions for failing to meet the two thresholds, and have proposed a new parameter: footprints intersecting >5 parcels will ignore the two subsequent thresholds. Examples of where this behavior could be valuable include Alameda (near footprint 379303 at 37.768149 -122.285325) and San Francisco (near footprint 56031 at 37.768149 -122.285325).
@fennisreed I've updated my gist to include a new parameter `count_cutoff`, which doesn't throw out any intersections if more than `count_cutoff` intersections occur. Let me know if this helps with any of the multifamily building cases you had seen!
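A rough sketch of how such a `count_cutoff` parameter can gate the threshold filtering. The column names and the percent-of-max metric here are illustrative assumptions; the real implementation is in the gist.

```python
import pandas as pd

def filter_fragments(pieces, area_cutoff=0.1, count_cutoff=5):
    """Drop small footprint fragments, unless the footprint intersects
    more than count_cutoff parcels (e.g. dense urban blocks)."""
    grouped = pieces.groupby("footprint_id")["intersection_area"]
    pct_of_max = pieces["intersection_area"] / grouped.transform("max")
    n_intersections = grouped.transform("count")
    # Keep a fragment if it clears the area threshold OR the footprint
    # is split across many parcels.
    return pieces[(pct_of_max >= area_cutoff) | (n_intersections > count_cutoff)]
```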
Can this be safely marked done following yesterday's improvement and discussion between you and Fennis? @ian-r-rose
The count cutoff feature is working as intended, aside from the addition of new geometry types in the primary Geometry field. Export of the resulting features works great for Polygon and MultiPolygon features. This resolves all the multifamily instances we reviewed during our previous call!
An additional geometry type, GeometryCollection, is now produced in some instances; it cannot be exported to an ESRI Shapefile or geodatabase and requires further transformation. A great example of this is in Alameda County at block_geoid 060014507432048. Is there a way to easily convert these features to Polygon or MultiPolygon?
Thank you!
Hi @fennisreed, sorry for the slow reply here, and thanks for the nice reproducible case!
I took a look at the block you pointed out in the Alameda County dataset, and it seems that the issue is coming from the parcels dataset. There are a few parcels that are LineStrings, MultiLineStrings, and GeometryCollections. Presuming that there are no valid 1-dimensional parcels, I think we can extend the parcel cleaning step with the following snippet, which does three things:
```python
from shapely.ops import unary_union

# Ensure the parcel geometries are valid
parcels = parcels.assign(geometry=parcels.geometry.make_valid())

# Drop any LineString or MultiLineString geometries as impossible parcels
parcels = parcels[~parcels.geom_type.isin(("LineString", "MultiLineString"))]

# Fix up any GeometryCollections to be polygon-only
fixed = (
    parcels[parcels.geom_type == "GeometryCollection"]
    .geometry
    .apply(
        lambda s: unary_union(
            [g for g in s.geoms if g.geom_type in ("Polygon", "MultiPolygon")]
        )
    )
)
parcels.loc[fixed.index, "geometry"] = fixed
```
This should be done before joining with the footprints dataset.
Does this resolve the issue? If so, I'll update the gist with the new preprocessing step.
Hi @ian-r-rose ,
Last week I was able to validate the latest portion of this script. Of the counties run, this seems to resolve any outstanding issues with geometry collections, small lines, or other sliver features brought in with the parcel dataset.
This script is a huge improvement to my process and will add considerable nuance when combining parcel and footprint data.
Thank you so much for this!
Thanks for the update @fennisreed! Closing as complete
ParcelQuest data is private, so we can't share any derived data products via our S3 bucket. Instead, we are generating some local utility functions that allow us to do the following:
- Take our public building footprints datasets hosted in S3 and do a spatial join on any subset of parcel shapes.
- Write some heuristics for footprints which intersect more than one parcel (specific numbers are made up; we should make them tune-able and set good defaults in collaboration with @fennisreed):
  - If a footprint is ~90% in a parcel, just assign it to that parcel and drop any others that might intersect with it.
  - If a footprint has ~30-70% intersections with a parcel, split the footprint and assign the sub-footprints to the relevant parcels.
- The resulting footprints-with-parcels dataset should be able to easily generate the following "final" data files:
  - A table with Parcel Index and building footprint area (i.e., the corrected area of the footprint as processed above)
  - A spatial file (shapefile, parquet, or otherwise) with Parcels as primary geometry, footprint area, and the geometry of all footprints that might exist on the parcel, aggregated/ST_UNION'd
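As a rough sketch of those heuristics, the per-intersection decision could look like the following; the thresholds and helper name are illustrative placeholders, not the tuned defaults.

```python
def classify_fragment(pct, assign_cutoff=0.9, split_floor=0.3):
    """Decide what to do with one footprint/parcel intersection, given the
    fraction (pct) of the footprint's area that falls in the parcel."""
    if pct >= assign_cutoff:
        return "assign_whole"  # ~90%+ in one parcel: assign the whole footprint
    if pct >= split_floor:
        return "split"         # ~30-70%: split the footprint at the boundary
    return "drop"              # small sliver: discard
```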