Grist-Data-Desk / land-grab-2

Code and methodology to produce the dataset in Grist's Misplaced Trust investigation
https://grist.org/project/indigenous/land-grant-universities-indigenous-lands-fossil-fuels
Creative Commons Zero v1.0 Universal

feat: Export a GeoJSON for university summary. #47

Closed parkerziegler closed 10 months ago

parkerziegler commented 10 months ago

This PR is a follow-up to #40 that adds support for GeoJSON exports of the university summary. I also want to chat a bit about the approach used here to ensure we're all OK with it! The same approach is used for the tribe summaries.

Generating MultiPolygons for the Summaries

For both the university and tribe summaries, we want to generate GeoJSON files using MultiPolygons as the core geometry type. For the university summary, each MultiPolygon represents the collection of parcel Polygons corresponding to one university. Likewise, for the tribe summaries, each MultiPolygon represents the collection of parcel Polygons that share (1) a present-day tribe, for tribe-summary-condensed.geojson, or (2) common values for 'gis_acres', 'present_day_tribe', 'rights_type', 'university', 'state', and 'cession_number', for tribe-summary.geojson.

To generate these MultiPolygons, we use dissolve—the spatial equivalent of groupby—on the WGS84 version of the dataset generated as the output of Stage 3. We dissolve by the university column for the university summary, the present_day_tribe column for the condensed tribe summary, and the list of fields above for the full tribe summary.
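To make the "spatial groupby" idea concrete, here is a minimal plain-Python sketch that works on GeoJSON dicts. It is a simplified stand-in, not the code in this PR: it only collects each group's Polygon coordinates into one MultiPolygon, whereas geopandas' dissolve also unions the geometries (merging shared boundaries). The `dissolve_features` and `square` helpers and the sample parcels are hypothetical.

```python
from collections import defaultdict


def dissolve_features(features, by):
    """Group GeoJSON features by the properties named in `by` and return
    one MultiPolygon feature per group (geometries collected, not unioned)."""
    groups = defaultdict(list)
    for feature in features:
        key = tuple(feature["properties"][col] for col in by)
        geom = feature["geometry"]
        if geom["type"] == "Polygon":
            groups[key].append(geom["coordinates"])
        elif geom["type"] == "MultiPolygon":
            groups[key].extend(geom["coordinates"])
        else:
            raise ValueError(f"Unsupported geometry type: {geom['type']}")
    return [
        {
            "type": "Feature",
            "properties": dict(zip(by, key)),
            "geometry": {"type": "MultiPolygon", "coordinates": polygons},
        }
        for key, polygons in groups.items()
    ]


def square(x, y):
    """Coordinates of a unit-square Polygon (a single closed ring)."""
    return [[[x, y], [x + 1, y], [x + 1, y + 1], [x, y + 1], [x, y]]]


def parcel(university, tribe, x, y):
    """Hypothetical parcel feature with only the columns used below."""
    return {
        "type": "Feature",
        "properties": {"university": university, "present_day_tribe": tribe},
        "geometry": {"type": "Polygon", "coordinates": square(x, y)},
    }


# Made-up parcels standing in for the WGS84 dataset.
parcels = [
    parcel("University A", "Tribe X", 0, 0),
    parcel("University A", "Tribe Y", 5, 5),
    parcel("University B", "Tribe X", 10, 10),
]

by_university = dissolve_features(parcels, by=["university"])
by_tribe = dissolve_features(parcels, by=["present_day_tribe"])
by_both = dissolve_features(parcels, by=["university", "present_day_tribe"])
```

The same helper covers all three summaries by changing the `by` list, which mirrors how a single dissolve call can be reused with a different set of grouping columns.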

Preserving the In-Place Aggregations

Step 4 already had significant aggregations and reshaping in place by the time I started this PR. At first, I considered replicating these transformations in the GeoDataFrame alongside the original pandas DataFrame. This quickly got hairy, so I went with an alternative approach:

  1. Keep the existing transformations in place throughout, operating on a freshly created pandas DataFrame. We load the same data from Step 3's output, but from the GeoJSON in lieu of the CSV. Then we hand off the GeoDataFrame, with its geometry column dropped, to the existing aggregation functions.
  2. Perform the dissolve on the GeoDataFrame, using the exact same set of columns we use for the groupby. This gives us access to the MultiPolygon geometries. We only keep the geometry column from this transformation and fields necessary for the merge (see below)—all other columns are dropped.
  3. Merge the pandas DataFrame produced by the existing transformations with the dissolved GeoDataFrame. This is essentially an attribute join in GIS land.
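The three steps above can be sketched in plain Python on GeoJSON dicts. This is an illustration under stated assumptions, not the PR's implementation: the sample features are made up, summing 'gis_acres' per university is a hypothetical stand-in for the existing aggregation functions, and geometries are collected per group rather than unioned as a real dissolve would do.

```python
from collections import defaultdict


def square(x, y):
    """Coordinates of a unit-square Polygon (a single closed ring)."""
    return [[[x, y], [x + 1, y], [x + 1, y + 1], [x, y + 1], [x, y]]]


# Made-up stand-in for the GeoJSON features in Step 3's output.
features = [
    {"type": "Feature",
     "properties": {"university": "University A", "gis_acres": 10.0},
     "geometry": {"type": "Polygon", "coordinates": square(0, 0)}},
    {"type": "Feature",
     "properties": {"university": "University A", "gis_acres": 5.0},
     "geometry": {"type": "Polygon", "coordinates": square(5, 5)}},
    {"type": "Feature",
     "properties": {"university": "University B", "gis_acres": 2.5},
     "geometry": {"type": "Polygon", "coordinates": square(10, 10)}},
]

# 1. Tabular side: drop the geometry and run the aggregation on the
#    attributes alone (hypothetical aggregation: total acres per university).
rows = [feature["properties"] for feature in features]
acres_by_university = defaultdict(float)
for row in rows:
    acres_by_university[row["university"]] += row["gis_acres"]

# 2. Spatial side: group on the same key, keeping only the key and the
#    geometry; every other column is dropped.
polygons_by_university = defaultdict(list)
for feature in features:
    key = feature["properties"]["university"]
    polygons_by_university[key].append(feature["geometry"]["coordinates"])

# 3. Attribute join: attach each group's MultiPolygon to its summary row.
merged = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"university": university, "gis_acres": acres},
            "geometry": {
                "type": "MultiPolygon",
                "coordinates": polygons_by_university[university],
            },
        }
        for university, acres in acres_by_university.items()
    ],
}
```

Because the tabular and spatial sides share nothing but the grouping key, the aggregation logic in step 1 can change without touching how the geometries are built.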

With this approach, we have minimal changes to the existing aggregations. In essence, we're just using dissolve to grab the MultiPolygon geometries and joining them to the existing summaries. Let me know how we feel about this approach! I think the upside is that it keeps the blast radius of this change small.

I'll call out the few places in the PR where I hit errors when running on main and needed to introduce fixes.