PHSKC-APDE / spatagg

An R package to facilitate the spatial aggregation of estimates between non-nesting geographies
Other
0 stars 0 forks source link

README

spatagg

spatagg provides functions for the aggregation of estimates between non-nesting geographies. For example, “converting data at the census tract level to the ZIP code level.

Two main approaches are provided:

  1. Geographic/Fractional Approach: A target geography value is constructed from overlapping source geographies based on fractional overlap. If target T is covered 60/40 by geographies A and B, then: $T = .6A + .4B$

  2. Point weighted population: Instead of using geographic overlap as the weights for aggregation (e.g. #1), the population overlap is used instead. THIS IS THE ONE YOU USUALLY WANT TO USE!

THINGS TO BE AWARE OF

  1. BEWARE OF EDGE EFFECTS, especially when using the point pop option with the parcel population from the kcparcelpop package. Things like ZIP codes cross county boundaries. For ZIP codes specifically, I think this generally means all estimates in ZIPs that cross the boundary will be fully assigned to intersecting targets within King County. This may or may not be what you want. I encourage folks to double check the crosswalk values/weights/xwalk_df when using geographies that are outside King County and you are using the parcel pop option.
  2. BEWARE OF WATER, especially when using geographic overlap. In general, make sure source and target have similar ideas of what land is.
  3. THE proportion ARGUMENT IN CROSSWALK IS NOT JUST FOR proportion data. Basically, if your data is not count data, set proportion = T. Or at a minimum, try both and see which makes sense. The validation vignette describes this quandary in greater detail (and so does ?crosswalk).
  4. Don’t forget the rescale option exists. Review ?crosswalk to figure out if you want it toggled or not.

Other fun doodads

  1. The function assign_cases can take a set of crosswalk instructions and probabilistically apply it to microdata. This is a relatively new function, although it is tested. That said, best to manually review its results for sensibility.
  2. The function reduce_overlaps can be used to fix up geographies (in some cases) so that check_internal_consistency quiets down. This function does slightly weird things to the output geography, so don’t use it for mapping or anything. If you encounter a geography that you got from the APDE shapefile repository that fails a check_internal_consistency test, please let Daniel Casey know.
  3. Review the validation vignette to see how geographic overlap compares to population based ones

Installation

# You will need the remotes package
remotes::install_github('https://github.com/PHSKC-APDE/spatagg')

# You may also want to install the kcparcelpop package
# Useful for population weighted aggregation
remotes::install_github('https://github.com/PHSKC-APDE/kcparcelpop/')

Examples

Construct test geographies

The code below creates a 3 x 3 grid of squares as a set of “source” polygons. The source polygon contains the information that needs to be translated/aggregated to “target” polygon. In “real” instances, source polygons will be stuff like census tracts or ZIP codes.

Two target polygons are created for demonstration purposes. The first nests nicely with the source polygons (e.g. can be exactly constructed by source polygons) whereas the second overlaps the source polygons but does not nest. In this example, each “target” is just one polygon, but it could be a collection of polygons– e.g. cities, regions, or ZIP codes.

In general, you want the source polygons to be smaller than the target polygons.

library('spatagg')
library('sf', quietly = TRUE)
Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library('ggplot2')
library('knitr', quietly = TRUE)
library('data.table')

set.seed(1)
#create a test grid of bottom left points
grid_size = 3
grid = expand.grid(x = seq_len(grid_size)-1, y = seq_len(grid_size)-1)

# given a bottom left corner, create a square
poly_points = function(x, y, step =1){
  mat = matrix(c(x, y, # bottom left
                 x+step, y, # bottom right
                 x+step, y+step, # top right
                 x, y+step, # top left
                 x, y # close
  ),ncol = 2, byrow = TRUE)
}

# remove the outbounds of the grid as they'll get computed
source_poly = lapply(seq_len(nrow(grid)), function(i) {
  x = grid[i,'x']
  y = grid[i, 'y']

  r = data.frame(id = i)
  r$geom = sf::st_sfc(sf::st_polygon(list(poly_points(x,y,1))))

  sf::st_sf(r, sf_column_name = 'geom')

})
source_poly = do.call(rbind, source_poly)

#add some random data to source_poly
source_poly$denom = round(runif(nrow(source_poly), 100,200))
source_poly$numer = round(runif(nrow(source_poly), 50,100))
source_poly$frac = with(source_poly, numer/denom)

# total overlap target
tp1 = data.frame(id = 1)
tp1$geom = sf::st_sfc(sf::st_polygon(list(poly_points(0,0,2))))
tp1 = sf::st_sf(tp1, sf_column_name = 'geom')

# partial overlap target
tp2 = data.frame(id = 1)
tp2$geom = sf::st_sfc(sf::st_polygon(list(poly_points(.5,.5,2))))
tp2 = sf::st_sf(tp2, sf_column_name = 'geom')

# graph them
  ggplot() + geom_sf(data = source_poly, fill = NA) +
    #geom_sf(data = target_poly, fill = NA, color = 'red') +
    geom_sf(data = tp1, fill = NA, color = 'purple', size = 1.2) +
    geom_sf_label(data = source_poly, aes(label = id)) +
    theme_classic() + 
    ggtitle('Source polygons nest within target',
            'Source = black, target = purple')

  ggplot() + geom_sf(data = source_poly, fill = NA) +
    geom_sf(data = tp2, fill = NA, color = 'purple', size = 1.2) +
    geom_sf_label(data = source_poly, aes(label = id)) +
    #geom_sf(data = overlap_poly, fill = NA, color = 'purple') +
    theme_classic() + 
    ggtitle('Source polygons do not cleanly nest within target',
           'Source = black, target = purple')

Geographic Overlap Aggregation

Aggregate between nesting geographies

The translation of a set of estimates/data from source to target occurs in two main steps. First, a set of weights that describe how to convert from source to target are generated via create_xwalk. Second, those weights are applied via crosswalk.

Construct the geographic crosswalk

The following command creates a set of geographic overlap weights. By default, create_xwalk expects that source and target overlap by at least 75%. This default is set high because I expect most instances of source/target overlap to broadly cover the same area (e.g. two different ways of carving up KC). However, tp1 and source_poly only overlap by 44%, so the min_overlap option is reduced to prevent an error.

cw1 = create_xwalk(source_poly, tp1,
                   source_id = 'id', target_id = 'id', min_overlap = .4)
knitr::kable(cw1)
source_id target_id s2t_fraction isect_amount tcoverage_amount target_amount
1 1 1 1 4 4
2 1 1 1 4 4
4 1 1 1 4 4
5 1 1 1 4 4

The resulting crosswalk data.frame returns a row per each source:target pair. s2t_fraction represents the percent of a source polygon that falls within the specified target polygon. isect_amount is the area of the intersection (in the units squared of the CRS of source_poly). tcoverage_amount is the area of target covered by source polygons. For example, in this example, 4 units^2 of target is covered by source polygons. target_amount is the total area of the target polygon (by target_id). Since tcoverage_amount == target_amount we can infer that tp1 is fully covered by source_poly (which was already known, but hey, its fun to confirm).

Convert between source and target

The crosswalk function has the following arguments:

# for counts
r = crosswalk(source_poly, source_id = 'id',
              est = 'numer',proportion = FALSE, xwalk_df = cw1)

# it should be equal to the four polygons tp1 overlaps with
r$est
[1] 266
r$est == sum(source_poly$numer[c(1,2,4,5)])
[1] TRUE
# for proportions
r2 = crosswalk(source_poly, source_id = 'id',
              est = 'frac',proportion = TRUE, xwalk_df = cw1)
r2$est
[1] 0.4675174
r2$est == mean(source_poly$frac[c(1,2,4,5)])
[1] TRUE
Proportions and counts are not the same thing

In general, and where possible, its better to convert counts to a new geography rather than metrics with the denominator already included. Or at least, its worth thinking about what you are actually converting from source to target. Using counts implicitly has the aggregation weights combine source population with the geographic overlaps. In the case below, things are close, but not the same.

r$est/sum(source_poly$denom[c(1,2,4,5)])
[1] 0.4626087
r2$est
[1] 0.4675174
r$est/sum(source_poly$denom[c(1,2,4,5)]) == r2$est
[1] FALSE

Aggregate between non-nesting geographies

Unlike the previous example (using tp1), tp2 does not cleanly intersect with source_poly. However, the general process is the same.

Construct the geographic crosswalk
cw2 = create_xwalk(source_poly, tp2,
                   source_id = 'id', target_id = 'id', min_overlap = .4)
knitr::kable(cw2)
source_id target_id s2t_fraction isect_amount tcoverage_amount target_amount
1 1 0.25 0.25 4 4
2 1 0.50 0.50 4 4
3 1 0.25 0.25 4 4
4 1 0.50 0.50 4 4
5 1 1.00 1.00 4 4
6 1 0.50 0.50 4 4
7 1 0.25 0.25 4 4
8 1 0.50 0.50 4 4
9 1 0.25 0.25 4 4
Convert between source and target
# for counts
r3 = crosswalk(source_poly, source_id = 'id',
              est = 'numer',proportion = FALSE, xwalk_df = cw2)

# it should be equal to the four polygons tp1 overlaps with
r3$est
[1] 299.75
compare = merge(cw2[,c('source_id', 's2t_fraction')], source_poly,
                by.x = 'source_id', by.y = 'id')
r3$est == sum(compare$numer * compare$s2t_fraction)
[1] TRUE
# for proportions
r4 = crosswalk(source_poly, source_id = 'id',
              est = 'frac',proportion = TRUE, xwalk_df = cw1)
r4$est
[1] 0.4675174

Point population overlap

In addition to converting data between source and target based on how they geographically interact, spatagg provides routines to do aggregation based on how population falls within both source and target. The kcparcelpop package provides a few datasets to facilitate this for King County. However, for this example, we will generate some population of our own.

ptpop = st_sample(source_poly, 100)
ptpop = data.frame(id = seq_len(100), 
                   pop = round(runif(100, 100,200)), 
                   ptpop)
ptpop = st_sf(ptpop, sf_column_name = 'geometry')

ggplot() + geom_sf(data = source_poly, fill = NA) +
    #geom_sf(data = target_poly, fill = NA, color = 'red') +
    geom_sf(data = ptpop, aes(color = pop)) +
    theme_minimal() + 
    ggtitle('Population points')

Create the point population crosswalk

# "Nested"
ppcw1 = create_xwalk(source_poly, tp1, source_id = 'id', target_id = 'id',
                     method = 'point pop', point_pop = ptpop, min_overlap = .1)
knitr::kable(ppcw1)
source_id target_id s2t_fraction isect_amount tcoverage_amount target_amount
1 1 1 669 6584 6584
2 1 1 2208 6584 6584
4 1 1 1949 6584 6584
5 1 1 1758 6584 6584
# "Not nested"
ppcw2 = create_xwalk(source_poly, tp2, source_id = 'id', target_id = 'id',
                     method = 'point pop', point_pop = ptpop, min_overlap = .1)
knitr::kable(ppcw2)
source_id target_id s2t_fraction isect_amount tcoverage_amount target_amount
1 1 0.1704036 114 7691 7691
2 1 0.5493659 1213 7691 7691
3 1 0.4249292 600 7691 7691
4 1 0.5987686 1167 7691 7691
5 1 1.0000000 1758 7691 7691
6 1 0.7321867 1192 7691 7691
7 1 0.3865415 494 7691 7691
8 1 0.5329341 890 7691 7691
9 1 0.1619458 263 7691 7691

The output of a “point pop” crosswalk is structured similarly to the fractional geographic overlap ones computed above. The main difference is the units of isect_amount, tcoverage_amount and target_amount are population instead of area (e.g. unit^2)

Convert estimates from source to target

Use the crosswalk function to convert from source to target.

# for counts
# nesting for fun
r5 = crosswalk(source_poly, source_id = 'id',
              est = 'numer',proportion = FALSE, xwalk_df = ppcw1)

# it should be equal to the four polygons tp1 overlaps with
r5$est
[1] 266
compare = merge(ppcw1[,c('source_id', 's2t_fraction')], source_poly,
                by.x = 'source_id', by.y = 'id')
r5$est == sum(compare$numer * compare$s2t_fraction)
[1] TRUE
# for proportions
# non-nesting, just for fun
r6 = crosswalk(source_poly, source_id = 'id',
              est = 'frac',proportion = TRUE, xwalk_df = ppcw2)
r6$est
[1] 0.4802856

Propagating uncertainty

Sometimes source estimates might include some metric of uncertainty that a user wishes to propagate to the target geography. Passing a column of standard errors via the se argument will compute, via simulation, a se for the target.

# create an se for source_poly
source_poly$se_frac = runif(nrow(source_poly), .01, .10)

r7 = crosswalk(source_poly, source_id = 'id',
              est = 'frac',proportion = TRUE, xwalk_df = ppcw2,
              se = 'se_frac')
print(r7)
  target_id       est         se
1         1 0.4806777 0.02674552

When source doesn’t totally cover target

sp = source_poly[c(1,2,4,5),]

ggplot() + geom_sf(data = sp, fill = NA, color = 'black', size = 1.1) +
  geom_sf(data = tp2, fill = NA, color = 'purple', size = 1.2) +
  #geom_sf(data = overlap_poly, fill = NA, color = 'purple') +
  theme_minimal() + 
  ggtitle('Source polygons do not totally encompass target',
         'Source = black, target = purple')

When scaling counts, the rescale argument (set to TRUE) can adjust the counts upwards to approximate full overlap.

# Geographic overlap
non_geo = create_xwalk(sp, tp2,
                   source_id = 'id', target_id = 'id', min_overlap = .1)

# no scaling
r8 = crosswalk(sp, source_id = 'id',
              est = 'numer',proportion = FALSE, xwalk_df = non_geo,
              rescale = FALSE)
print(r8)
  target_id    est
1         1 154.25
#yes scaling
r9 = crosswalk(sp, source_id = 'id',
              est = 'numer',proportion = FALSE, xwalk_df = non_geo,
              rescale = TRUE)
print(r9)
  target_id      est
1         1 274.2222
# Population overlap
non_pop = create_xwalk(sp, tp2, source_id = 'id', 
                       target_id = 'id', min_overlap = .1,
                       method = 'point pop', point_pop = ptpop)
# no scaling
r10 = crosswalk(sp, source_id = 'id',
              est = 'numer',proportion = FALSE, xwalk_df = non_pop,
              rescale = FALSE)
print(r10)
  target_id      est
1         1 161.2899
#yes scaling
r11 = crosswalk(sp, source_id = 'id',
              est = 'numer',proportion = FALSE, xwalk_df = non_pop,
              rescale = TRUE)
print(r11)
  target_id      est
1         1 291.7405

It should be noted that rescaling is not relevant when proportion = TRUE.

Assigning microdata to areas via assign_cases

Whereas the crosswalk function translates estimates from one areal unit to another via the output of a create_xwalk call, assign_cases does the same for microdata by probabilistically assigning each case from a source geography to a target geography.

For commonly used datasets, assign_cases should be run centrally and have its results saved/attached to the microdata for reproducibility across analysts. assign_cases is best used for aggregation or translation between similarly sized geographies (e.g. ZIP to HRAs). While it can be used for dis-aggregation, do so with caution.

# create some fake data
fake_micro = data.frame(rid = 1:10000)
fake_micro$id = sample(source_poly$id, nrow(fake_micro), replace = T)

# two split down the middle polygons
m1 = t(matrix(c(0,0,1.5,0,1.5,3,0,3,0,0), ncol = 5))
m2 = m1
m2[,1] <- m2[,1] + 1.5

tp3 = sf::st_sfc(st_polygon(list(m1)), st_polygon(list(m2)))
tp3 = st_sf(data.frame(id = 1:2, geom = tp3))

ggplot() + geom_sf(data = source_poly, fill = NA) +
    geom_sf(data = tp3, fill = NA, color = 'purple', size = 1.2) +
    geom_sf_label(data = source_poly, aes(label = id)) +
    #geom_sf(data = overlap_poly, fill = NA, color = 'purple') +
    theme_classic() + 
    ggtitle('Two target polyons with some down the middle splits',
           'Source = black, target = purple')

# create the crosswalk instructions
xw3 = create_xwalk(source_poly, tp3, 'id', 'id')

# Use assign cases
fake_micro$target_id = assign_cases(source = fake_micro, 'id', xw3)

# summarise results (using data.table because I can' bother with base R)
setDT(fake_micro)

fake_micro = fake_micro[, .N, keyby = .(id, target_id)]
fake_micro[, percent := round(100 *N/sum(N)), id]
# should be about 50/50
knitr::kable(fake_micro[id %in% c(2,5,8),])
id target_id N percent
2 1 556 51
2 2 541 49
5 1 524 49
5 2 549 51
8 1 518 48
8 2 566 52
# The rest should be 100%
knitr::kable(fake_micro[!id %in% c(2,5,8),])
id target_id N percent
1 1 1068 100
3 2 1118 100
4 1 1142 100
6 2 1133 100
7 1 1178 100
9 2 1107 100