elbersb / segregation

R package to calculate entropy-based segregation indices, with a focus on the Mutual Information Index (M) and Theil’s Information Index (H)
https://elbersb.com/segregation

local segregation scores by a group within data structure #7

Closed kaseyzapatka closed 3 years ago

kaseyzapatka commented 3 years ago

Hi Ben,

Thanks so much for putting segregation together; it's a great package, very well documented, and tidyverse friendly. I'm hoping to use it to create measures of M and H for my dissertation analyses. I have what I'm sure is a fairly simple coding problem (I'm a recent R convert from Stata, so my question might be due to a lack of sophisticated R skills).

I have data on every census tract in the country for 2000, 2009-2013, and 2015-2019 for five race-ethnic groups: non-Hispanic White, non-Hispanic Black, non-Hispanic Asian, Hispanic, and non-Hispanic Other. Here is a dropbox link to sample data with all 2019 census tracts in the NYC and Philly CBSAs.

I want to group the data by CBSA (and potentially by counties within CBSAs if possible) and calculate local segregation scores based on each tract's deviation from its respective CBSA distribution, instead of all tracts in the country. I've broken my data into one dataset for each year to reduce coding complexity. So, the following example data frame contains all census tracts in NYC and Philly CBSAs and takes the following format:

Rows: 30,095 
Columns: 4

$ tractid <chr> "10003000200", "10003000200", "10003000200", "10003000200", "10003000200", "10003000300", "10003000300", "10003000300", "10003000300", "10003000300", "10003000400", "10003000400", "10003000400", "10003000400", "10003000400", "10003000500", "10003000500", "10003000500", "10003000500", "1000300…
$ CBSA    <dbl> 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980, 37980,…
$ race    <chr> "hisp", "nhwht", "nhblk", "nhasian", "nhother", "hisp", "nhwht", "nhblk", "nhasian", "nhother", "hisp", "nhwht", "nhblk", "nhasian", "nhother", "hisp", "nhwht", "nhblk", "nhasian", "nhother", "hisp", "nhwht", "nhblk", "nhasian", "nhother", "hisp", "nhwht", "nhblk", "nhasian", "nhother", "hisp…
$ n       <dbl> 583, 1542, 3102, 54, 70, 245, 371, 2356, 0, 28, 129, 1266, 1421, 0, 129, 13, 249, 3510, 0, 137, 76, 93, 3081, 0, 61, 47, 177, 2496, 57, 9, 175, 53, 2030, 15, 71, 72, 2531, 672, 75, 30, 65, 1637, 73, 6, 12, 50, 3170, 97, 94, 113, 302, 1188, 648, 65, 29, 121, 1112, 897, 23, 99, 214, 512, 1478, …

Running the following code produces local segregation scores for each tract:

local <- NYC_Philly_2019 %>% 
  # create local segregation measures
  mutual_local(data = .,
               group = "race",
               unit = "tractid",
               weight = "n",
               wide = TRUE) %>% 
  glimpse()

Rows: 5,951 
Columns: 3

$ tractid <chr> "10003000200", "10003000300", "10003000400", "10003000500", "10003000601", "10003000602", "10003000900", "10003001100", "10003001200", "10003001300", "10003001400", "10003001500", "10003001600", "10003001902", "10003002100", "10003002200", "10003002300", "10003002400", "10003002500", "1000300…
$ ls      <dbl> 0.45510596, 0.94897736, 0.39931576, 1.36705316, 1.45092867, 1.28339985, 1.25663503, 0.24859915, 0.41385348, 0.41812760, 0.09038234, 0.26616153, 0.66185397, 0.68981707, 0.95899577, 0.62172801, 0.65428158, 0.31867252, 0.16160455, 0.46131412, 0.36911558, 0.29864495, 1.45383063, 0.88663285, 0.240…
$ p       <dbl> 2.108904e-04, 1.182342e-04, 1.160666e-04, 1.540592e-04, 1.304912e-04, 1.098002e-04, 9.238033e-05, 1.332105e-04, 7.066465e-05, 1.388858e-04, 8.796626e-05, 8.875448e-05, 8.713862e-05, 8.465570e-05, 7.377815e-05, 9.923792e-05, 9.186798e-05, 1.712820e-04, 1.249342e-04, 1.438910e-04, 1.008538e-04,…

However, I want local segregation scores that report each tract's deviation from its respective CBSA distribution, not from all tracts in the country. Do you know how I could do this? I tried simply grouping by CBSA before running mutual_local, but I seem to get the same scores regardless. My second thought was to loop mutual_local over a vector of CBSA values, but that just got me two identical sets of results, one per CBSA.

Here are my attempts at producing different scores by CBSA using group_by and map_df:

# group by CBSA
local_group <- NYC_Philly_2019 %>% 
  group_by(CBSA) %>%
  # create local segregation measures
  mutual_local(data = .,
               group = "race",
               unit = "tractid",
               weight = "n",
               wide = TRUE) %>% 
  glimpse()

# map_df over CBSA codes
cbsa <- c("35620", "37980")

ls_by_CBSA <- map_df(cbsa, function(x) {
  NYC_Philly_2019 %>% 
    mutual_local(data = .,
                 group = "race",
                 unit = "tractid",
                 weight = "n",
                 wide = TRUE)
})

If I compare all three data frames, I get the same results for each, but I would expect them to differ if the scores were calculated by CBSA.


# compare all three returned dataframes 
# NYC tract
local %>%  filter(tractid == "34003001000") %>% glimpse()
local_group %>%  filter(tractid == "34003001000") %>% glimpse()
ls_by_CBSA %>%  filter(tractid == "34003001000") %>% glimpse()

# Philly tract
local %>%  filter(tractid == "10003000200") %>% glimpse()
local_group %>%  filter(tractid == "10003000200") %>% glimpse()
ls_by_CBSA %>%  filter(tractid == "10003000200") %>% glimpse()

I don't see a within argument like the one mutual_total has, but maybe I missed something? Any help would be greatly appreciated!

Thanks, Kasey.

elbersb commented 3 years ago

You're roughly on the right track. Your map_df approach works if you do the following:

cbsa <-  c(35620, 37980) 

ls_by_CBSA <- map_df(cbsa, function(x) {
  NYC_Philly_2019 %>% 
    dplyr::filter(CBSA == x) %>%
    mutual_local(data = ., 
               group = "race",
               unit = "tractid", 
               weight = "n",
               wide = TRUE) 
})

But then you don't have the CBSA codes in the result. So what I'd do is this:


NYC_Philly_2019 %>%
  group_by(CBSA) %>%
  group_modify(~ mutual_local(data = .x,
                              group = "race",
                              unit = "tractid",
                              weight = "n",
                              wide = TRUE))

This is the simplest approach, and you can group_by year and other characteristics as well.
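For example, if you stack the yearly files into one data frame with a year column (called all_years here, just a made-up name), the same pattern gives local scores per CBSA and year. This is an untested sketch:

library(dplyr)
library(segregation)

# local scores within each CBSA-year combination
all_years %>%
  group_by(CBSA, year) %>%
  group_modify(~ mutual_local(data = .x,
                              group = "race",
                              unit = "tractid",
                              weight = "n",
                              wide = TRUE))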

Let me know if that works!

kaseyzapatka commented 3 years ago

Thanks, @elbersb. Your code worked perfectly and I was able to incorporate group_modify into my general workflow, which made it much easier to get various segregation estimates from your segregation package with my data structure. Thanks for the coding tip!

One final question, is it possible to decompose differences over time by CBSA so I have a decomposition for each CBSA?

I don't think the group_modify approach works in this case, because mutual_difference requires two inputs and I don't know how to pipe two inputs into the function. The map2_dfr approach ran (code below), but

  1. I'm still missing something, because it doesn't seem to have run the analysis separately for each CBSA; it reports the same statistics five times, which I think is due to the input data structure, and
  2. I lose the CBSA code, as you pointed out in the previous post, so I don't know which decomposition goes with which metro.

Any thoughts? I can always do it manually, but then I'd have to be choosy about which CBSAs I run the decomposition for. Thanks again!

Sample 2000 and 2019 data if helpful.


# test mapping function over both CBSAs ----------------------------------------

# trying to be fancy and split the dataset and loop over resulting list 
# by_information_data <- information_data %>% 
#   filter(year != 2013) %>%  
#   split(.$year) %>% 

test <- map2_dfr(NYC_Philly_2019, NYC_Philly_2019,
                 ~ mutual_difference(data1 = NYC_Philly_2000,
                                     data2 = NYC_Philly_2019,
                                     group = "race",
                                     unit = "tractid",
                                     weight = "n",
                                     method = "shapley"))

print(test)
View(test)
elbersb commented 3 years ago

group_modify takes any function, so the easiest approach is to put the two years together into one dataset (via bind_rows or similar) and then use something like this:

diff <- function(df, group) {
  y1 <- filter(df, year == 2000)
  y2 <- filter(df, year == 2019)
  mutual_difference(y1, y2, group = "race", unit = "tractid", weight = "n")
}

data %>%
  group_by(cbsa) %>%
  group_modify(diff)

I haven't tested this exact code, but I've used this pattern often in my own work.
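For example, with the data frames named earlier in this thread (NYC_Philly_2000 and NYC_Philly_2019), the whole thing might look like this; again an untested sketch:

library(dplyr)
library(segregation)

# stack the two years, keeping a year column so diff() can split them again
combined <- bind_rows(
  mutate(NYC_Philly_2000, year = 2000),
  mutate(NYC_Philly_2019, year = 2019)
)

# one decomposition per CBSA, with the CBSA code kept in the result
combined %>%
  group_by(CBSA) %>%
  group_modify(diff)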

kaseyzapatka commented 3 years ago

Thanks @elbersb, that worked perfectly. It was so easy/obvious once I saw it in action. Thanks for all your help with learning this package. I really like it, its versatility, and what this approach will allow me to do in my analyses. Looking forward to seeing what I find.

elbersb commented 3 years ago

Glad it worked!

elbersb commented 3 years ago

Finally added this to the FAQ: https://elbersb.github.io/segregation/articles/faq.html#how-can-i-compute-indices-for-different-areas-at-once-

Let me know what else would be useful to put there!
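The same group_modify pattern also works for the total indices, for example (untested sketch, reusing the data frame and column names from this thread):

library(dplyr)
library(segregation)

# total M and H per CBSA
NYC_Philly_2019 %>%
  group_by(CBSA) %>%
  group_modify(~ mutual_total(data = .x,
                              group = "race",
                              unit = "tractid",
                              weight = "n"))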