elbersb / segregation

R package to calculate entropy-based segregation indices, with a focus on the Mutual Information Index (M) and Theil’s Information Index (H)
https://elbersb.com/segregation

Calculating H (Theil's index) using tracts within counties #12

Closed kaseyzapatka closed 11 months ago

kaseyzapatka commented 11 months ago

Hi @elbersb,

I'm working on a project in which we use your segregation package to calculate a few segregation indices. Ultimately, my PI and I want to calculate H (Theil's index) from tract data within counties, using 2021 5-year ACS data, so that we have a segregation score for every county. My dataframe is therefore at the census tract level, nested within counties. Our understanding is that calculating H (Theil's index) for every county requires race data for every tract (maybe this is wrong?). Again, what we ultimately want is a measure of within-county segregation.

Using the mutual_local function, I had two thoughts:

Option 1: My thought was to specify tract as the unit grouped by county, but this returns a score for every tract as shown below, which is too many.

data %>%
  group_by(county_fips) %>%
  mutual_local(
    group = "race",        # characteristic
    unit = "tract_fips",   # spatial unit
    weight = "n",
    wide = TRUE
  ) %>%
  glimpse()

Rows: 63,949
Columns: 3
$ tract_fips <chr> "01000100", "01000200", "01000203", "01000300", "01000301", "01000302", "01000400", "01000401", "01000402", "01000403", "01000500", "01000501", "01000502", "…
$ ls         <dbl> 0.24915542, 0.18880942, 1.02384298, 0.35507251, 1.06922572, 1.55937782, 0.33665277, 2.07489647, 2.10448543, 0.99477404, 0.53212624, 1.19790121, 1.19779816, 0…
$ p          <dbl> 0.000031520160, 0.000047039131, 0.000015564463, 0.000047224133, 0.000012616556, 0.000013107874, 0.000067313572, 0.000004343007, 0.000002068387, 0.00001560995…

Option 2: Alternatively, using the same data, we can specify county as the unit, which returns a measure for every county. I get the same results whether I use county-level data or tracts within counties, which makes me doubt that the tract-level information is being used in the calculations.


county_H <-
  segregation_prepared %>%
  mutual_local(
    group = "race",         # characteristic
    unit = "county_fips",   # spatial unit
    weight = "n",
    wide = TRUE
  ) %>%
  glimpse()

Rows: 3,143
Columns: 3
$ county_fips <chr> "01001", "01003", "01005", "01007", "01009", "01011", "01013", "01015", "01017", "01019", "01021", "01023", "01025", "01027", "01029", "01031", "01033", "01…
$ ls          <dbl> 0.16215951, 0.15066213, 0.42666977, 0.21618153, 0.20360441, 0.88451879, 0.42987315, 0.15729486, 0.34755966, 0.28274865, 0.11379308, 0.42311843, 0.45856949, …
$ p           <dbl> 0.00017662875, 0.00068884879, 0.00007660615, 0.00006797170, 0.00017858492, 0.00003149893, 0.00005817263, 0.00035309676, 0.00010564546, 0.00007574483, 0.0001…

My question is: which option gives me what I want (an H score for every county)? I'm not confident that option 2 is using the tract-level data, but option 1 returns a score for every tract, which is too granular. Should we use option 1 and somehow average over tracts grouped by county (this seems wrong)? Or use option 2 and assume the function is using the tract data in the county-level calculations (this seems dubious, because I get the same scores using tract-level or county-level data)?

Also, I realize mutual_local calculates M, not H. Am I correct in understanding that H is just a normalized version of M? What is the best way to explain to my research team why we would be using M and not H?
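For intuition on the M versus H question, here is a minimal base-R sketch. It assumes the usual definition in which H is M divided by the entropy of the overall race distribution. The helper name m_and_h and the toy counts are made up for illustration and are not part of the segregation package.

```r
# Hypothetical helper (not part of the segregation package):
# rows of `counts` are units (e.g. tracts), columns are groups (e.g. races).
m_and_h <- function(counts) {
  p_joint  <- counts / sum(counts)       # joint proportions p(unit, group)
  p_unit   <- rowSums(p_joint)           # unit marginals
  p_group  <- colSums(p_joint)           # group marginals
  expected <- outer(p_unit, p_group)     # joint proportions under independence
  nz <- p_joint > 0                      # skip empty cells (0 * log 0 := 0)
  M <- sum(p_joint[nz] * log(p_joint[nz] / expected[nz]))
  group_entropy <- -sum(p_group[p_group > 0] * log(p_group[p_group > 0]))
  c(M = M, H = M / group_entropy)        # assumed normalization: H = M / E(group)
}

# Made-up two-tract, two-group examples
even      <- matrix(c(50, 50, 50, 50), nrow = 2, byrow = TRUE)
mixed     <- matrix(c(90, 10, 10, 90), nrow = 2, byrow = TRUE)
separated <- matrix(c(100, 0, 0, 100), nrow = 2, byrow = TRUE)

print(m_and_h(even))
print(m_and_h(mixed))
print(m_and_h(separated))
```

Under this definition H runs from 0 (every tract mirrors the overall composition) to 1 (complete separation), whereas the maximum of M depends on how many groups there are and how large they are. That bounded scale is one reason H can be easier to compare across counties.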

Thanks for your help. Best, Kasey

elbersb commented 11 months ago

mutual_local computes local segregation scores (see here or this paper for reference). If you simply want an H index for each county, you need counts by race and tract (or another spatial unit), and then use mutual_total, which will give you the H index.
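To illustrate what a local segregation score is, here is a small base-R sketch with made-up counts. It assumes the standard decomposition in which each unit's local score compares that unit's racial composition with the overall composition, and the weighted average of the local scores (weights equal to the unit's share of the population, the p column) recovers M. The variable names are illustrative only.

```r
# Made-up counts: rows are units, columns are racial groups.
# All cells are positive here, so no log(0) handling is needed.
counts <- matrix(c(90, 10,
                   40, 60,
                   20, 80),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(unit = c("u1", "u2", "u3"),
                                 race = c("a", "b")))

p_unit  <- rowSums(counts) / sum(counts)   # share of the population in each unit
p_group <- colSums(counts) / sum(counts)   # overall racial composition
p_group_given_unit <- counts / rowSums(counts)  # composition within each unit

# Local score per unit: divergence of the unit's composition from the overall one
ratio <- sweep(p_group_given_unit, 2, p_group, "/")
local_score <- rowSums(p_group_given_unit * log(ratio))

# Weighted average of local scores ...
M_from_local <- sum(p_unit * local_score)

# ... compared with M computed directly from the joint distribution
p_joint  <- counts / sum(counts)
M_direct <- sum(p_joint * log(p_joint / outer(p_unit, p_group)))

print(local_score)
print(c(M_from_local = M_from_local, M_direct = M_direct))
```

Read this way, setting the unit to county yields one local score per county relative to the national composition. That calculation only needs county totals by race, which would explain why tract-level and county-level input gave identical results in Option 2.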

kaseyzapatka commented 11 months ago

Hi @elbersb,

Thanks for responding! I've used this package a few times and it's just so great and comprehensive. Thanks for making and maintaining it.

I had seen the documentation you mentioned, but couldn't figure out how to use mutual_total to return H and M scores for every county across the country. I realized that I wanted the group_modify workflow you mention in the documentation, but I kept getting an error saying that my "group variable is constant". The answer was to filter out counties that contain only one census tract. When I merge the results back with the national county-level dataset, those counties will just be missing.

I'm including my code in case it is helpful for others trying to calculate H or M for counties using tract data across the country.

data %>%
  # group by county
  group_by(county_fips) %>%
  # drop counties with only one tract (no variation between tracts to measure)
  filter(n_distinct(tract_fips) > 1) %>%
  # compute M and H separately for each county
  group_modify(~ mutual_total(data = .x,
                              group = "race",
                              unit = "tract_fips",
                              weight = "n")) %>%
  glimpse()
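For anyone who wants to try the workflow end to end, here is a small sketch on made-up data. It assumes dplyr and segregation are installed and that mutual_total returns a long table with a stat column ("M", "H") and an est column. The toy data frame and the object names are invented for illustration. County "C" has a single tract, so it is dropped before the calculation and comes back as NA after the merge.

```r
library(dplyr)
library(segregation)

# Made-up long-format data: one row per tract and race, with counts in `n`
toy <- data.frame(
  county_fips = c(rep("A", 4), rep("B", 4), rep("C", 2)),
  tract_fips  = c("A1", "A1", "A2", "A2", "B1", "B1", "B2", "B2", "C1", "C1"),
  race        = rep(c("black", "white"), times = 5),
  n           = c(90, 10, 10, 90, 50, 50, 50, 50, 30, 70)
)

# M and H per county, in long format (one row per county and statistic)
county_long <- toy %>%
  group_by(county_fips) %>%
  filter(n_distinct(tract_fips) > 1) %>%   # single-tract counties are dropped
  group_modify(~ mutual_total(data = .x,
                              group = "race",
                              unit = "tract_fips",
                              weight = "n"))

# One row per county with M and H as columns
county_wide <- county_long %>%
  group_by(county_fips) %>%
  summarise(M = est[stat == "M"], H = est[stat == "H"], .groups = "drop")

# Merge back onto the full county list; dropped counties get NA
all_counties <- distinct(toy, county_fips) %>%
  left_join(county_wide, by = "county_fips")

print(all_counties)
```

Whether NA or 0 is the right value for a single-tract county is a substantive choice for the analysis. The sketch only mirrors the "will just be missing" behavior described above.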

Thanks!