Implementation problems

kaseyzapatka commented 3 years ago

Hi Arthur, thanks for writing this package. It's a great implementation of Elizabeth Roberto's work.

I'm trying to use the Divergence Index (D) in my dissertation and have a few questions before I feel comfortable interpreting the estimates I'm getting. I have data on every census tract in the country for 2000, 2009-2013, and 2015-2019, which I group by CBSA (and potentially by counties within CBSAs). If I understand correctly, D should return a value for each census tract in my dataset grouped by CBSA.

So, taking the New York metro area as an example, I should have an estimate for each census tract (local unit) that reports "how surprising the composition of a local environment is given the overall population composition of the city (metro area in my case)." Scores near zero mean the tract's composition matches the metro area's, whereas higher scores mean greater divergence from the metro area values. I am also considering grouping tracts by county, in which case each tract would report divergence from its county's composition. My questions are as follows:

1. I don't understand how the summed argument works. I'm using the tidyverse approach and I get the same results whether summed is set to TRUE or FALSE. I see the same issue for bay_divergence in the vignette on the GitHub page. The base R examples do return different results depending on whether summed is TRUE or FALSE (so it seems the tidyverse example does not work), but I also don't think the two approaches return the same tract-level estimates. It's hard to tell because you lose the tract IDs when you run divergence in base R.
2. Ideally, I'd like to get a score for each tract and a summary score for each CBSA. I assume groupCol is meant to produce both a divergence score for each unit and a divergence score for the entire CBSA. Is there a vignette for the decompose_divergence function? Even after specifying groupCol, I still can't get any results. I'm using this page for guidance.

arthurgailes commented 3 years ago

These seem like two different issues. For the first point, summed = TRUE will return one divergence score for the entire dataset (or one per group if using group_by). So for tract-level results, it should be set to FALSE. But I'm confused by what you mean by "the tidyverse example does not work." Are you saying that you can't reproduce the output that code gives on your machine? A reproducible example would help.

On the second, if you're just trying to get Di for each tract, decompose_divergence isn't applicable; you should use divergence.
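
To make the tract-level versus summed distinction concrete, here is a minimal base R sketch of the divergence index formula on toy data. It is not the rsegregation implementation: the data frame, the column names, and the use of the natural log are all assumptions made for illustration. The local score for a tract is the sum over groups of p * log(p / q), where p is the tract's share and q is the region's share, and the summed score is the population-weighted mean of the local scores.

```r
# Toy data: four tracts, three population groups (made-up counts, not bay_race).
tracts <- data.frame(
  tract = c("A", "B", "C", "D"),
  white = c(80, 50, 20, 10),
  black = c(10, 30, 30, 60),
  asian = c(10, 20, 50, 30),
  stringsAsFactors = FALSE
)
groups <- c("white", "black", "asian")

counts <- as.matrix(tracts[groups])
tract_pop <- rowSums(counts)                   # population of each tract
local_share <- counts / tract_pop              # composition of each tract
region_share <- colSums(counts) / sum(counts)  # composition of the whole region

# Local divergence: sum over groups of p * log(p / q), with 0 * log(0) taken as 0.
# Natural log is an assumption here; another base only rescales the values.
ratio <- sweep(local_share, 2, region_share, "/")
terms <- ifelse(local_share > 0, local_share * log(ratio), 0)
tracts$divergence <- rowSums(terms)            # one score per tract (summed = FALSE idea)

# Summed divergence: population-weighted mean of the local scores (summed = TRUE idea).
overall <- sum(tract_pop / sum(tract_pop) * tracts$divergence)

print(tracts[c("tract", "divergence")])
print(overall)
```

Because the tract identifier stays in the same data frame as the scores, nothing is lost when moving between the per-tract and summed values.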

kaseyzapatka commented 3 years ago

Hi Arthur,

Sorry for the confusion in my post. Yes, there are two separate issues: (1) understanding the purpose of the summed argument and (2) calculating scores for each census tract.

On the first point, as you say, when summed = T it will return one divergence score for the entire dataset (or one per group if using group_by). I was playing with the example you provided on the GitHub page, and it returns the same results whether or not summed = T for the tidyverse example, but returns different results for the base R example. Is this a bug or am I missing something here? Below is the code I was using.


#1 tidyverse example from github : summed = T
bay_divergence_summed <- bay_race %>% 
  summarize(bay_divergence = divergence(white, black, asian, hispanic, all_other),
    population = total_pop, summed = T) %>% 
  glimpse()

# 2 tidyverse example from github : no summed = T
bay_divergence_nosummed <- bay_race %>% 
  mutate(bay_divergence = divergence(white, black, asian, hispanic, all_other),
    population = total_pop) %>% 
  glimpse()

#3 base R example from github : summed = T
bay_divergence_summmed_base <- divergence(bay_race[c('white','black','asian', 'hispanic', 'all_other')], 
  population=bay_race$total_pop, summed = TRUE) %>% 
  glimpse()

#4 base R example from github : no summed = T
bay_divergence_nosummmed_base <- divergence(bay_race[c('white','black','asian', 'hispanic', 'all_other')], 
  population=bay_race$total_pop) %>% 
  glimpse()

On the second point, I totally missed needing to use divergence. That solves my original question, but now I'm stuck wondering how the decompose_divergence function differs from simply using divergence and toggling the summed argument on and off. Here's a link to the documentation I was looking at. I couldn't find an example using the bay_race data illustrating how to use this function.

Thanks so much for your help on this Arthur and sorry again for the confusion.

Best, Kasey

arthurgailes commented 3 years ago

That's a parenthesis typo in the readme, fixed. The code should be:

library(dplyr)
library(rsegregation)
bay_divergence_summed <- bay_race %>% 
  summarize(bay_divergence = divergence(white, black, asian, hispanic, all_other,
            population = total_pop, summed = T)) %>% 
  glimpse()
#> Rows: 1
#> Columns: 1
#> $ bay_divergence <dbl> 0.2575266
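
For anyone hitting the same symptom, the effect of the misplaced parenthesis can be reproduced without dplyr. The sketch below uses a hypothetical stand-in function, toy_index, that has a summed switch; it is not part of rsegregation. When the call is closed too early, summed = TRUE is handed to the outer call as a separate named item (in summarize or mutate it would become a new column) and the inner function silently runs with its default.

```r
# Hypothetical stand-in for a function with a `summed` switch.
toy_index <- function(x, summed = FALSE) {
  if (summed) mean(x) else x
}

x <- c(0.1, 0.2, 0.6)

# Parenthesis closed too early: `summed` is a separate list element,
# so toy_index() runs with its default summed = FALSE.
misplaced <- list(value = toy_index(x), summed = TRUE)

# Parenthesis closed after the arguments: `summed` reaches toy_index().
fixed <- list(value = toy_index(x, summed = TRUE))

length(misplaced$value)  # 3: one value per element
length(fixed$value)      # 1: a single summary value
```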

kaseyzapatka commented 3 years ago

Thanks, @arthurgailes. It works now!

Is there an example or vignette for the decompose_divergence function? I see the function documentation, which discusses the necessary data structure.

Thanks, Best, Kasey
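
As a starting point on the decomposition question, the divergence index can be split additively into a between-group part (each county compared with the region) and a within-group part (each tract compared with its own county). The base R sketch below shows that identity on toy data. It is only an illustration of the math under stated assumptions: the counts, the county column, and the helper kl are made up, the natural log is assumed, and the output of decompose_divergence may be organized differently.

```r
# Toy data: four tracts in two counties (made-up counts).
tracts <- data.frame(
  county = c("X", "X", "Y", "Y"),
  white  = c(80, 50, 20, 10),
  black  = c(10, 30, 30, 60),
  asian  = c(10, 20, 50, 30),
  stringsAsFactors = FALSE
)
groups <- c("white", "black", "asian")

# Divergence of composition p from reference q; 0 * log(0) is taken as 0.
kl <- function(p, q) sum(ifelse(p > 0, p * log(p / q), 0))

counts <- as.matrix(tracts[groups])
tract_pop <- rowSums(counts)
total_pop <- sum(counts)
region_share <- colSums(counts) / total_pop

# County-level counts and compositions (rows named by county).
county_counts <- rowsum(counts, tracts$county)
county_pop <- rowSums(county_counts)
county_share <- county_counts / county_pop

# Between: each county compared with the region, weighted by county population.
between_part <- sum(sapply(rownames(county_counts), function(g) {
  county_pop[g] / total_pop * kl(county_share[g, ], region_share)
}))

# Within: each tract compared with its own county, weighted by tract population.
within_part <- sum(sapply(seq_len(nrow(counts)), function(i) {
  g <- tracts$county[i]
  tract_pop[i] / total_pop * kl(counts[i, ] / tract_pop[i], county_share[g, ])
}))

# Overall: each tract compared with the region.
overall <- sum(sapply(seq_len(nrow(counts)), function(i) {
  tract_pop[i] / total_pop * kl(counts[i, ] / tract_pop[i], region_share)
}))

# The two parts add up to the overall index.
all.equal(overall, between_part + within_part)
```

This differs from toggling summed on and off, which only switches between the per-tract scores and their weighted mean against a single reference composition.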
