elbersb / segregation

R package to calculate entropy-based segregation indices, with a focus on the Mutual Information Index (M) and Theil’s Information Index (H)
https://elbersb.com/segregation

Negative local segregation values in the decomposition into racial groups #9

Closed kaisarea closed 2 years ago

kaisarea commented 3 years ago

Hello, I have the following 'dataset' called local_data (trying to create a reproducible example here):

# A tibble: 14 x 3
   SCHOOLID   group   count
   <chr>      <chr>   <dbl>
 1 100005_870 WHITE     669
 2 100005_870 BLACK      12
 3 100005_870 HISP       80
 4 100005_870 AIAN        0
 5 100005_870 ASIAN       2
 6 100005_870 PACIFIC    16
 7 100005_870 TR         25
 8 100005_871 WHITE     703
 9 100005_871 BLACK      12
10 100005_871 HISP       47
11 100005_871 AIAN        0
12 100005_871 ASIAN       2
13 100005_871 PACIFIC     0
14 100005_871 TR         27

Then I run:

mutual_local(local_data, "SCHOOLID", "group", weight = "count", wide = TRUE)

and I get the following output:

     group         ls           p
1:   ASIAN  5.2951875 0.002507837
2:   BLACK  3.5034280 0.015047022
3:    HISP  1.8714444 0.079623824
4: PACIFIC  4.6020403 0.010031348
5:      TR  2.7309779 0.032601881
6:   WHITE -0.5422359 0.860188088

My question is: how does one interpret negative values from the mutual_local() function? In another run, all of the components were negative (I can create a reproducible example for that too if needed). How should zero, positive, and negative values be interpreted here?

elbersb commented 3 years ago

Hi, thanks for the issue. The local segregation scores can't be negative, so you found a bug. The problem is that your variable is named "group", and the package doesn't deal well with that. If you use "race", for instance, the problem goes away:

library(tibble)
library(segregation)
options(scipen=5)

local_data = tribble(~SCHOOLID, ~race, ~count,
"100005_870", "WHITE",     669,
"100005_870", "BLACK",      12,
"100005_870", "HISP",       80,
"100005_870", "AIAN",        0,
"100005_870", "ASIAN",       2,
"100005_870", "PACIFIC",    16,
"100005_870", "TR",         25,
"100005_871", "WHITE",     703,
"100005_871", "BLACK",      12,
"100005_871", "HISP",       47,
"100005_871", "AIAN",        0,
"100005_871", "ASIAN",       2,
"100005_871", "PACIFIC",     0,
"100005_871", "TR",         27)

(mutual_local(local_data, "SCHOOLID", "race", weight = "count", wide = TRUE))
#>       race            ls           p
#> 1:   ASIAN 0.00003321619 0.002507837
#> 2:   BLACK 0.00003321619 0.015047022
#> 3:    HISP 0.03206493691 0.079623824
#> 4: PACIFIC 0.68502974604 0.010031348
#> 5:      TR 0.00108653019 0.032601881
#> 6:   WHITE 0.00054228911 0.860188088

Created on 2021-10-24 by the reprex package (v2.0.1)

I'll try to fix that issue soon.
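As a sanity check on the corrected output, the local scores should satisfy the decomposition the package is built around: every ls is nonnegative, and the p-weighted sum of the local scores recovers the total M. A sketch with the same data as above (mutual_total() computes the total index):

```r
library(tibble)
library(segregation)

local_data <- tribble(~SCHOOLID, ~race, ~count,
  "100005_870", "WHITE", 669, "100005_870", "BLACK",  12,
  "100005_870", "HISP",   80, "100005_870", "ASIAN",   2,
  "100005_870", "PACIFIC", 16, "100005_870", "TR",    25,
  "100005_871", "WHITE", 703, "100005_871", "BLACK",  12,
  "100005_871", "HISP",   47, "100005_871", "ASIAN",   2,
  "100005_871", "TR",     27)

total <- mutual_total(local_data, "SCHOOLID", "race", weight = "count")
local <- mutual_local(local_data, "SCHOOLID", "race", weight = "count", wide = TRUE)

# no local score should be negative, and the weighted sum should equal M
stopifnot(all(local$ls >= 0))
all.equal(sum(local$p * local$ls), total$est[total$stat == "M"])
```

If the weighted sum does not match M, that is a sign something is off (as with the "group" column-name bug above).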

kaisarea commented 3 years ago

It's working now, thank you!

estedeahora commented 2 years ago

Hello, I have the same problem, but I can't resolve it with the name changes. In my case, the problem arises when I use the "se" argument and the function applies the bias correction. Here is the code:

library(tidyverse)
library(segregation)
base  <- tribble(~ID_s, ~PRI, ~SEC, ~SUP,
                 1,     4,     4,     6,
                 2,    27,    34,    36,
                 3,     9,    15,    15,
                 4,    21,    33,    38,
                 5,    15,    23,    19,
                 6,     6,     8,     6,
                 7,     7,    14,    18,
                 8,     6,     8,    12,
                 9,    23,    34,    45,
                 10,    9,    16,    19
                 )
base |> 
  pivot_longer(cols = PRI:SUP, names_to = "EDU", 
               values_to = "n") |> 
  mutual_local(group = "EDU", unit = "ID_s",
               weight = "n", se = T, 
               wide = T ) |>
   select(ID_s, p, ls)

And this is my output:

      ID_s          p           ls
 1:      1 0.02539623 -0.072997966
 2:      2 0.18315094 -0.002555072
 3:      3 0.07269811 -0.024815143
 4:      4 0.17362264 -0.010504312
 5:      5 0.10986792 -0.004451141
 6:      6 0.03701887 -0.019953732
 7:      7 0.07281132 -0.010572342
 8:      8 0.04958491 -0.036701383
 9:      9 0.19315094 -0.004720493
10:     10 0.08269811 -0.017325337

The problem disappears when I set "se = F".

Thank you!

elbersb commented 2 years ago

Hi, yes, that can happen when your sample is small. It means that your ls scores are most likely exactly zero: the bias correction subtracts an estimate of the upward small-sample bias from the bootstrap estimates, and when the true parameter is close to 0 the corrected value can dip slightly below zero. I could set these values to 0 automatically when this occurs, but I think reporting them as-is is more transparent. Maybe it would be good to have a FAQ entry about this, though.
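For reference, the manual fix is a one-liner with pmax(). A sketch with made-up numbers mirroring the output above (res stands in for the table returned by mutual_local() with se = TRUE, wide = TRUE):

```r
# hypothetical bias-corrected result, with slightly negative local scores
res <- data.frame(ID_s = 1:3,
                  p  = c(0.025, 0.183, 0.073),
                  ls = c(-0.073, -0.0026, -0.0248))

# clamp slightly negative scores to exactly zero
res$ls <- pmax(res$ls, 0)
res$ls
#> [1] 0 0 0
```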

estedeahora commented 2 years ago

Perfect. I did this manually but wasn't sure whether it was correct. Thank you for your response and for your work on this package!