hw02: testing relationship between race and mental illness #169

Closed: sizhenf closed this issue 3 years ago

sizhenf commented 3 years ago

Hi everyone,

I'm posting this question because I'm not sure what would be the best way to test the relationship between mental illness and race in the mass shooting dataset.

Initially, I thought of using linear regression:

lm(prior_mental_illness ~ race, data = mass_shootings, na.action = na.omit)

But I got this error message:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion

I'm confused about why I still get the NA/NaN/Inf in 'y' error even with na.action = na.omit included, but I thought the regression might be failing because I'm regressing one categorical variable on another. After some googling, I learned that the chi-squared test is used to test the relationship between two categorical variables, so I tried:

chisq.test(table(mass_shootings$race, mass_shootings$prior_mental_illness))

And I got:

    Pearson's Chi-squared test

data:  table(mass_shootings$race, mass_shootings$prior_mental_illness)
X-squared = NaN, df = 5, p-value = NA

Warning message:
In chisq.test(table(mass_shootings$race, mass_shootings$prior_mental_illness)) :
  Chi-squared approximation may be incorrect

Now I'm very confused. Any insights into why my tests aren't working, or which test would be suitable, would be greatly appreciated!

Thanks, Serena

bensoltoff commented 3 years ago

First of all, I don't expect anyone to use statistical tests for that question. You are fine answering with a combination of tables and figures generated by ggplot2.
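A minimal sketch of that descriptive approach, assuming a small made-up data frame (`toy`) that only borrows the column names from mass_shootings. The values are invented for illustration, and the plot is just one of several reasonable choices:

```r
# Toy stand-in for mass_shootings; values are made up for illustration only.
toy <- data.frame(
  race = c(rep("White", 6), rep("Black", 4), rep("Asian", 2)),
  prior_mental_illness = c("Yes", "Yes", "Yes", "No", "No", NA,
                           "Yes", "No", "No", NA,
                           "Yes", "No"),
  stringsAsFactors = FALSE
)

# Keep only rows where both variables are observed.
toy_complete <- toy[complete.cases(toy), ]

# Counts, then row proportions: the share of each race with and
# without a reported prior mental illness.
counts <- table(toy_complete$race, toy_complete$prior_mental_illness)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))

# The same comparison as a proportional bar chart
# (drawn only if ggplot2 is installed).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy_complete, aes(x = race, fill = prior_mental_illness)) +
    geom_bar(position = "fill") +
    labs(x = "Race", y = "Proportion", fill = "Prior mental illness")
  print(p)
}
```

Row proportions are used here because the groups differ in size, so raw counts alone would make the larger groups look more affected.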

As to the error with lm(), I believe it is because the outcome of interest is a categorical variable. What you really want is logistic regression, which uses the glm() function.
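A rough sketch of what that could look like, assuming the outcome is stored as "Yes"/"No" text (which would also explain the "NAs introduced by coercion" warning, since lm() tries to convert the response to numbers before na.action can help). The data frame `toy` and the helper column `mental_illness` are hypothetical, with made-up values:

```r
# Toy stand-in for mass_shootings; values are made up for illustration only.
toy <- data.frame(
  race = c("White", "White", "Black", "Black", "Asian",
           "Asian", "White", "Black", NA, "Asian"),
  prior_mental_illness = c("Yes", "No", "Yes", "No", "Yes",
                           "No", "Yes", "No", "Yes", NA),
  stringsAsFactors = FALSE
)

# Keep complete rows and recode the outcome as 0/1 so that glm()
# receives a numeric response instead of text.
toy_complete <- toy[complete.cases(toy), ]
toy_complete$mental_illness <- as.integer(toy_complete$prior_mental_illness == "Yes")

# Logistic regression: binomial family with its default logit link.
fit <- glm(mental_illness ~ race, data = toy_complete, family = binomial)
summary(fit)
```

The coefficients are on the log-odds scale and are relative to the reference category of race, which R picks alphabetically unless the factor levels are set explicitly.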

Finally, for the chi-squared test, I'm pretty sure the issue is that you are using the original columns from the data frame, which contain missing values. You have to remove those observations first, then use that version of the data frame to conduct the test.
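One way this can play out, sketched with a made-up data frame (`toy`, not the real data): if every row for some race has a missing outcome, table() still keeps that race as an all-zero row, its expected counts are zero, and the statistic comes out as NaN. Dropping incomplete rows before tabulating removes the empty row. This is an illustration under that assumption, not a diagnosis of the actual dataset:

```r
# Toy stand-in; values are made up. Every "Other" row has a missing outcome.
toy <- data.frame(
  race = c(rep("White", 12), rep("Black", 12), rep("Other", 2)),
  prior_mental_illness = c(rep("Yes", 8), rep("No", 4),
                           rep("Yes", 4), rep("No", 8),
                           NA, NA),
  stringsAsFactors = FALSE
)

# Tabulating the original columns leaves an all-zero "Other" row,
# so the test statistic is NaN (R also warns about the approximation).
raw_table <- table(toy$race, toy$prior_mental_illness)
raw_test <- chisq.test(raw_table)

# Remove incomplete observations first, then tabulate and test.
toy_complete <- toy[complete.cases(toy), ]
clean_table <- table(toy_complete$race, toy_complete$prior_mental_illness)
clean_test <- chisq.test(clean_table)

print(raw_table)
print(clean_table)
print(clean_test)
```

With real data, some cells may still be small after cleaning, in which case R can keep warning that the chi-squared approximation may be incorrect even though the statistic itself is now defined.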

sizhenf commented 3 years ago

Got it. Thank you Professor Soltoff!