AntoineSoetewey / statsandr

A blog on statistics and R aiming at helping academics and professionals working with data to grasp important concepts in statistics and to apply them in R. See www.statsandr.com

blog/chi-square-test-of-independence-in-r/ #46

Closed utterances-bot closed 3 years ago

utterances-bot commented 3 years ago

Chi-square test of independence in R - Stats and R

Learn when and how to use the Chi-square test of independence in R. See also how it works in practice and how to interpret the results of the Chi-square test

https://statsandr.com/blog/chi-square-test-of-independence-in-r/

AntoineSoetewey commented 3 years ago

Comment written by Herivelto Cordeiro dos Santos on November 30, 2020 11:55:35:

Hi Antoine! I really liked your blog; I think it is clear and focused! Let me ask you about this one... I have tried the scripts for method #1 and method #2 and I got different p-values. Do you know what could be the reason for that, as they should be equal? My table is matrix(c(63, 78, 94, 65), ncol = 2).

AntoineSoetewey commented 3 years ago

Comment written by Antoine Soetewey on November 30, 2020 12:20:12:

Thank you for your feedback!

I just tried on my side, and I have the same p-values with your data, see my code here.

One potential reason for different p-values is that the first method applies Yates's continuity correction by default. Add the argument correct = FALSE in the chisq.test() function to disable this continuity correction.
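As a rough illustration of the point above, the sketch below runs chisq.test() on the table from the question with and without the correction. It assumes that the second method is summary() applied to a table, which does not apply the correction; the object names (tab, with_correction, without_correction) are only illustrative.

```r
# Table from the question
tab <- matrix(c(63, 78, 94, 65), ncol = 2)

# Default behaviour: Yates's continuity correction is applied to 2x2 tables
with_correction <- chisq.test(tab)

# Same test with the continuity correction switched off
without_correction <- chisq.test(tab, correct = FALSE)

# Assumed "second method": summary() of a table, which uses no correction
summary_p <- summary(as.table(tab))$p.value

with_correction$p.value
without_correction$p.value
summary_p
```

If the two approaches disagree, comparing the uncorrected p-value with the one from summary() should show whether the correction explains the gap.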

(I've added a note at the end of this section following your comment.)

Hope this helps.

Regards,
Antoine

Clarice3 commented 3 years ago

Hi Antoine! Thank you so much for your blog, it's very helpful! I have a question for you: I have a data frame with several (10) different categorical variables that I would like to test for possible correlations between each other. Is there a way that I can test them all at the same time, like you explained for the quantitative variables? Or is it really only possible for two variables at the same time? Not sure how I would do this for 10 variables... Thanks in advance! Regards, Clarice

AntoineSoetewey commented 3 years ago

Dear Clarice,

Do you want to compute correlation coefficients or perform chi-square tests? You mentioned correlations but you posted the comment on the article about chi-square test, so I'm not sure.

For correlation, if your categorical variables are ordinal, you can simply use cor(dat, method = "spearman"), where dat is the name of your dataframe. See more details in this article about correlation coefficient in R.
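A minimal sketch of the Spearman approach, under the assumption that the ordinal variables are stored as ordered factors: cor() expects numeric input, so the factors are first replaced by their integer level codes. The data frame, variable names and levels below are made up for illustration.

```r
# Hypothetical data frame with two ordered factors (names and levels are illustrative)
dat <- data.frame(
  education = factor(c("low", "medium", "high", "medium", "low", "high"),
                     levels = c("low", "medium", "high"), ordered = TRUE),
  satisfaction = factor(c("poor", "good", "excellent", "good", "poor", "good"),
                        levels = c("poor", "good", "excellent"), ordered = TRUE)
)

# cor() needs numeric input, so replace each ordered factor by its integer level codes
dat_num <- data.frame(lapply(dat, as.integer))

# Spearman correlation matrix between all ordinal variables
cor_mat <- cor(dat_num, method = "spearman")
cor_mat
```

The conversion only makes sense when the factor levels are declared in their natural order, which is why levels is set explicitly here.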

The standard Chi-square test of independence (with the chisq.test() function, as presented in this article) can only be applied to two categorical variables at a time, so you'd need to tweak your code a bit to run it for all possible pairs of variables. Or, if your dataset contains a relatively small number of variables, you can copy and paste your code for each pair of variables.
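One possible way to run the test for all pairs is sketched below with combn(), which enumerates every pair of column names. The simulated data frame and its column names are assumptions made for the example, not data from the article.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Hypothetical data frame of categorical variables
dat <- data.frame(
  sex   = sample(c("male", "female"), 200, replace = TRUE),
  smoke = sample(c("smoker", "non smoker"), 200, replace = TRUE),
  sport = sample(c("athlete", "non athlete"), 200, replace = TRUE),
  diet  = sample(c("vegetarian", "omnivore"), 200, replace = TRUE)
)

# All unordered pairs of column names (one pair per column of the matrix)
pairs <- combn(names(dat), 2)

# Run one test per pair and collect the p-values in a data frame
results <- data.frame(
  var1 = pairs[1, ],
  var2 = pairs[2, ],
  p_value = apply(pairs, 2, function(p) {
    chisq.test(table(dat[[p[1]]], dat[[p[2]]]))$p.value
  })
)
results
```

With 10 variables this produces 45 tests, so adjusting the p-values for multiple comparisons (for instance with p.adjust()) may be worth considering.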

Hope this helps.

Regards, Antoine

Clarice3 commented 3 years ago

Hey Antoine! Thank you so much for your quick answer. My goal is to find out whether the different variables correlate with each other or not, so I can exclude them before computing a model. From your blog I learned that it’s not possible to compute correlation coefficients between two categorical variables (if I understood that correctly?) but only to do a contingency analysis.

Some of my categorical variables are ordinal with 3-4 levels, most of them are nominal though. I tried to use the corr() function that you suggested, but unfortunately I just can’t make it work with my R version… not sure why. So I’ll have to find another way I guess!

Best wishes, Hanna

AntoineSoetewey commented 3 years ago

You understood correctly:

You can compute the correlation between your ordinal variables (with the cor() function, spelled with one r and not two as you wrote in your comment). For your nominal variables, however, you cannot compute a correlation. You'll need to apply the Chi-square test of independence (with the chisq.test() function).

Regards, Antoine

Clarice3 commented 3 years ago

Alright, I get it. Sorry for the confusion and thanks for your help!

Best wishes, Hanna

venkatpgi commented 3 years ago

Hello Mr Antoine, It was a nice article about the Chi-squared test. I am trying to replicate your earlier blog post on running t.test and ANOVA on multiple columns at the same time, but for the Chi-squared test, and I am unable to. I tried looping with comparison1 <- lapply(df[, 1:4], function(x) t.test(x ~ df$var)). This worked for t.test but not for the Chi-squared test. Any suggestions?

AntoineSoetewey commented 3 years ago

Hello,

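Here is a reproducible example using a for loop:

set.seed(42) # fix the random seed so the simulated data are the same on every run

# simulate three categorical variables
df <- data.frame(sex = sample(c("male", "female"), size = 100, replace = TRUE),
                 smoke = sample(c("smoker", "non smoker"), size = 100, replace = TRUE),
                 sport = sample(c("athlete", "non athlete"), size = 100, replace = TRUE))

# test the first variable (sex) against each of the other variables
for (i in 2:ncol(df)) {
  print(names(df)[i])
  print(chisq.test(table(df[, 1], df[, i])))
}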
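For readers who prefer the lapply() style from the question, the same comparison can probably be written as below. One likely reason the t.test pattern does not carry over is that chisq.test() takes a table (or two factors) rather than a formula. The simulated data and the names tests and p_values are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Simulated categorical variables (illustrative)
df <- data.frame(sex = sample(c("male", "female"), size = 100, replace = TRUE),
                 smoke = sample(c("smoker", "non smoker"), size = 100, replace = TRUE),
                 sport = sample(c("athlete", "non athlete"), size = 100, replace = TRUE))

# Test sex against every other column; the result is a named list of test objects
tests <- lapply(df[, -1], function(x) chisq.test(table(df$sex, x)))

# Extract the p-value of each test into a named vector
p_values <- sapply(tests, function(res) res$p.value)
p_values
```

Each element of tests is a full test object, so tests$smoke prints the complete output for that variable.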

Hope this helps.

Regards, Antoine