Lab 2 Q6 & Q7 - Githubissues

AprilPeck commented 4 years ago

On these ones, when creating the tables is there a way (or should we even do this) to group the values? E.g. on Q6 have the table by age range instead of each individual age? It would seem to make more sense that way, esp. on Q7 where you appear to have one income listed per person. If so, how would we go about doing that?

Age	Female	Male
22	1	0
24	1	0
26	1	0
27	1	0
29	0.22	0.78
30	0.71	0.29
31	0.33	0.67
32	0.67	0.33
33	0.22	0.78
34	0.17	0.83
35	0.5	0.5
36	0.73	0.27
37	0.58	0.42
38	0.38	0.62
39	0.67	0.33
40	0.64	0.36
41	0.78	0.22
42	0.47	0.53
43	0.89	0.11
44	0.77	0.23
45	0.33	0.67
46	0.8	0.2
47	0.67	0.33
48	0.71	0.29
49	0.64	0.36
50	0.5	0.5
51	0.61	0.39
52	0.65	0.35
53	0.5	0.5
54	0.71	0.29
55	0.83	0.17
56	0.76	0.24
57	0.48	0.52
58	0.64	0.36
59	0.42	0.58
60	0.52	0.48
61	0.5	0.5
62	0.41	0.59
63	0.38	0.62
64	0.43	0.57
65	0.41	0.59
66	0.09	0.91
67	0.43	0.57
68	0.57	0.43
69	0	1
70	0.38	0.62
71	0.33	0.67
72	0.7	0.3
73	0.5	0.5
74	0.33	0.67
75	0.8	0.2
76	0	1
77	0	1
78	0.5	0.5
79	0.33	0.67
80	0.33	0.67
81	0	1
82	0	1
85	1	0

MeghanPaquette commented 4 years ago

I was wondering the same thing, April. I don't want to include excess code, so a way to group would be a good idea.
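For reference, one possible way to group ages into ranges is base R's cut(). The sketch below is only an illustration: the data frame `toy`, its values, and the break points are made up, since the lab's `dat` is not reproduced in this thread. It builds the same kind of table as above (share of each gender per row), but by age range instead of individual age.

```r
# Hypothetical toy data standing in for the lab's `dat`
toy <- data.frame(
  age    = c(22, 29, 35, 41, 48, 52, 57, 63, 69, 85),
  gender = c("Female", "Male", "Female", "Female", "Male",
             "Female", "Male", "Male", "Male", "Female")
)

# Bin ages into ranges; right = FALSE makes each bin [lower, upper)
# The break points are an arbitrary choice for this example
toy$age.group <- cut( toy$age,
                      breaks = c(20, 40, 60, 80, 100),
                      labels = c("20-39", "40-59", "60-79", "80-99"),
                      right  = FALSE )

# Row proportions: share of each gender within every age range
t.age <- prop.table( table( toy$age.group, toy$gender ), margin = 1 )
round( t.age, 2 )
```

The same binned variable could be passed to group_by() in place of the raw age if a dplyr pipeline is preferred.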

CamarenaL commented 4 years ago

What your current code does is simply list every age in the data, split by male and female. While this gives us a list of all ages, what we really want is information that is easy to read and understand, such as who the youngest entrepreneur is, who the oldest is, and the median age of the entrepreneurs in the data.

Rather than listing every value in the data, what you want is the key information about the ages / incomes. We call these summary statistics tables. A basic summary statistics table reports the minimum, median, mean, and maximum of the variable of interest.

Using the dplyr package, you will want to group your observations by gender. We do this with group_by, which lets us build the summary statistics table separately for each gender group.

You will then use summarize, which computes summary statistics on the variable we are interested in. We want specific information from summarize, though, so once we call it we need to tell R exactly which statistics we want.

So as an example for Q6, you should try the following:

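library( dplyr )    # provides %>%, group_by() and summarize()
library( pander )   # provides pander() for formatted table output

# One row per gender with the min, median, mean and max age
dat %>% 
  group_by( gender ) %>% 
  summarize( min.age=min(age, na.rm=TRUE), 
             median.age=median(age, na.rm=TRUE), 
             mean.age=mean(age, na.rm=TRUE), 
             max.age=max(age, na.rm=TRUE) ) %>% 
  pander()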

To get a better understanding of how these functions can be used, type the following into R:

??group_by 
??summarize

summarize is a function you will use often to get an understanding of your data. As an evaluator, you want to make sure there is nothing odd in it. For example, there is a problem if someone listed their age as 8: they could not own a nonprofit organization. An age of 105 would be similarly implausible. Values like these tell us something may be incorrect and that we need to go back and check the data. You don't necessarily need to do this in this course, but data is never completely free of issues when we are the ones collecting it. Data that we find online has usually gone through checks and been cleaned. When we collect our own data, or get it from someone else or download it, this is how we make sure the values are plausible.
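As a rough illustration of that kind of plausibility check, the sketch below flags ages outside a chosen range. Everything in it is hypothetical: the data frame `toy`, its values, and the bounds of 18 and 100 are assumptions made for the example, not rules from the lab.

```r
# Hypothetical survey responses, including two implausible ages and one NA
toy <- data.frame(
  id  = 1:6,
  age = c(34, 8, 52, 105, NA, 61)
)

# Quick look at the minimum, median, mean and maximum
summary( toy$age )

# Assumed plausibility bounds; pick values that fit your own study
min.plausible <- 18
max.plausible <- 100

# Keep only the rows whose age is present and falls outside the bounds
suspect <- toy[ !is.na(toy$age) &
                ( toy$age < min.plausible | toy$age > max.plausible ), ]
suspect
```

Rows returned in `suspect` are candidates to double-check against the original survey, not values to delete automatically.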

Niagara1000 commented 4 years ago

Professor @CamarenaL @lecy ,

I tried the above code and I got an error when running the chi-square test on the resulting summarization table.

Here is the question for reference:

Question 6

Compare age at the time of nonprofit formation for male and female entrepreneurs.

Variable Name: age
Variable Type: numeric
Survey question: What was your age when you created the nonprofit?

My code:

library(dplyr)
library(pander)
t_q6 <- dat %>% 
  group_by( gender ) %>% 
  summarize( min.age=min(age, na.rm=T), 
             median.age=median(age, na.rm=T), 
             mean.age=mean(age, na.rm=T), 
             max.age=max(age, na.rm=T) ) %>% 
  pander()

Output:

gender	min.age	median.age	mean.age	max.age
Female	22	52	51.94	85
Male	29	57	55.05	82

My code

chisq.test(t_q6)

Output

Error in chisq.test(t_q6) : at least one entry of 'x' must be positive

My code

chisq.test(t_q6, simulate.p.value = TRUE, B=10000)

Output:

Error in chisq.test(t_q6, simulate.p.value = TRUE, B = 10000) : at least one entry of 'x' must be positive

What should I change to make the chi-square tests work?

lecy commented 4 years ago

The problem is you are using the test for the wrong type of data.

A contrast is a comparison of traits of study subjects across groups.

When the variable is categorical, the chi-square test tells you whether the proportions of each level of the factor (for example white, black, and asian in a race variable) are independent of group structure (treatment and control categories in this case). A statistically significant result tells you that you can guess the race based upon the study group, i.e. that the groups are NOT equivalent. We are hoping for non-significant results.

If you are comparing numeric variables, however, the comparison takes the form of a difference of means, and the appropriate test is a t-test, not a chi-square. Look back over the notes for example code.

The table is only used to get a look at the information on the variable. You are getting an error because the program does not know what you are asking it to test. You will run the test on the variable of interest. In this case, you will need to test the variables age and gender.
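To make the distinction concrete, here is a minimal sketch of both tests on made-up data. The data frame `toy` and all of its values are hypothetical, and gender stands in as the grouping variable. The code in the course notes may look different.

```r
set.seed(524)  # arbitrary seed so the simulated ages are reproducible

# Hypothetical data: 50 women and 50 men with simulated ages
toy <- data.frame(
  gender = rep( c("Female", "Male"), each = 50 ),
  age    = c( rnorm(50, mean = 52, sd = 10),
              rnorm(50, mean = 55, sd = 10) ),
  race   = rep( c("white", "black", "asian"), length.out = 100 )
)

# Categorical variable: cross-tabulate the raw counts, then run the chi-square
t.race <- table( toy$race, toy$gender )
chi    <- chisq.test( t.race )

# Numeric variable: difference of means with a t-test
# (t.test() uses the Welch version by default)
tt <- t.test( age ~ gender, data = toy )

chi$p.value
tt$p.value
tt$estimate   # mean age in each group
```

Note that both tests take the raw variables (or a table of counts built from them), not a table of summary statistics.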

Niagara1000 commented 4 years ago

[Screenshot: q6lab02]

Was this what I was supposed to do? If so, is it possible to get a corrected alpha from a t-test?

Is this the right way to state the results:

"the p-value of 0.0051 is below the alpha level of 0.05. Therefore, there is a statistically significant difference in the means of starting ages of females and males. So, the difference of the means shown in the output (51.939 vs. 55.048), though mathematically might seem small, is statistically significant, supporting the alternative hypothesis that females and males, on average, tend to start nonprofits at different ages."

Was it too wordy? haha

Thank you Professors! @lecy @CamarenaL

MeghanPaquette commented 4 years ago

Thank you for the feedback on both parts to the questions @lecy and @CamarenaL - Meghan

CamarenaL commented 4 years ago

@Niagara1000 That is what you needed to do. There's nothing wrong with how you stated the results. It's more than you need, but you're interpreting it correctly.

I'm not quite sure I understand what you mean about getting a corrected alpha from a t-test. Are you referring to Question 8? If so, the corrected alpha is only needed for 8A. 8B asks which of the 7 contrasts you tested has the lowest p-value. You then use 8B and look at all the other p-values to answer 8C.
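As a small worked example, and assuming the correction intended in 8A is the Bonferroni adjustment (the thread does not name it), the corrected alpha is the usual alpha divided by the number of contrasts. The count of 7 comes from this comment; swap in a different number if the lab instructions say otherwise.

```r
alpha       <- 0.05   # conventional significance level
n.contrasts <- 7      # number of contrasts tested, per the comment above

# Bonferroni-style correction (an assumption about what 8A asks for)
alpha.corrected <- alpha / n.contrasts
round( alpha.corrected, 4 )   # 0.0071

# The Q6 p-value reported earlier in this thread
p.q6 <- 0.0051
p.q6 < alpha.corrected        # TRUE
```

The correction changes the threshold that each p-value is compared against. The t-test output itself stays the same.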

AprilPeck commented 4 years ago

On 8A, do we use 7 as the number of contrasts (since we looked at 7 contrasts in the lab), or 11 since there are 11 columns in the original table?

CamarenaL commented 4 years ago

@AprilPeck Please look at the other Lab2 issue open on the discussion board from the other day. Another one of your colleagues asked this question.

DS4PS / cpp-524-sum-2020

Lab 2 Q6 & Q7 #3
