Open ClaudiaHebert opened 9 months ago
@ClaudiaHebert
The goal is to turn the graphs into tables and translate those tables into actionable insights. "Calculate each of these four descriptive statistics above as a function of the 24 hours of the day, and either print a table with times and counts/rates, or plot a graph of the statistics as a function of time similar to the examples above." I'll walk you through the first table to prime your work. In the first graph we have Total Accidents on the Y axis and Hour on the X axis. Let's build a table:
# Code
library(dplyr)

dat %>%
  group_by(hour) %>%
  summarize(accidents = n())  # n() counts the rows (accidents) in each hour
Now build a table for each graph and interpret your results.
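The other tables follow the same pattern with a different expression inside summarize(). Here is a minimal sketch on toy data, assuming hypothetical columns named `injuries` and `fatalities` (the real column names in `dat` may differ, and the statistics shown are only examples of a count and a rate):

```r
library(dplyr)

# Toy stand-in for the crash data; the column names are assumptions
toy <- data.frame(
  hour       = c(0, 0, 1, 1, 1, 2),
  injuries   = c(1, 0, 2, 0, 1, 0),
  fatalities = c(0, 0, 1, 0, 0, 0)
)

tab <- toy %>%
  group_by(hour) %>%
  summarize(
    accidents = n(),                         # crashes in this hour
    harm      = sum(injuries + fatalities),  # total injuries plus fatalities
    harm_rate = harm / accidents             # average harm per crash
  )

tab
```

Each row of `tab` is one hour, so the table can be read the same way as the matching graph.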
While this exercise is not asking you to change the Y and X variables, let's think about a good way to decide which is X and which is Y. The Y variable is the variable you are attempting to predict and the X variable is the variable that has some impact on Y. Let's say we want to predict the impact of caffeine on heart rate. We would have heart rate as our Y and caffeine as our X. This makes sense if we think causally: an increase in caffeine could, theoretically, cause an increase in heart rate. If we reversed this we could still run a regression model, but does it make sense to think that an increase in heart rate causes us to drink more caffeine? Not really. So we wouldn't want to think about caffeine as our Y. This is the same with hours of the day and accidents. The hour of the day could predict accidents, but it is hard to argue that accidents in Tempe, AZ cause the hours of the day. (Note: we could mine the data and come up with evidence that this happens! We could even predict hour of day given the accidents, but we could not conclude causality; that would be a spurious conclusion. This is why theory and research design are critical to program evaluation and research; more on that in other courses.)
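To see why the model alone cannot settle the direction, here is a small sketch with simulated (made-up) caffeine and heart-rate data. The variable names and numbers are assumptions for illustration only. Both regressions run, and in a simple regression the R-squared is the same in either direction, so only theory tells us which variable belongs on the Y side:

```r
set.seed(1)  # simulated data, not real measurements

caffeine   <- runif(200, min = 0, max = 400)             # mg per day (made up)
heart_rate <- 60 + 0.05 * caffeine + rnorm(200, sd = 5)  # beats per minute (made up)

fit_xy <- lm(heart_rate ~ caffeine)  # theory-driven direction
fit_yx <- lm(caffeine ~ heart_rate)  # reversed direction, still runs

# Both models fit equally well: R-squared is identical in either direction
summary(fit_xy)$r.squared
summary(fit_yx)$r.squared
```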
This is very helpful! I see now why it doesn't make sense to think of them as X and Y because of the lack of causal relationship. I also didn't realize we were making tables instead of graphs. Thanks for walking me through this.
I'm unsure how to edit the y variable being plotted in Part III. Q1 in Part III has the code completed for us already:
group_by(hour) %>%
  summarize(n = n()) %>%
  plot(type = "b", bty = "n", pch = 19, cex = 2,
       xlab = "Hour", ylab = "Total Number of Accidents",
       main = "Total Number of Accidents by Time of Day")
When looking this over I was unsure where the x and y variables were being defined. This looks different from other times we've used plot() and had to explicitly name x and y. I'm assuming the X must be coming from the group_by(), but where is the Y decided? I'm thinking it must be through the summarize(n = n()) call, but I'm not sure how.
For Q2 I attempted to change the argument in summarize to sum(injuries) - see code below:
dat %>%
  group_by(hour) %>%
  summarize(injuries = n(), injuries = sum(injuries)) %>%
  plot(type = "b", bty = "n", pch = 19, cex = 2,
       xlab = "Hour", ylab = "Total Number of Accidents",
       main = "Total Number of Accidents by Time of Day")
This runs but the graph looks exactly the same as the one above so it's clearly not right (plus I need to factor in fatalities). @JasonSills can you offer some insight? Thank you!
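A note on both questions, sketched on toy data with a hypothetical `injuries` column (the real column name in `dat` may differ). When plot() receives a data frame with exactly two columns, it uses the first column as x and the second as y, so group_by(hour) supplies the x and the column created in summarize() supplies the y. In the Q2 attempt, `injuries = n()` first reuses the name `injuries` for the row count, so the `sum(injuries)` that follows appears to just add up that count, which would explain why the graph looks identical:

```r
library(dplyr)

toy <- data.frame(
  hour     = c(0, 0, 1, 1, 1, 2),
  injuries = c(1, 0, 2, 0, 1, 0)  # hypothetical column name
)

# Q2 attempt: `injuries` becomes the row count, then sum() sees that count
same_as_counts <- toy %>%
  group_by(hour) %>%
  summarize(injuries = n(), injuries = sum(injuries))

# Summing the original column directly gives a different y
total_injuries <- toy %>%
  group_by(hour) %>%
  summarize(injuries = sum(injuries))

# Two columns in, so plot() takes hour as x and injuries as y
plot(total_injuries, type = "b", bty = "n", pch = 19,
     xlab = "Hour", ylab = "Total Injuries")
```

Dropping the `n()` term and updating `ylab` and `main` to match the new statistic should be enough to make the second graph differ from the first. Fatalities could be folded in with something like `sum(injuries + fatalities)`, assuming a `fatalities` column exists.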