cssearcy / AYS-R-Coding-SPR-2020

Coding in R for Policy Analytics
https://cssearcy.github.io/AYS-R-Coding-SPR-2020/

Lab 5 - Table Reporting # of Accidents, Injuries, Fatalities, and Proportion of Accidents Causing Harm #19

Open adrianc09 opened 4 years ago

adrianc09 commented 4 years ago

Hi all,

I'm having some issues creating the table needed for Part 1, Question 6. I'm very close, but my issue is that my new table repeats the days with different harm rates, Totalinjuries/fatalities, and n like so:

day  Totalinjuries  Totalfatalities  harm.rate     n
Mon              0                0  0.3036178  2872
Mon              0                1  0.3036178     6
Mon              0                2  0.3036178     1
Mon              1                0  0.3036178   900
Mon              1                1  0.3036178     1
Mon              1                3  0.3036178     1
Mon              2                0  0.3036178   230
Mon              3                0  0.3036178    60
Mon              3                1  0.3036178     1
Mon              4                0  0.3036178    16

The table goes on and on until Sunday.

Here is my current code:

grouped.dat <- group_by( dat, day, Totalinjuries, Totalfatalities, harm.rate = mean(Totalfatalities > 0 | Totalinjuries > 0) )
dplyr::summarize( n=n(), grouped.dat )

Can anyone provide me with a hint to create a more concise table like we see in the lab's instructions? Please let me know where I can expand on explaining my issue.

jamisoncrawford commented 4 years ago

> The table goes on and on until Sunday.

@adrianc09 this sounds like an expression, lol.

The more variables you input into group_by(), the more granular the subgroups are going to be. For example, using The Largest Vocabulary In HipHop data from The Pudding:

hiphop %>%
    group_by(era) %>%
    summarize(n = n())

# A tibble: 6 x 2
  era       n
  <chr> <int>
1 1980s     8
2 1990s    44
3 1999s     1
4 2000s    50
5 2010s    57
6 NA       26

Here, we're only getting six groups: one for each distinct value of era, including NA.

What if we add another variable, source, and group by both era and source?

hiphop %>%
    group_by(era, source) %>%
    summarize(n = n())

# A tibble: 16 x 3
# Groups:   era [6]
   era   source     n
   <chr> <chr>  <int>
 1 1980s site       8
 2 1990s new        1
 3 1990s poster     2
 4 1990s site      39
 5 1990s NA         2
 6 1999s poster     1
 7 2000s new       10
 8 2000s poster     8
 9 2000s site      30
10 2000s NA         2
11 2010s new       43
12 2010s poster     8
13 2010s site       6
14 NA    new        2
15 NA    site       2
16 NA    NA        22

Now we have 16 groups: one for each combination of era and source that actually occurs in the data.

Because you have so many variables in group_by(), you're getting a ton of aggregations!

Hint: Try using only one variable in group_by() and you should be able to get to the right answer. However, summarize() needs to have more values in it, i.e. injuries, fatalities, and harm.rate.
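As a side note on why harm.rate was the same on every row of the first table: an expression written inside group_by() is computed on the whole, ungrouped data frame before the groups are formed, whereas the same expression inside summarize() is computed once per group. Here is a minimal sketch of the difference. The toy data frame and the names in.group.by and in.summarize are made up for illustration; only the column names come from this thread.

```r
library(dplyr)

# Hypothetical toy data: column names follow the thread, values are invented.
toy <- data.frame(
  day             = c("Mon", "Mon", "Mon", "Tue", "Tue"),
  Totalinjuries   = c(0, 1, 2, 0, 0),
  Totalfatalities = c(0, 0, 1, 0, 1)
)

# Expression inside group_by(): evaluated over all rows first,
# so harm.rate is the overall rate, repeated for every group.
in.group.by <- toy %>%
  group_by(day, harm.rate = mean(Totalinjuries > 0 | Totalfatalities > 0)) %>%
  summarize(n = n()) %>%
  ungroup()

# Same expression inside summarize(): evaluated separately for each day.
in.summarize <- toy %>%
  group_by(day) %>%
  summarize(
    harm.rate = mean(Totalinjuries > 0 | Totalfatalities > 0),
    n         = n()
  )

print(in.group.by)   # harm.rate is identical for Mon and Tue
print(in.summarize)  # harm.rate differs by day
```

In the toy data, three of the five accidents cause harm, so the first table shows that overall rate for both days, while the second shows a separate rate for Mon and Tue.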

adrianc09 commented 4 years ago

Thanks for the hint! I took my variables out of group_by() except for day, and put the values Totalinjuries, Totalfatalities, and harm.rate in summarize(). However, this code ends up with a table with a lot more rows than anticipated (specifically 28,470 of them):

dat %>%
    group_by( day ) %>%
    summarize( Totalinjuries, Totalfatalities, harm.rate = mean(Totalinjuries > 0 | Totalfatalities > 0), n=n() )

I also run into the same problem with the days repeating with different values in each column.

jamisoncrawford commented 4 years ago

You're welcome! One thing to note is that, in summarize(), Totalinjuries and Totalfatalities aren't currently being summarized by any summary function (e.g. n(), mean(), etc.). So it's like:

You: "R, summarize something for me."
R: "K."
You: "Summarize every value in variable Totalinjuries."
R: "Wha... what?"
You: "Did I stutter?!"
R: [Nervously spits all the values back at you].
R: "Did I do good?"

😃
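To make the dialogue concrete, here is a small sketch of what changes once each column is wrapped in a summary function. It assumes that totals via sum() are what the table calls for, which the lab instructions would need to confirm. The toy data frame and the name by.day are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: column names follow the thread, values are invented.
toy <- data.frame(
  day             = c("Mon", "Mon", "Mon", "Tue", "Tue"),
  Totalinjuries   = c(0, 1, 2, 0, 0),
  Totalfatalities = c(0, 0, 1, 0, 1)
)

# A bare column inside summarize(), e.g. summarize(Totalinjuries), is not
# collapsed: depending on the dplyr version it returns one row per accident,
# warns, or errors. Wrapping each column in a summary function gives
# exactly one value per day instead.
by.day <- toy %>%
  group_by(day) %>%
  summarize(
    n          = n(),                   # number of accidents that day
    injuries   = sum(Totalinjuries),    # assumed: total injuries that day
    fatalities = sum(Totalfatalities)   # assumed: total fatalities that day
  )

print(by.day)  # one row per day
```

Each function call returns a single number per group, so the result has one row for each day rather than one row for each accident.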

adrianc09 commented 4 years ago

I got the correct output, thanks to the interaction you typed up 😆. Thank you so much for your help!

jamisoncrawford commented 4 years ago

Haha, I'm sure it was only in part thanks to my hypothetical dialogue 😝.

Glad you figured it out!