Open JasonSills opened 4 years ago
A bit of an update. I tried group_by(hour12) %>% and then mutate(n.hour=sum(n)) and it worked... but I don't see why it worked. Perhaps I'm confused about group_by. Can you elaborate? How can we have two group_by and not change the nature of the output?
Hi Jason - Can you reproduce the actual code you used? We'd be in a better position to explain if we could reproduce this on our machines. Thanks!
dat %>%
count(age,hour12)%>%
group_by(age)%>%
mutate(n.age=sum(n))%>%
group_by(hour12)%>%
mutate(n.hour=sum(n))%>%
mutate(p.age=round(n/n.age,2))%>%
mutate(p.hour=round(n/n.hour,2))%>%
filter(hour12==" 7 AM")
Thanks, Jason! I typically don't use multiple group_by() functions without using an ungroup() in between them, but it turns out that you get the same exact results when you use ungroup(), as well:
dat %>%
count(age, hour12) %>%
group_by(age) %>%
mutate(n.age = sum(n)) %>%
group_by(hour12) %>%
mutate(n.hour = sum(n)) %>%
mutate(p.age = round(n / n.age, 2)) %>%
mutate(p.hour = round(n / n.hour, 2)) %>%
filter(hour12 == " 7 AM")
... is the same as ...
dat %>%
count(age, hour12) %>%
group_by(age) %>%
mutate(n.age = sum(n)) %>%
ungroup() %>%
group_by(hour12) %>%
mutate(n.hour = sum(n)) %>%
mutate(p.age = round(n / n.age, 2)) %>%
mutate(p.hour = round(n / n.hour, 2)) %>%
filter(hour12 == " 7 AM")
The first group_by() gives you:
# A tibble: 211 x 4
# Groups: age [9]
age hour12 n n.age
<fct> <fct> <int> <int>
1 Age 16-18 12 AM 24 1458
2 Age 16-18 " 1 AM" 10 1458
3 Age 16-18 " 2 AM" 5 1458
4 Age 16-18 " 3 AM" 3 1458
5 Age 16-18 " 4 AM" 4 1458
6 Age 16-18 " 5 AM" 4 1458
7 Age 16-18 " 6 AM" 24 1458
8 Age 16-18 " 7 AM" 77 1458
9 Age 16-18 " 8 AM" 72 1458
10 Age 16-18 " 9 AM" 42 1458
This allows mutate() to compute n.age within each group of the age variable (i.e. for every age range).
Including the second grouping function:
dat %>%
count(age, hour12) %>%
group_by(age) %>%
mutate(n.age = sum(n)) %>%
ungroup() %>%
group_by(hour12) %>% # Here
mutate(n.hour = sum(n)) %>%
mutate(p.age = round(n / n.age, 2)) %>%
mutate(p.hour = round(n / n.hour, 2)) %>%
filter(hour12 == " 7 AM")
....gives you:
# A tibble: 9 x 7
# Groups: hour12 [1]
age hour12 n n.age n.hour p.age p.hour
<fct> <fct> <int> <int> <int> <dbl> <dbl>
1 Age 16-18 " 7 AM" 77 1458 1606 0.05 0.05
2 Age 18-25 " 7 AM" 408 8796 1606 0.05 0.25
3 Age 25-35 " 7 AM" 371 5456 1606 0.07 0.23
4 Age 35-45 " 7 AM" 243 3250 1606 0.07 0.15
5 Age 45-55 " 7 AM" 175 2679 1606 0.07 0.11
6 Age 55-65 " 7 AM" 116 1878 1606 0.06 0.07
7 Age 65-75 " 7 AM" 39 970 1606 0.04 0.02
8 Age 75-100 " 7 AM" 17 570 1606 0.03 0.01
9 NA " 7 AM" 160 3413 1606 0.05 0.1
... which shows mutate() computing n.hour within each hour12 group (every hour interval). Because group_by() replaces any existing grouping by default, the second call discards the age grouping, which is why the explicit ungroup() makes no difference. The proportions get really small because each of the 24 hours has its own distribution of accidents across the nine age groups.
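If it helps, here is a minimal sketch of the same pattern on the built-in mtcars data, so anyone can run it without the crash dataset. The columns cyl and gear are only stand-ins for age and hour12, and the object names are made up for illustration. It checks that the pipeline with ungroup() and the one without it end up with the same values and the same grouping.

```r
library(dplyr)

# Two group_by() calls in a row, no ungroup() in between
without_ungroup <- mtcars %>%
  count(cyl, gear) %>%
  group_by(cyl) %>%
  mutate(n.cyl = sum(n)) %>%
  group_by(gear) %>%          # replaces the cyl grouping by default
  mutate(n.gear = sum(n))

# Same pipeline with an explicit ungroup()
with_ungroup <- mtcars %>%
  count(cyl, gear) %>%
  group_by(cyl) %>%
  mutate(n.cyl = sum(n)) %>%
  ungroup() %>%
  group_by(gear) %>%
  mutate(n.gear = sum(n))

# Only the most recent grouping variable is still active
group_vars(without_ungroup)
#> [1] "gear"

# Compare the values after stripping the tibble classes
all.equal(as.data.frame(without_ungroup), as.data.frame(with_ungroup))
```

If you actually want to keep the earlier grouping and add to it, recent versions of dplyr have group_by(..., .add = TRUE) for that.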
@JasonSills does that sort of help?
@lecy what are your thoughts?
Yeah, group_by() is extremely powerful for doing these sorts of group margin exercises and adding calculations from multiple levels of the data back to the original dataset.
For example, if you had a dataset of individual incomes and you wanted to add block-level, county-level, and state-level averages back to your data you originally had to segment the data by each, calculate means, then do complex merges to get it all pasted back together. Now you just group, mutate, ungroup, and you are good to go!
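A rough sketch of that income example, using a tiny made-up data frame (the state, county, and income values are hypothetical and only there to show the pattern):

```r
library(dplyr)

# Hypothetical toy data: names and values are invented for illustration
incomes <- data.frame(
  state  = c("AZ", "AZ", "AZ", "AZ", "NM", "NM"),
  county = c("Maricopa", "Maricopa", "Pima", "Pima", "Bernalillo", "Bernalillo"),
  income = c(40000, 60000, 30000, 50000, 45000, 55000)
)

incomes_with_means <- incomes %>%
  group_by(state, county) %>%
  mutate(county.mean = mean(income)) %>%   # mean within each county
  group_by(state) %>%
  mutate(state.mean = mean(income)) %>%    # mean within each state
  ungroup()                                # leave the result ungrouped

incomes_with_means
```

Every original row is kept, and each one now carries its county-level and state-level average alongside the individual income.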
What's tricky, though, is that dplyr automatically drops one level of groups when you call certain functions like summarize(). So you need to be careful.
# Each call to summarise() removes a layer of grouping
by_vs_am <- mtcars %>% group_by(vs, am)
by_vs <- by_vs_am %>% summarise( n = n() )
by_vs
#> # A tibble: 4 x 3
#> # Groups: vs [2] <<<-----------------
#> vs am n
#> <dbl> <dbl> <int>
#> 1 0 0 12
#> 2 0 1 6
#> 3 1 0 7
#> 4 1 1 7
by_vs %>% summarise( n = sum(n) )
#> # A tibble: 2 x 2
#> vs n
#> <dbl> <int>
#> 1 0 18
#> 2 1 14
Or to Jamison's point, if you are not paying attention and your data is still grouped, then calling summarise( n = sum(n) ) might not give you at all what you expect.
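One way to guard against leftover grouping, assuming dplyr 1.0.0 or later, is the .groups argument of summarise(). The sketch below (object names are made up) compares the default behavior with .groups = "drop" on the same mtcars example:

```r
library(dplyr)

# Default: summarise() peels off the last grouping variable (am), vs remains
counts_default <- mtcars %>%
  group_by(vs, am) %>%
  summarise(n = n())

# Explicitly drop all grouping after summarising
counts_dropped <- mtcars %>%
  group_by(vs, am) %>%
  summarise(n = n(), .groups = "drop")

group_vars(counts_default)
#> [1] "vs"
group_vars(counts_dropped)
#> character(0)

# The same follow-up call now means two different things
total_by_vs <- counts_default %>% summarise(n = sum(n))   # one row per vs
total_all   <- counts_dropped %>% summarise(n = sum(n))   # one grand total
```

Checking group_vars() before a summarise() is a cheap way to confirm which of the two you are about to get.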
Perhaps I'm confused about group_by. Can you elaborate? How can we have two group_by and not change the nature of the output?
I think I understand your question now. Summarize creates a summary table. Mutate, on the other hand, is conducting a variable transformation to extend the current dataset. As such it always adds data back to the original data frame.
dplyr is clever enough to know that if we are adding data back, the new vector needs to be the same length as the original data frame, so it repeats each group's value across every row in that group:
library( dplyr )
x <- 1:12
f <- sample( c("A","B","C"), size=12, replace=TRUE )
d <- data.frame( f, x )
d <- arrange( d, f )
> d
f x
1 A 10
2 B 1
3 B 3
4 B 4
5 B 6
6 C 2
7 C 5
8 C 7
9 C 8
10 C 9
11 C 11
12 C 12
>
> d %>%
+ group_by( f ) %>%
+ summarize( n=n() )
# A tibble: 3 x 2
f n
<fct> <int>
1 A 1
2 B 4
3 C 7
>
> d %>%
+ group_by( f ) %>%
+ mutate( n=n() )
# A tibble: 12 x 3
# Groups: f [3]
f x n
<fct> <int> <int>
1 A 10 1
2 B 1 4
3 B 3 4
4 B 4 4
5 B 6 4
6 C 2 7
7 C 5 7
8 C 7 7
9 C 8 7
10 C 9 7
11 C 11 7
12 C 12 7
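Another way to see what the grouped mutate() is doing: it should give the same column you would get by building the summary table first and then merging it back onto the original rows. A small sketch with a fixed toy data frame (no random sampling, so it is reproducible; the object names are made up):

```r
library(dplyr)

d2 <- data.frame(
  f = c("A", "B", "B", "C", "C", "C"),
  x = 1:6
)

# Group, mutate, ungroup: the group size is repeated on every row
via_mutate <- d2 %>%
  group_by(f) %>%
  mutate(n = n()) %>%
  ungroup()

# The longer route: summarise to one row per group, then join back
group_sizes <- d2 %>%
  group_by(f) %>%
  summarize(n = n())

via_join <- left_join(d2, group_sizes, by = "f")

# Compare the two results as plain data frames
all.equal(as.data.frame(via_mutate), as.data.frame(via_join))
```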
Hi,
I think this is a simple issue, but I'm not able to finish it and was hoping for a pointer. I am attempting to create the n.hour column in the second problem of Part II. Similar to n.age using mutate(n.age=sum(n)), I'm attempting to use mutate. I see that what I want in this field is the total number of accidents, 1606. However, using n() or other functions simply doesn't get me there. What am I doing wrong or what should I try?