DS4PS / cpp-526-spr-2020

Course shell for CPP 526 Foundations of Data Science I for Spring 2020.
http://ds4ps.org/cpp-526-spr-2020/

Lab 5 Part II Problem 2 n.hour #23

Open JasonSills opened 4 years ago

JasonSills commented 4 years ago

Hi,

I think this is a simple issue, but I'm not able to finish it and was hoping for a pointer. I am attempting to create the n.hour column in the second problem of Part II. Similar to n.age, which uses mutate(n.age=sum(n)), I'm attempting to use mutate. I can see that the value I want in this field is the total number of accidents, 1606. However, n() and the other functions I've tried don't get me there. What am I doing wrong, or what should I try?

JasonSills commented 4 years ago

A bit of an update. I tried group_by(hour12) %>% and then mutate(n.hour=sum(n)), and it worked... but I don't see why it worked. Perhaps I'm confused about group_by. Can you elaborate? How can we have two group_by and not change the nature of the output?

jamisoncrawford commented 4 years ago

Hi Jason - Can you reproduce the actual code you used? We'd be in a better position to explain if we could reproduce this on our machines. Thanks!

JasonSills commented 4 years ago

dat %>%
  count(age,hour12)%>%
  group_by(age)%>%
  mutate(n.age=sum(n))%>%
  group_by(hour12)%>%
  mutate(n.hour=sum(n))%>%
  mutate(p.age=round(n/n.age,2))%>%
  mutate(p.hour=round(n/n.hour,2))%>%
  filter(hour12==" 7 AM")

jamisoncrawford commented 4 years ago

Thanks, Jason! I typically don't chain multiple group_by() calls without an ungroup() in between them, but it turns out that you get exactly the same results either way, because by default a new group_by() replaces the existing grouping rather than adding to it:

dat %>%
  count(age, hour12) %>%
  group_by(age) %>%
  mutate(n.age = sum(n)) %>%
  group_by(hour12) %>%
  mutate(n.hour = sum(n)) %>%
  mutate(p.age = round(n / n.age, 2)) %>%
  mutate(p.hour = round(n / n.hour, 2)) %>%
  filter(hour12 == " 7 AM")

... is the same as ...

dat %>%
  count(age, hour12) %>%
  group_by(age) %>%
  mutate(n.age = sum(n)) %>%
  ungroup() %>%
  group_by(hour12) %>%
  mutate(n.hour = sum(n)) %>%
  mutate(p.age = round(n / n.age, 2)) %>%
  mutate(p.hour = round(n / n.hour, 2)) %>%
  filter(hour12 == " 7 AM")

The first group_by() gives you:

# A tibble: 211 x 4
# Groups:   age [9]
   age       hour12      n n.age
   <fct>     <fct>   <int> <int>
 1 Age 16-18 12 AM      24  1458
 2 Age 16-18 " 1 AM"    10  1458
 3 Age 16-18 " 2 AM"     5  1458
 4 Age 16-18 " 3 AM"     3  1458
 5 Age 16-18 " 4 AM"     4  1458
 6 Age 16-18 " 5 AM"     4  1458
 7 Age 16-18 " 6 AM"    24  1458
 8 Age 16-18 " 7 AM"    77  1458
 9 Age 16-18 " 8 AM"    72  1458
10 Age 16-18 " 9 AM"    42  1458

This allows the mutate() to create n.age when grouped on variable age (i.e. every age range).

Including the second grouping function:

dat %>%
  count(age, hour12) %>%
  group_by(age) %>%
  mutate(n.age = sum(n)) %>%
  ungroup() %>%
  group_by(hour12) %>%              # Here
  mutate(n.hour = sum(n)) %>%
  mutate(p.age = round(n / n.age, 2)) %>%
  mutate(p.hour = round(n / n.hour, 2)) %>%
  filter(hour12 == " 7 AM")

....gives you:

# A tibble: 9 x 7
# Groups:   hour12 [1]
  age        hour12      n n.age n.hour p.age p.hour
  <fct>      <fct>   <int> <int>  <int> <dbl>  <dbl>
1 Age 16-18  " 7 AM"    77  1458   1606  0.05   0.05
2 Age 18-25  " 7 AM"   408  8796   1606  0.05   0.25
3 Age 25-35  " 7 AM"   371  5456   1606  0.07   0.23
4 Age 35-45  " 7 AM"   243  3250   1606  0.07   0.15
5 Age 45-55  " 7 AM"   175  2679   1606  0.07   0.11
6 Age 55-65  " 7 AM"   116  1878   1606  0.06   0.07
7 Age 65-75  " 7 AM"    39   970   1606  0.04   0.02
8 Age 75-100 " 7 AM"    17   570   1606  0.03   0.01
9 NA         " 7 AM"   160  3413   1606  0.05   0.1

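... which allows mutate() to compute n.hour within each hour12 group (i.e. every hour interval). The second group_by() replaces the grouping on age; it does not create groups for every combination of age and hour12. Each row is still one age-by-hour combination, though, because count(age, hour12) already produced one row per pair. The p.age proportions are small because each age group's accidents are spread across 24 hours, while p.hour shows how the accidents within a single hour are distributed across the nine age groups (including NA).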
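To make the "same results with or without ungroup()" point concrete, here is a minimal sketch on a small made-up data frame. The toy values are hypothetical and are not the lab's crash data; only the column names mirror the thread. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical counts standing in for the output of count(age, hour12)
toy <- data.frame(
  age    = c("young", "young", "old", "old"),
  hour12 = c("7 AM", "8 AM", "7 AM", "8 AM"),
  n      = c(10, 20, 30, 40)
)

# Version A: second group_by() directly after the first
a <- toy %>%
  group_by(age) %>%
  mutate(n.age = sum(n)) %>%
  group_by(hour12) %>%
  mutate(n.hour = sum(n))

# Version B: explicit ungroup() in between
b <- toy %>%
  group_by(age) %>%
  mutate(n.age = sum(n)) %>%
  ungroup() %>%
  group_by(hour12) %>%
  mutate(n.hour = sum(n))

# Only hour12 remains as a grouping variable in both versions
group_vars(a)
group_vars(b)

# Compare the data itself, ignoring the grouping metadata
all.equal(as.data.frame(a), as.data.frame(b))
```

If you actually want to keep the earlier grouping and add to it, recent dplyr versions expose an argument on group_by() for that (.add in dplyr 1.0.0 and later), but that is not what this lab problem needs.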

@JasonSills does that sort of help?

@lecy what are your thoughts?

lecy commented 4 years ago

Yeah, group_by() is extremely powerful for doing these sorts of group margin exercises and adding calculations from multiple levels of the data back to the original dataset.

For example, if you had a dataset of individual incomes and wanted to add block-level, county-level, and state-level averages back to your data, you previously had to split the data at each level, calculate means, and then do complex merges to paste it all back together. Now you just group, mutate, ungroup, and you are good to go!
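A rough sketch of that group, mutate, ungroup pattern might look like the following. The incomes data frame, its values, and the column names county.mean and state.mean are all invented for illustration.

```r
library(dplyr)

# Hypothetical individual incomes nested in counties within states
incomes <- data.frame(
  state  = c("AZ", "AZ", "AZ", "NY", "NY", "NY"),
  county = c("Pima", "Pima", "Maricopa", "Kings", "Kings", "Erie"),
  income = c(30, 50, 70, 60, 80, 100)
)

with_means <- incomes %>%
  group_by(state, county) %>%
  mutate(county.mean = mean(income)) %>%  # one value per county, repeated on each row
  group_by(state) %>%                     # replaces the state/county grouping
  mutate(state.mean = mean(income)) %>%   # one value per state, repeated on each row
  ungroup()                               # drop grouping before any further steps

with_means
```

The result keeps all six original rows, with each person's county and state averages attached as new columns, so no separate merge step is needed.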

What's tricky, though, is that dplyr automatically drops one level of groups when you call certain functions like summarize(). So you need to be careful.

# Each call to summarise() removes a layer of grouping
by_vs_am <- mtcars %>% group_by(vs, am)
by_vs <- by_vs_am %>% summarise( n = n() )
by_vs
#> # A tibble: 4 x 3
#> # Groups:   vs [2]        <<<-----------------
#>      vs    am     n
#>   <dbl> <dbl> <int>
#> 1     0     0    12
#> 2     0     1     6
#> 3     1     0     7
#> 4     1     1     7
by_vs %>% summarise( n = sum(n) )
#> # A tibble: 2 x 2
#>      vs     n
#>   <dbl> <int>
#> 1     0    18
#> 2     1    14

Or to Jamison's point, if you are not paying attention and your data is still grouped, then calling summarise( n = sum(n) ) might not give you at all what you expect.
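As a quick illustration of that pitfall, the sketch below reuses the mtcars example and compares the same summarise() call with and without an ungroup() first. The object names per_vs and total are just labels chosen here.

```r
library(dplyr)

by_vs <- mtcars %>%
  group_by(vs, am) %>%
  summarise(n = n())   # result is still grouped by vs

# Check which grouping variables are left over
group_vars(by_vs)

# Still grouped: one total per level of vs
per_vs <- by_vs %>% summarise(n = sum(n))

# Ungrouped first: a single grand total for the whole table
total <- by_vs %>% ungroup() %>% summarise(n = sum(n))

per_vs
total
```

Calling group_vars() before a summarise() is a cheap way to confirm what grouping is still in effect. In dplyr 1.0.0 and later, summarise() also accepts a .groups argument (for example .groups = "drop") to control this explicitly.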

lecy commented 4 years ago

"Perhaps I'm confused about group_by. Can you elaborate? How can we have two group_by and not change the nature of the output?"

I think I understand your question now. Summarize creates a summary table. Mutate, on the other hand, is conducting a variable transformation to extend the current dataset. As such it always adds data back to the original data frame.

dplyr is clever enough to know that if we are adding data back, the new vector needs to have the same number of rows as the original data frame, so it repeats each group's value on every row belonging to that group:

library( dplyr )
x <- 1:12
f <- sample( c("A","B","C"), size=12, replace=TRUE )
d <- data.frame( f, x )
d <- arrange( d, f )
> d 
   f  x
1  A 10
2  B  1
3  B  3
4  B  4
5  B  6
6  C  2
7  C  5
8  C  7
9  C  8
10 C  9
11 C 11
12 C 12
> 
> d %>%
+   group_by( f ) %>%
+   summarize( n=n() )
# A tibble: 3 x 2
  f         n
  <fct> <int>
1 A         1
2 B         4
3 C         7
> 
> d %>%
+   group_by( f ) %>%
+   mutate( n=n() )
# A tibble: 12 x 3
# Groups:   f [3]
   f         x     n
   <fct> <int> <int>
 1 A        10     1
 2 B         1     4
 3 B         3     4
 4 B         4     4
 5 B         6     4
 6 C         2     7
 7 C         5     7
 8 C         7     7
 9 C         8     7
10 C         9     7
11 C        11     7
12 C        12     7
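One way to think about "adding data back" is that a grouped mutate() behaves like building the summary table and then joining it onto the original rows. The sketch below checks that on a small fixed data frame (fixed rather than sampled so it is reproducible); the names via_mutate, counts, and via_join are made up for the example.

```r
library(dplyr)

d <- data.frame(
  f = c("A", "B", "B", "C", "C", "C"),
  x = 1:6
)

# Route 1: grouped mutate adds the group size to every row
via_mutate <- d %>%
  group_by(f) %>%
  mutate(n = n()) %>%
  ungroup()

# Route 2: build the summary table, then join it back by group
counts <- d %>%
  group_by(f) %>%
  summarize(n = n())

via_join <- left_join(d, counts, by = "f")

# Compare the two results as plain data frames
all.equal(as.data.frame(via_mutate), as.data.frame(via_join))
```

Both routes keep the six original rows; the grouped mutate() simply skips the intermediate table and the join.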