DS4PS / cpp-526-spr-2020

Course shell for CPP 526 Foundations of Data Science I for Spring 2020.
http://ds4ps.org/cpp-526-spr-2020/
3 stars 0 forks source link

Lab 06 #27

Open jpapczyn opened 4 years ago

jpapczyn commented 4 years ago

When I summarize I get the write output, but then whenever I try to join with the next pipe all my outputs change to the same value. I saw that you posted a video but I am unable to find the link now.

Salaries %>% group_by(teamID, yearID) %>% summarize(team.budget = sum(salary.adj), n = n()) %>% right_join(Salaries)

vs

Salaries %>% group_by(teamID, yearID) %>% summarize(team.budget = sum(salary.adj), n = n())

jamisoncrawford commented 4 years ago

I actually remade the video to be a bit clearer in demonstrating multiple ways to get proportions and rejoin them to existing datasets - you can find that here:

https://www.youtube.com/watch?v=578GmqHqoJ4

(Bookmarks coming soon).

Understood - let's break it down operation-by-operation:

Salaries %>%
  group_by(teamID, yearID) %>%
  summarize(team.budget = sum(salary.adj), n = n())

# A tibble: 918 x 4
# Groups:   teamID [35]
   teamID yearID team.budget     n
   <fct>   <int>       <dbl> <int>
 1 ANA      1997   54596239.    31
 2 ANA      1998   70278139.    34
 3 ANA      1999   91548199.    40
 4 ANA      2000   82584880.    30
 5 ANA      2001   74058241.    30
 6 ANA      2002   93359559.    28
 7 ANA      2003  116060667.    27
 8 ANA      2004  143338396.    27
 9 ARI      1998   55068602.    36
10 ARI      1999  113557242.    34
# ... with 908 more rows

You've summarized your data by total observations (n()) in each possible grouping of team and year (~918). When you rejoin these data, variable n will duplicate across each match team and year ID variable - so they don't duplicate for all 26,428 observations, but they do duplicate for each combination of year and team. Check it out:

> tmp <- Salaries %>%
+   group_by(teamID, yearID) %>%
+   summarize(team.budget = sum(salary.adj), n = n()) %>%
+   right_join(Salaries)
Joining, by = c("teamID", "yearID")

> unique(tmp$n)
 [1] 22 25 28 21 20 23 19 18 29 31 30 27 24 32 26 15  9 35  4 37 34
[22] 33 36 17 40 39 38 41 43

Ultimately, when you join a much smaller dataset with matching values and variables in a much larger dataset, the smaller dataset will duplicate its values to "fill in the gaps" or fit the matching data. It's still a summary number, it's just added to un-summarized information!

Hope this helps!