Open jpapczyn opened 4 years ago
I actually remade the video to be a bit clearer in demonstrating multiple ways to get proportions and rejoin them to existing datasets - you can find that here:
https://www.youtube.com/watch?v=578GmqHqoJ4
(Bookmarks coming soon).
Understood - let's break it down operation-by-operation:
Salaries %>%
group_by(teamID, yearID) %>%
summarize(team.budget = sum(salary.adj), n = n())
# A tibble: 918 x 4
# Groups: teamID [35]
teamID yearID team.budget n
<fct> <int> <dbl> <int>
1 ANA 1997 54596239. 31
2 ANA 1998 70278139. 34
3 ANA 1999 91548199. 40
4 ANA 2000 82584880. 30
5 ANA 2001 74058241. 30
6 ANA 2002 93359559. 28
7 ANA 2003 116060667. 27
8 ANA 2004 143338396. 27
9 ARI 1998 55068602. 36
10 ARI 1999 113557242. 34
# ... with 908 more rows
You've summarized your data by total observations (n()
) in each possible grouping of team and year (~918). When you rejoin these data, variable n
will duplicate across each match team and year ID variable - so they don't duplicate for all 26,428 observations, but they do duplicate for each combination of year and team. Check it out:
> tmp <- Salaries %>%
+ group_by(teamID, yearID) %>%
+ summarize(team.budget = sum(salary.adj), n = n()) %>%
+ right_join(Salaries)
Joining, by = c("teamID", "yearID")
> unique(tmp$n)
[1] 22 25 28 21 20 23 19 18 29 31 30 27 24 32 26 15 9 35 4 37 34
[22] 33 36 17 40 39 38 41 43
Ultimately, when you join a much smaller dataset with matching values and variables in a much larger dataset, the smaller dataset will duplicate its values to "fill in the gaps" or fit the matching data. It's still a summary number, it's just added to un-summarized information!
Hope this helps!
When I summarize I get the write output, but then whenever I try to join with the next pipe all my outputs change to the same value. I saw that you posted a video but I am unable to find the link now.
Salaries %>% group_by(teamID, yearID) %>% summarize(team.budget = sum(salary.adj), n = n()) %>% right_join(Salaries)
vs
Salaries %>% group_by(teamID, yearID) %>% summarize(team.budget = sum(salary.adj), n = n())