Open mtwelker opened 3 years ago
Hey @mtwelker! Just from looking at the code, without rerunning anything, it looks as though Method 1 isn't actually grouping or summarizing anything, but rather shows the name, year, and n. Method 2 is showing you the aggregated n over 1946 - 1964, rather than broken out by individual year (like Method 1). There should be a lot more rows in the first method because of all the unique years.
Here's the documentation for distinct(): https://dplyr.tidyverse.org/reference/distinct.html
Really, that's just eliminating duplicate rows if I understand the documentation correctly. I don't use distinct() often in R, but do use it quite a bit in SQL.
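In case it helps, here is a minimal sketch of how distinct() decides which row to keep when you give it a column and .keep_all = TRUE: it keeps the first row it encounters for each value, so whatever sort comes before it controls which row survives. The data frame toy and its counts are made up for illustration; they are not from the baby names file.

```r
library(dplyr)

# Made-up toy data (not the real baby names file)
toy <- data.frame(
  Name  = c("James", "James", "John", "John"),
  Year  = c(1946, 1947, 1946, 1947),
  Count = c(10, 30, 25, 20),
  stringsAsFactors = FALSE
)

# Sort so the largest Count comes first, then keep only the first
# row seen for each Name. .keep_all = TRUE retains the other columns.
peak <- toy %>%
  arrange(desc(Count)) %>%
  distinct(Name, .keep_all = TRUE)

print(peak)
```

With this toy data, the surviving row for James is the 1947 row and the surviving row for John is the 1946 row, because those are each name's highest count and therefore come first after the sort.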
If we're trying to "identify the top 10 male names for Baby Boomers," only one of those methods is producing the correct result, and I think it's Method 2. If so, I'm confused about why Method 1 is included as an example.
I'm not quite sure, either!
The goal is to find the most popular names during that period, so we don't care much about the counts other than to sort the list.
The main difference between the two approaches is that the first takes the yearly counts for men born within the Boomer window, sorts those rows by count from highest to lowest, then uses the distinct() function to keep the first occurrence of each name (its peak year) and drop the rest.
For example, James peaks in popularity in 1947:
> names %>%
+ filter( Gender =="M" & Year >= 1946 & Year <= 1964 & Name == "James" ) %>%
+ pander()
----------------------------------------
Id Name Year Gender Count
-------- ------- ------ -------- -------
427037 James 1946 M 87425
437158 James 1947 M 94755
447462 James 1948 M 88596
457723 James 1949 M 86856
468034 James 1950 M 86221
478442 James 1951 M 87175
489081 James 1952 M 87083
499839 James 1953 M 85946
510786 James 1954 M 86277
521861 James 1955 M 84130
533141 James 1956 M 84860
544604 James 1957 M 84242
556179 James 1958 M 78731
567869 James 1959 M 78597
579777 James 1960 M 76872
591897 James 1961 M 75896
604128 James 1962 M 72563
616410 James 1963 M 71332
628831 James 1964 M 73050
----------------------------------------
> names %>%
+ filter( Gender =="M" & Year >= 1946 & Year <= 1964 & Name == "James" ) %>%
+ arrange( desc( Count ) ) %>%
+ distinct( Name, .keep_all=T ) %>%
+ pander()
----------------------------------------
Id Name Year Gender Count
-------- ------- ------ -------- -------
437158 James 1947 M 94755
----------------------------------------
The second aggregates all instances for each name: group_by( Name ) %>% summarize( total=sum(Count) )
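To see the two recipes side by side, here is a rough sketch on made-up data. The counts are invented and deliberately chosen so that the two rankings disagree; the object names toy, by_peak, and by_total are just placeholders, and the column names simply mirror the ones in the output above.

```r
library(dplyr)

# Made-up counts, chosen so the two rankings disagree
toy <- data.frame(
  Name  = c("James", "James", "James", "Robert", "Robert", "Robert"),
  Year  = c(1946, 1947, 1948, 1946, 1947, 1948),
  Count = c(90, 40, 40, 70, 70, 70),
  stringsAsFactors = FALSE
)

# Approach 1: rank names by their single best year
by_peak <- toy %>%
  arrange(desc(Count)) %>%
  distinct(Name, .keep_all = TRUE)

# Approach 2: rank names by their total over the whole window
by_total <- toy %>%
  group_by(Name) %>%
  summarize(total = sum(Count)) %>%
  arrange(desc(total))

print(by_peak)
print(by_total)
```

In this toy case James ranks first by peak year (90 versus 70) while Robert ranks first by total (210 versus 170), which is the kind of reordering the two approaches can produce even when they return the same names.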
"Slightly different solutions but close" is referring to the final lists of names returned by each method. They both return the same set of five names. It's not referring to the counts.
As far as which is better, it just depends on how you operationalize the instructions "identify the most popular names given to Boomer men". Highest count in a given year in that window is one way to think about popular. Total count over that period is another way. Both are reasonable operationalizations, though I agree that the second makes more sense.
But pedagogically, it is just meant to show two different data recipes side by side so you can see how the steps are put together to build a solution. Sometimes bad examples are the most instructive because they make you think :-)
If we're trying to "identify the top 10 male names for Baby Boomers," only one of those methods is producing the correct result, and I think it's Method 2. If so, I'm confused about why Method 1 is included as an example.
Note that the two methods produced the same results in terms of a list of the 5 most popular names, just in a slightly different order: {James, Michael, Robert, John, David}
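If you want to check "same names, different order" in code, one possible way is to compare the two name vectors as sets. This is only a sketch: the five names are the ones listed above, but the two orderings below are invented for illustration and are not the actual output of either method.

```r
# Hypothetical orderings, invented for illustration only
method1_names <- c("James", "Robert", "John", "Michael", "David")
method2_names <- c("James", "Michael", "Robert", "John", "David")

# identical() is order-sensitive; setequal() ignores order
same_order <- identical(method1_names, method2_names)
same_set   <- setequal(method1_names, method2_names)

print(same_order)
print(same_set)
```

Both functions are base R, so nothing beyond a standard install is needed.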
Thanks so much for your answers! That makes much more sense now.
In the reading "Data Recipes Using Pipes," we are shown two different ways of finding the most popular baby names for Boomer males. The two methods produce dramatically different numbers and different orders, but the reading says "We can see that these two approaches to answering our question give us slightly different results, but are pretty close."
Method 1:
Method 2:
I'm trying to figure out why they're so different. In Method 1, is it just selecting the highest year for each name and then ordering by that number? (As opposed to Method 2, where it appears to be summing up the total for each name over all of those years, then ordering by that number.) What does the "distinct" function do in Method 1? Does it only pass one row per name to the next function? If so, how does it select which row to pass on? I've read the documentation I can find about "distinct," but I still don't understand what it's doing here.
If we're trying to "identify the top 10 male names for Baby Boomers," only one of those methods is producing the correct result, and I think it's Method 2. If so, I'm confused about why Method 1 is included as an example. Thanks for any help you can offer!