DS4PS / cpp-526-sum-2021

Course shell for CPP 526.
https://ds4ps.org/cpp-526-sum-2021/
MIT License

Difference in counts in "Data Recipes Using Pipes" reading for Week 5 #29

Open mtwelker opened 3 years ago

mtwelker commented 3 years ago

In the reading "Data Recipes Using Pipes," we are shown two different ways of finding the most popular baby names for Boomer males. The two methods produce dramatically different numbers and different orderings, but the reading says "We can see that these two approaches to answering our question give us slightly different results, but are pretty close."

Method 1: image

Method 2: image

I'm trying to figure out why they're so different. In Method 1, is it just selecting the year with the highest count for each name and then ordering by that count? (As opposed to Method 2, where it appears to be summing up the total for each name over all of those years, then ordering by that total.) What does the "distinct" function do in Method 1? Does it only pass one row per name to the next function? If so, how does it select which row to pass on? I've read the documentation I can find about "distinct," but I still don't understand what it's doing here.
If we're trying to "identify the top 10 male names for Baby Boomers," only one of those methods is producing the correct result, and I think it's Method 2. If so, I'm confused about why Method 1 is included as an example. Thanks for any help you can offer!

jamisoncrawford commented 3 years ago

Hey @mtwelker! Just by looking at the code but not rerunning anything, it looks as though Method 1 isn't actually grouping or summarizing anything, but rather shows the name, year, and n. In Method 2, it is showing you the aggregated n over 1946-1964, rather than broken out by individual year (like Method 1). There should be a lot more rows in the first method because of all the unique years.

Here's the documentation for distinct(): https://dplyr.tidyverse.org/reference/distinct.html

Really, that's just eliminating duplicate rows if I understand the documentation correctly. I don't use distinct() often in R, but do use it quite a bit in SQL.
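
As a rough sketch of what that means when a column name is passed, here is a tiny made-up data frame (the `toy` object and its counts are hypothetical, not the baby-names data). With `.keep_all = TRUE`, `distinct()` keeps the first row it encounters for each value of `Name`, so whatever sort order comes before it decides which row survives:

```r
library(dplyr)

# Hypothetical toy data, not the real baby-names dataset
toy <- data.frame(
  Name  = c("Ann", "Ann", "Bob"),
  Year  = c(1950, 1951, 1950),
  Count = c(10, 30, 20),
  stringsAsFactors = FALSE
)

# No sorting: the first "Ann" row in the original order (1950) is kept
first_seen <- toy %>%
  distinct(Name, .keep_all = TRUE)

# Sort by Count first: the first "Ann" row is now her highest-count year (1951)
peak_year <- toy %>%
  arrange(desc(Count)) %>%
  distinct(Name, .keep_all = TRUE)

print(first_seen)
print(peak_year)
```

Both results have one row per name; only the row chosen for "Ann" changes with the sort order.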

> If we're trying to "identify the top 10 male names for Baby Boomers," only one of those methods is producing the correct result, and I think it's Method 2. If so, I'm confused about why Method 1 is included as an example.

I'm not quite sure, either!

lecy commented 3 years ago

The goal is to find the most popular names during that period, so the counts matter only as a way to sort the list.

The main difference between the two approaches is that the first keeps the counts by year for men born within the Boomer window, sorts those name-year rows by count in descending order, and then uses distinct() to keep the first occurrence of each name (its peak year) and drop the rest.

For example, James peaks in popularity in 1947:

> names %>% 
+   filter( Gender == "M" & Year >= 1946 & Year <= 1964 & Name == "James" ) %>%
+   pander()

----------------------------------------
   Id     Name    Year   Gender   Count 
-------- ------- ------ -------- -------
 427037   James   1946     M      87425 

 437158   James   1947     M      94755 

 447462   James   1948     M      88596 

 457723   James   1949     M      86856 

 468034   James   1950     M      86221 

 478442   James   1951     M      87175 

 489081   James   1952     M      87083 

 499839   James   1953     M      85946 

 510786   James   1954     M      86277 

 521861   James   1955     M      84130 

 533141   James   1956     M      84860 

 544604   James   1957     M      84242 

 556179   James   1958     M      78731 

 567869   James   1959     M      78597 

 579777   James   1960     M      76872 

 591897   James   1961     M      75896 

 604128   James   1962     M      72563 

 616410   James   1963     M      71332 

 628831   James   1964     M      73050 
----------------------------------------

> names %>% 
+   filter( Gender == "M" & Year >= 1946 & Year <= 1964 & Name == "James" ) %>%
+   arrange( desc( Count ) ) %>%
+   distinct( Name, .keep_all=TRUE ) %>%
+   pander()

----------------------------------------
   Id     Name    Year   Gender   Count 
-------- ------- ------ -------- -------
 437158   James   1947     M      94755 
----------------------------------------

The second sums the counts across all years in the window for each name: group_by( Name ) %>% summarize( total=sum(Count) )

"Slightly different solutions but close" is referring to the final lists of names returned by each method. They both return the same set of five names. It's not referring to the counts.

As far as which is better, it just depends on how you operationalize the instructions "identify the most popular names given to Boomer men". Highest count in a given year in that window is one way to think about popular. Total count over that period is another way. Both are reasonable operationalizations, though I agree that the second makes more sense.
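
To see how the two operationalizations can disagree on ordering, here is a minimal sketch on made-up data (the `toy` data frame, the names "Al" and "Bo", and all counts are hypothetical). One name has a single big year and the other is steady across the window:

```r
library(dplyr)

# Hypothetical toy data: "Al" has one big year, "Bo" is steady every year
toy <- data.frame(
  Name  = c("Al", "Al", "Al", "Bo", "Bo", "Bo"),
  Year  = c(1946, 1947, 1948, 1946, 1947, 1948),
  Count = c(100, 10, 10, 60, 60, 60),
  stringsAsFactors = FALSE
)

# Operationalization 1: rank names by their single best year
by_peak <- toy %>%
  arrange(desc(Count)) %>%
  distinct(Name, .keep_all = TRUE)

# Operationalization 2: rank names by their total over the whole window
by_total <- toy %>%
  group_by(Name) %>%
  summarize(total = sum(Count)) %>%
  arrange(desc(total))

print(by_peak$Name)   # names ordered by peak-year count
print(by_total$Name)  # names ordered by window total

# Same set of names, but not necessarily the same order
print(setequal(by_peak$Name, by_total$Name))
print(identical(by_peak$Name, by_total$Name))
```

In this toy case "Al" ranks first by peak year (100) while "Bo" ranks first by total (180 versus 120), so the two recipes return the same names in a different order.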

But pedagogically, it is just meant to show two different data recipes side by side so you can see how the steps are put together to build a solution. Sometimes bad examples are the most instructive because they make you think :-)

lecy commented 3 years ago

> If we're trying to "identify the top 10 male names for Baby Boomers," only one of those methods is producing the correct result, and I think it's Method 2. If so, I'm confused about why Method 1 is included as an example.

Note that the two methods produced the same results in terms of a list of the 5 most popular names, just with a slightly different ordering: {James, Michael, Robert, John, David}

mtwelker commented 3 years ago

Thanks so much for your answers! That makes much more sense now.