Week 5’s second reading “Data recipes using pipes”

CynthiaLteif commented 2 years ago

I have a lot of questions about week 5’s second reading “Data recipes using pipes”. Apologies in advance for the long post.

1. In “building a recipe to identify the top 10 male names of baby boomers”, to keep only the most popular year for each name, we used distinct(Name, .keep_all = T). How does this keep ONLY the MOST popular year? What does .keep_all = T or F do?
2. In the alternative approach to “building a recipe to identify the top 10 male names of baby boomers”, to count the total number of men given each name during this period, we used dplyr::summarize(total = sum(count)). Could we have used mutate(total = sum(count)) instead?
3. Why do we use ungroup() after group_by()?
4. In the ggvis package: a. Does stroke = ~ identify the variable we’re examining? b. Does layer_lines() identify the type of graph? If yes, are there other types that could be used?
5. In the hipster.names section, I did not understand how the calculations were done in summarize(total = sum(count), peak = max(count)). a. Is total summing the count in df1, df2, and df3 for each name per year? b. Is peak choosing the highest count, belonging to a specific year?

Thank you in advance! @gmcirco

gmcirco commented 2 years ago

@CynthiaLteif Lots of good questions here:

1. Let's look at the code for this question:

names %>% filter( Gender =="M" & Year >= 1946 & Year <= 1964 ) %>% arrange( desc( Count ) ) %>% distinct( Name, .keep_all=T ) %>% top_n( 10, Count ) %>% select( Name, Year, Count ) %>% pander()

There are a lot of things happening here. arrange( desc( Count ) ) sorts the rows from highest to lowest by their counts. The original data has one row per name-year combination, so each name appears many times. distinct( Name, .keep_all=T ) then keeps only the first row it finds for each name. Because the rows are already sorted from highest to lowest count, that first row is the name's most popular year (for example, James in 1947). The .keep_all = T just tells R to retain all the columns, because by default distinct will throw away any columns you don't specify.
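A minimal sketch of what .keep_all changes, assuming a made-up data frame called toy (the names, years, and counts are invented for illustration and are not from the course data):

```r
library(dplyr)

# Hypothetical toy data: one row per name-year combination
toy <- data.frame(
  Name  = c("James", "James", "Robert", "Robert"),
  Year  = c(1946, 1947, 1946, 1947),
  Count = c(87000, 94000, 84000, 91000)
)

# Sort so that each name's largest count comes first
sorted <- toy %>% arrange(desc(Count))

# Default (.keep_all = FALSE): only the Name column survives
names_only <- sorted %>% distinct(Name)

# .keep_all = TRUE: the first row for each name is kept with all its columns,
# which after sorting is the row with that name's highest count
peak_years <- sorted %>% distinct(Name, .keep_all = TRUE)

print(names_only)
print(peak_years)
```

Running distinct() on the unsorted toy data instead would keep whichever row happens to come first for each name, which is why the arrange() step has to come before it.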

2. You can use mutate, but in this case we want to collapse the rows into one row per name. mutate will give you a column with the totals, but the value will be repeated on every row of each group you defined in group_by.
3. It is often good practice to ungroup after using group_by, because otherwise any subsequent dplyr functions applied to that data frame will still operate group by group. We might not want to perform a grouped operation, so ungrouping makes sure that you are starting from a normal data frame.
4. That's correct. The tilde marks the variables that are being mapped to the visualization. You then add layers afterwards. layer_lines() is probably the most appropriate one here, but you can see how it would look different if you added a bar chart via layer_bars().
5. The example in this section first finds subsets of names that meet certain criteria in df1, df2, and df3 (corresponding to: "(1) They were popular when your grandmother was young. (2) They were unpopular when your parents were young. (3) They have recently become popular again."). You are correct that the rows are name-year combinations. For each name, total adds up the counts across its rows, and peak is the highest count that name reached in any single year.
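The difference between summarize and mutate, and the effect of ungroup, can be sketched on a small made-up data frame. The names and counts below are invented for illustration and are not the course data:

```r
library(dplyr)

# Hypothetical toy data: one row per name-year combination
toy <- data.frame(
  Name  = c("Ada", "Ada", "Ada", "Mae", "Mae", "Mae"),
  Year  = c(1920, 1980, 2015, 1920, 1980, 2015),
  Count = c(5000, 400, 2500, 4000, 300, 2200)
)

# summarize() collapses each group to a single row:
# total = sum of counts over all years, peak = highest single-year count
collapsed <- toy %>%
  group_by(Name) %>%
  summarize(total = sum(Count), peak = max(Count))

# mutate() keeps all six rows and repeats the group total on each one
repeated <- toy %>%
  group_by(Name) %>%
  mutate(total = sum(Count))

# The result of mutate() is still grouped, so sum() is computed per name
still_grouped <- repeated %>% mutate(all = sum(Count))

# After ungroup(), the same call sums over the whole table
ungrouped <- repeated %>% ungroup() %>% mutate(all = sum(Count))

print(collapsed)
print(still_grouped)
print(ungrouped)
```

In still_grouped the all column simply repeats total, while in ungrouped it holds one grand total for every row. That is the kind of silent difference that ungrouping is meant to prevent.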

CynthiaLteif commented 2 years ago

Thank you @gmcirco. Much clearer. I have a couple of questions relating to Lab #5 Part 2, question 2.

1. I followed the same steps as the "instructions" example, yet I'm getting a completely different "n", which of course affects the proportion. Instead of n = 77 for age = 16-18 at 7 AM, I'm getting 157 (which is identical to sum(dat$age == "Age 16-18" & dat$hour12 == "7 AM", na.rm = TRUE)). What does n = 77 represent?
2. What are n.hour, p, and p.hour, and why are we expected to find them, considering that the question asks for the proportion of accidents at "7 AM" for each age group?

CynthiaLteif commented 2 years ago

@gmcirco using "hour" instead of "hour12" fixed the issue. Thanks! But I'm still unsure why we need to find n.hour and p.hour.

gmcirco commented 2 years ago

@CynthiaLteif

The question is a bit ambiguous about this, but the intention is that:

p = proportion of accidents for each age group at that hour
n.hour = number of accidents for that hour
p.hour = proportion of accidents for that hour
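One possible reading of these definitions, sketched on made-up data. The data frame dat below is a hypothetical stand-in for the lab data, and treating p as n divided by n.hour is an assumption, not the lab's official solution:

```r
library(dplyr)

# Hypothetical toy data: one row per accident
dat <- data.frame(
  age  = c("Age 16-18", "Age 16-18", "Age 18-25",
           "Age 18-25", "Age 18-25", "Age 16-18"),
  hour = c(7, 7, 7, 7, 8, 8)
)

by_age_hour <- dat %>%
  count(hour, age) %>%        # n = accidents for each age group in each hour
  group_by(hour) %>%
  mutate(
    n.hour = sum(n),          # all accidents in that hour, across age groups
    p      = n / n.hour       # assumed reading: age group's share of that hour
  ) %>%
  ungroup()

# Keep only the rows for the hour of interest
seven_am <- by_age_hour %>% filter(hour == 7)
print(seven_am)
```

Under this reading the p values within a single hour add up to 1, which gives a quick sanity check on the result.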

CynthiaLteif commented 2 years ago

If I understand correctly, the formulas for p and p.hour are identical, correct? @gmcirco

gmcirco commented 2 years ago

Yes, that is correct! (Unusual but correct. Personally, I do not like this question very much)

CynthiaLteif commented 2 years ago

Thank you very much. Was my least favorite too, haha! @gmcirco
