Open sunaynagoel opened 5 years ago
The first two questions require logical statements, similar to past labs. You can do this question solely with a logical statement (define group as mondays, count TRUEs).
When you start to get into two dimensional questions (3-5) then dplyr will be necessary.
Something like this?
sum (dat$day == "Mon", na.rm = T)
How do I use the function ceiling(x) to round up a number, or is there another way?
That is correct.
Can you please refresh the lab. I updated Part 1 to make this progression a little more clear.
Questions 1-2 are simple logical statements.
Question 3 is a compound logical statement.
Question 4 is a table (but using count() in dplyr instead of table()).
Questions 5-6 are summary statistics over groups:
Answers to questions 5-6 will take the following form.
dat %>% group_by( factor ) %>% summarize( my.stat = formula or logical statement )
1) How many accidents happen on Mondays? Sum over a logical statement.
2) What proportion of accidents each week occur on Monday? Mean of a logical statement.
3) What proportion of accidents on Mondays result in harm? Compound logical statement.
4) What is the most common type of accident (Collisionmanner) that occurs on Mondays? Use dplyr’s count() function.
5) Are there differences in the proportion of accidents that result in harm each day of the week?
   - Create a table of the proportion of accidents that result in harm each day of the week.
   - Use group_by() and summarize().
   - Note you can define custom summary statistics in summarize() using logical statements from above.
6) Create a table that reports the following for each day of the week:
   - Number of accidents
   - Number of people hurt in accidents (total injuries)
   - Number of people killed in accidents (total fatalities)
   - Proportion of accidents resulting in harm (injuries + fatalities)
Like this?
> round( 1.5732, 2 )
[1] 1.57
> ceiling( 1.5732 )
[1] 2
> floor( 1.5732 )
[1] 1
Yes. But it did not work for me; maybe I have a mismatch in the type of data. It just returns 1. I was working on the proportion question. Why do we need to use "mean"? Can't we do it as accidents on Monday / total accidents? It gives a very small number like 0.00000485, and I was trying to present it in a better way.
Are you applying the ceiling to a logical vector?
Yes, the proportion can be calculated as:
these.mondays <- dat$day == "Mon"
sum( these.mondays ) / length( these.mondays )
#OR
mean( these.mondays)
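A quick check on a made-up vector may help show why the two forms agree. The `day` values here are invented for illustration and are not from the lab data: `sum()` counts the TRUEs, `length()` counts all elements, so their ratio is the mean of the logical vector.

```r
# toy data: hypothetical days for six accidents (not the lab dataset)
day <- c( "Mon", "Tue", "Mon", "Fri", "Sat", "Mon" )

these.mondays <- day == "Mon"     # logical vector, TRUE for Monday accidents

n.mondays <- sum( these.mondays )                             # count of TRUEs
p.by.hand <- sum( these.mondays ) / length( these.mondays )   # TRUEs / all cases
p.by.mean <- mean( these.mondays )                            # same proportion

n.mondays    # 3
p.by.mean    # 0.5
```

If the proportion needs tidying for a report, `round( p.by.mean, 2 )` keeps two decimals, whereas `ceiling()` would push any proportion above zero up to 1.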
I think I see my mistake. I always get confused about when to use sum and when to use length.
It's like most things in programming. It's not obvious until it's obvious.
I can't get my arrange (desc (Collisionmanner)) %>% to work. It does not give any errors, but it does not change anything in the output either.
You don't need a pipe after the function unless you are sending the results to another function. Is that it?
I have pander () right after that.
Ah, you should arrange on the count, not the name. The names are already in order.
http://ds4ps.org/dp4ss-textbook/p-070-data-verbs.html#arrange-sorts-data
Still could not get it to work.
dat %>%
filter( day =="Mon" ) %>%
group_by( Collisionmanner ) %>%
count (Collisionmanner, name = " On Mondays")%>%
**#arrange (desc()) %>%**
pander()
What can go in the bold line ?
dplyr names the count variable n so you just need to reference that:
dat %>%
count( day, Collisionmanner ) %>%
filter( day == "Mon" ) %>%
arrange( desc(n) ) %>%
pander()
Note that in core R after using the table() function you can't reference cells.
table( dat$day, dat$Collisionmanner )
This is what makes dplyr quite powerful, also much easier to report specific statistics in small and nicely-formatted tables.
Part 2, Question #1: I am able to generate a table with age group and time, but to make it more user-friendly I want to put it in grid form. I tried the spread function and tapply. It's not working out. I am sure I am missing something.
It would be something like this:
dat %>%
count( day, Collisionmanner ) %>%
spread( key = Collisionmanner, value = n ) %>%
pander()
Or:
dat %>%
group_by( f1, f2 ) %>%
summarize( my.stat = formula or logical statement ) %>%
spread( f1, my.stat )
In core R this is something like:
tapply( X=these.mondays, INDEX=list(dat$day, dat$Collisionmanner), FUN=mean )
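As a rough sketch of the count-then-spread pattern above, here is a version that runs on a tiny invented dataset. The column `manner` and its values are stand-ins for Collisionmanner, not the lab data; note that `spread()` comes from tidyr, so that package has to be loaded as well.

```r
library( dplyr )
library( tidyr )   # spread() lives in tidyr, not dplyr

# toy data with made-up values, standing in for day and Collisionmanner
d <- data.frame(
  day    = c( "Mon", "Mon", "Mon", "Tue", "Tue", "Tue" ),
  manner = c( "Rear End", "Rear End", "Angle", "Angle", "Angle", "Rear End" ) )

grid <-
  d %>%
  count( day, manner ) %>%              # one row per day-manner cell, count in n
  spread( key = manner, value = n )     # manner levels become columns

grid
```

Each row is now a day and each column a collision manner, which is the grid layout asked about.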
It says it can't find the function spread. Do I need to load something for that?
For part 2 I copied the code provided to format the date and time, and it seems the times are not populating correctly. It is showing all crash events happening at either 1 AM or 1 PM. Here is the code I used:
date.vec <- strptime( dat$DateTime, format="%m/%d/%y %H:%M" )
dat$hour <- format( date.vec, format="%H" )
dat$month <- format( date.vec, format="%b" )
dat$day <- format( date.vec, format="%a" )
dat$day365 <- format( date.vec, format="%j" )
dat$week <- format( date.vec, format="%V" )
dat$day <- factor( dat$day, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))
table( dat$day ) %>% pander()
dat$hour12 <- format( date.vec, format="%1 %p" )
time.levels <- c("12 AM", "1 AM", "2 AM", "3 AM", "4 AM", "5 AM", "6 AM", "7 AM", "8 AM", "9 AM", "10 AM", "11 AM", "12 PM", "1 PM", "2 PM", "3 PM", "4 PM", "5 PM", "6 PM", "7 PM", "8 PM", "9 PM", "10 PM", "11 PM")
dat$hour12 <- factor( dat$hour12, levels=time.levels )
table( dat$hour12 ) %>% pander()
and here is the table it spits out.
------------------------------------------------------------------------------
12 AM 1 AM 2 AM 3 AM 4 AM 5 AM 6 AM 7 AM 8 AM 9 AM 10 AM
------- ------ ------ ------ ------ ------ ------ ------ ------ ------ -------
0 8989 0 0 0 0 0 0 0 0 0
------------------------------------------------------------------------------
Table: Table continues below
-------------------------------------------------------------------------------
11 AM 12 PM 1 PM 2 PM 3 PM 4 PM 5 PM 6 PM 7 PM 8 PM 9 PM
------- ------- ------- ------ ------ ------ ------ ------ ------ ------ ------
0 0 19481 0 0 0 0 0 0 0 0
-------------------------------------------------------------------------------
Is there another setting I forgot to include?
library( tidyr )
@etbartell Did you include:
date.vec <- strptime( dat$DateTime, format="%m/%d/%y %H:%M" )
dat$hour <- format( date.vec, format="%H" )
dat$month <- format( date.vec, format="%b" )
dat$day <- format( date.vec, format="%a" )
dat$day365 <- format( date.vec, format="%j" )
dat$week <- format( date.vec, format="%V" )
PS - please format your questions correctly using markdown.
@lecy Yes, I included that code as well. I added all the code I used to the original post above. And sorry, I'll make sure to format next time.
@etbartell I'm not sure, can you send your RMD file via email so I can see all of your steps?
@etbartell Ok, this is a subtle but important difference.
Here is the code from the lab instructions:
levels=c( "12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM",
" 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM",
"12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM",
" 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" )
Here is your code:
time.levels <- c( "12 AM", "1 AM", "2 AM", "3 AM", "4 AM", "5 AM", "6 AM", "7 AM", "8 AM",
"9 AM", "10 AM", "11 AM", "12 PM", "1 PM", "2 PM", "3 PM", "4 PM", "5 PM", "6 PM",
"7 PM", "8 PM", "9 PM", "10 PM", "11 PM" )
Do you see the difference?
R interprets strings literally, so the words "ok" and "OK" and "Ok" are considered distinct. Similarly, " 1 AM" is different than "1 AM".
The line with the factor() function was turning a character vector into a factor with a fixed level order, so that tables and graphs print the times in the correct order. The levels=time.levels argument assigns that order. Any value that does not exactly match one of the strings in time.levels is converted to NA, including all of the misspellings.
Does that make sense? I am guessing you typed these in by hand?
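A small illustration of that behavior, using a few invented hour labels rather than the lab data: values that do not exactly match a level silently become NA and disappear from the table.

```r
# made-up hour labels, written with the leading space used in the lab's levels
hour12 <- c( " 1 AM", " 1 AM", "10 AM", " 7 AM" )

padded.levels   <- c( " 1 AM", " 7 AM", "10 AM" )   # match the strings exactly
unpadded.levels <- c( "1 AM", "7 AM", "10 AM" )     # missing the leading space

f.good <- factor( hour12, levels=padded.levels )
f.bad  <- factor( hour12, levels=unpadded.levels )

table( f.good )           # every value is counted
table( f.bad )            # only "10 AM" is counted
sum( is.na( f.bad ) )     # 3 values became NA without any warning
```

Checking `sum( is.na( ) )` right after a `factor()` call is a cheap way to catch this kind of mismatch.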
I am still trying to make sense of Part 2, #2. Anyone else? I know it is a multi-step calculation, but I have tried various permutations and combinations and nothing seems to work. This is as close as I could get. It gives an error in the summarize function, so I am not sure whether it will generate the desired output.
dat %>%
group_by (age) %>%
mutate (agewise.total.accidents = length(Totalfatalities))%>%
group_by (hour12) %>%
agewise.acciednts <- sum (dat$Totalinjuries>0,na.rm = TRUE) + sum(dat$Totalfatalities > 0,na.rm = TRUE) %>%
summarize( age.prop = mean (agewise.acciednts | agewise.total.accidents )) %>%
spread(key = age, value = age.prop) %>%
pander()
Error Message - Error in UseMethod("summarise") : no applicable method for 'summarise' applied to an object of class "c('integer', 'numeric')"
OK, remember to start by writing down your data recipe, then translate to code.
Let's work with this sample dataset:
f1 | f2 | x |
---|---|---|
A | L-01 | 1 |
A | L-01 | 1 |
A | L-02 | 1 |
A | L-02 | 1 |
A | L-02 | 1 |
A | L-03 | 1 |
B | L-01 | 1 |
B | L-01 | 1 |
B | L-03 | 1 |
B | L-03 | 1 |
Instead of age and hour12, we will use factors f1 and f2. We want to know the proportion of each f2 within f1.
My data recipe, then, will be:
1) Calculate the frequency of each f1-f2 combination (each cell in our table).
2) Calculate the frequency of each f1 group.
3) Combine steps 1-2 into a single dataset.
4) Divide each f1-f2 group count by the f1 group count.
d %>% count( f1, f2 )
# or
d %>% group_by( f1, f2 ) %>% summarize( n=n() )
f1 | f2 | n |
---|---|---|
A | L-01 | 2 |
A | L-02 | 3 |
A | L-03 | 1 |
B | L-01 | 2 |
B | L-03 | 2 |
d %>% count( f1 )
# or
d %>% group_by( f1 ) %>% summarize( n.f1 = n() )
f1 | n |
---|---|
A | 6 |
B | 4 |
The problem with Step 2 is that count() and summarize() drop all columns except the summary stats. We know, however, that mutate() creates new variables and keeps the rest of the dataset. So the trick is to replace summarize with mutate.
d %>% count( f1, f2 ) %>% group_by( f1 ) %>% mutate( n.f1 = n() )
f1 | f2 | n | n.f1 |
---|---|---|---|
A | L-01 | 2 | 3 |
A | L-02 | 3 | 3 |
A | L-03 | 1 | 3 |
B | L-01 | 2 | 2 |
B | L-03 | 2 | 2 |
That's not quite right because we are counting the occurrences of f1 in a summary table. We know that there are 6 f1's, not 3. Each row of f1 occurs n times in the original dataset. So we need to sum over n, not count rows.
d %>% count( f1, f2 ) %>% group_by( f1 ) %>% mutate( n.f1 = sum(n) )
f1 | f2 | n | n.f1 |
---|---|---|---|
A | L-01 | 2 | 6 |
A | L-02 | 3 | 6 |
A | L-03 | 1 | 6 |
B | L-01 | 2 | 4 |
B | L-03 | 2 | 4 |
That looks better!
The last step is then to divide (f1-f2) / f1.
d %>%
count( f1, f2 ) %>%
group_by( f1 ) %>%
mutate( n.f1 = sum(n) ) %>%
mutate( prop= n / n.f1 )
f1 | f2 | n | n.f1 | prop |
---|---|---|---|---|
A | L-01 | 2 | 6 | 0.33 |
A | L-02 | 3 | 6 | 0.50 |
A | L-03 | 1 | 6 | 0.17 |
B | L-01 | 2 | 4 | 0.50 |
B | L-03 | 2 | 4 | 0.50 |
We can then filter by a specific factor to answer a question like what age of driver is most dangerous at 7am?
d %>%
count( f1, f2 ) %>%
group_by( f1 ) %>%
mutate( n.f1 = sum(n) ) %>%
mutate( prop= round( n / n.f1, 2 ) ) %>%
filter( f2 == "L-01" )
f1 | f2 | n | n.f1 | prop |
---|---|---|---|---|
A | L-01 | 2 | 6 | 0.33 |
B | L-01 | 2 | 4 | 0.50 |
Your results should look something like this, with p.age representing the within-age proportions.
The p.age answers the question, who are the most dangerous drivers in the morning?
The p.hour answers the question, who is causing the most accidents at 7am (of all accidents at 7am, what proportion is each age group responsible for)?
age | hour12 | n | n.age | n.hour | p | p.age | p.hour |
---|---|---|---|---|---|---|---|
16-18 | 7 AM | 74 | 1391 | 1529 | 0.05 | 0.05 | 0.05 |
18-25 | 7 AM | 397 | 8400 | 1529 | 0.26 | 0.05 | 0.26 |
25-35 | 7 AM | 353 | 5194 | 1529 | 0.23 | 0.07 | 0.23 |
35-45 | 7 AM | 238 | 3085 | 1529 | 0.16 | 0.08 | 0.16 |
45-55 | 7 AM | 169 | 2528 | 1529 | 0.11 | 0.07 | 0.11 |
55-65 | 7 AM | 109 | 1777 | 1529 | 0.07 | 0.06 | 0.07 |
65-75 | 7 AM | 38 | 918 | 1529 | 0.02 | 0.04 | 0.02 |
75-100 | 7 AM | 15 | 538 | 1529 | 0.01 | 0.03 | 0.01 |
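One possible way to get both within-row and within-column proportions in a single pipeline is to regroup between the two steps. This is only a sketch on the f1/f2 sample data from earlier, with f1 playing the role of age and f2 the role of hour12; the names `p.f1` and `p.f2` are made up here to mirror p.age and p.hour.

```r
library( dplyr )

# the sample dataset from above: f1 plays the role of age, f2 of hour12
d <- data.frame(
  f1 = c( "A","A","A","A","A","A","B","B","B","B" ),
  f2 = c( "L-01","L-01","L-02","L-02","L-02","L-03","L-01","L-01","L-03","L-03" ) )

props <-
  d %>%
  count( f1, f2 ) %>%
  group_by( f1 ) %>%
  mutate( n.f1 = sum(n), p.f1 = n / n.f1 ) %>%   # share of each f2 within f1
  group_by( f2 ) %>%                             # regroup: replaces the f1 grouping
  mutate( n.f2 = sum(n), p.f2 = n / n.f2 ) %>%   # share of each f1 within f2
  ungroup()

props %>% filter( f2 == "L-01" )
```

The second `group_by()` overrides the first, so `sum(n)` totals over f2 groups in the second `mutate()`.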
Maybe this is not surprising when we consider:
For some reason
filter ( hour12 == "7 AM" ) %>%
is not working. I tried different times and noticed that it works for all the two-digit times (e.g., 10 AM, 11 AM, 10 PM) but not for any single-digit time. I am not sure what the reason is.
You need to be really careful with strings. It is:
filter ( hour12 == " 7 AM" ) %>%
I see the difference now. I feel so silly for overlooking this.
It's subtle!
I have a question on Part 1: Question 6. The code I am using is this:
grouped.dat <- group_by (dat, day)
summarize(grouped.dat, n=n(),
injuries= sum(dat$Totalinjuries),
fatalities=sum(dat$Totalfatalities),
harm.rate=mean(injuries.fatalities) )
The table that it's returning had the correct number for n, grouped by days, but the other columns seem to be sums of all the days. So like each day has a total of 69 fatalities. Why aren't the last three columns breaking down by each day individually?
Note that dplyr functions have you reference the data set first, then variable names directly.
summarize( dat, total.x=sum( x ) )
You are using core R conventions by referencing your vectors with dat$, which means it's using the original dataset dat instead of your grouped dataset grouped.dat inside of the sum functions.
summarize( dat, total.x=sum( dat$x ) )
See the difference?
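To make the difference concrete, here is a minimal sketch on invented data (`dat.toy` and its values are hypothetical). Inside `summarize()`, the bare name `x` is evaluated within each group, while `dat.toy$x` always pulls the full column from the original data frame.

```r
library( dplyr )

# toy data with invented values
dat.toy <- data.frame(
  day = c( "Mon", "Mon", "Tue" ),
  x   = c( 1, 2, 10 ) )

grouped <- group_by( dat.toy, day )

by.group <- summarize( grouped, total.x = sum( x ) )           # x from each group
whole    <- summarize( grouped, total.x = sum( dat.toy$x ) )   # x from the full data frame

by.group$total.x   # 3 10
whole$total.x      # 13 13
```

The second version repeats the grand total on every row, which matches the symptom of every day showing the same number of fatalities.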
I have a question, am I on the right track with this code or am I not understanding what is happening? This is for question 5 part 1 Code:
dat$day <- factor( dat$day, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun") )
table( dat$day ) %>% pander()
dat %>% group_by( factor ) %>% summarize( my.stat = mean() )
This is also another code I tried doing for question 5 part 1, am I not getting this?
dat$day <- factor( dat$day, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun") )
table( dat$day ) %>% pander()
my.stat<- mean(dat$day)
dat %>% group_by( my.stat ) %>% summarize( my.stat )
Let me say it this way. The exercise is to help you think about what type of table you are creating:
Cases 1 and 2 are both variations of counts of rows that occur within each group (a proportion is a count of a group divided by a total count - but still just counting rows).
Case 3 is a little distinct because you first create a group, then calculate a statistic for the subset of data that belongs to each level in the group.
weekends <- dat$day == "Sat" | dat$day == "Sun"
# CASE 1 - COUNTS
sum( weekends ) # number of accidents in dataset falling on weekends
# counts of all days at once, core R and dplyr
table( dat$day )
count( dat, day )
# CASE 2 - PROPORTIONS
mean( weekends ) # proportion of dataset that falls on the weekend
# proportions of days
table( dat$day ) / nrow(dat)
dat %>% group_by( day ) %>% summarize( prop.day = n() / nrow(dat) )
# CASE 3 - SUMMARY STATS OVER GROUPS
dat %>% group_by( day ) %>% summarize( mean.x = mean(x) )
# over user-defined group
dat %>%
mutate( weekend = ( day == "Sat" | day == "Sun" ) ) %>%
group_by( weekend ) %>%
summarize( mean.x = mean(x) )
Note that Case 1 and Case 2 are only working with factors (days, weekends). If that's the case you will tend to use table() or count().
Case 3 is working with both a factor and a numeric vector. You group by the factor, and calculate the statistic with your vector x. This is where group_by() and summarize() become really powerful.
@jmacost5 See the explanation on Case 1, 2, and 3 above.
I think you are conflating Case 2 and Case 3:
my.stat<- mean(dat$day)
Day is a categorical variable, so you would either count the number of cases in each group, or look at proportions of cases within each group.
You cannot calculate the mean of a group. You can look at a proportion if you first translate the group to a logical vector, though.
mean( dat$day == "Wed" )
I understand what you did for Part 2, but now I'm having trouble putting that together with the plot function since it is not part of dplyr. If I try to create a dataset with just the counts from each time, then feed it into the plot function it gives me a whole lot of points (I'm guessing every point in the original dataset). I don't get how the summary dataset from dplyr gets converted into a graphic.
dat %>%
group_by( hour12 ) %>%
summarize( n.time = n() )
plot( as.numeric(dat$hour12), dat$n.time,
pch=19, type="p", cex=1, bty="n",
xlab="Hour of the Day",
ylab="Total Injuries or Fatalities",
main="Total Injuries or Fatalities by Hour of the Day")
Let me say it this way. The exercise is to help you think about what type of table you are creating:
- count of things by group
- proportion of things by group
- stat calculated from variable x for each level of group f1
I guess I am confused because I am not getting numbers
Mon | NA | | |
Tue | NA | | |
Wed | NA | | |
Thu | NA | | |
Fri | NA | | |
Sat | NA | | |
Sun | NA
^ the variable x is just a placeholder. You need to use a real variable in your dataset.
The plot() function requires two numeric vectors. I see you are trying to convert hour12 to numeric, which is fine since the factor is ordered, so 12AM will be 1, 1AM will be 2, 2AM will be 3, etc.
Your problem is that you are trying to plot the full dataset instead of the summary table. Try something like:
d2 <-
dat %>%
count( hour12 )
plot( as.numeric(d2$hour12), d2$n )
Or you can use the 24-hour version of time (dat$hour, converted with as.numeric()) so that each group is a number from 0 to 23; then you don't have to worry about the factor-to-numeric conversion going wrong.
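A sketch of the 24-hour route, assuming the hour column holds character values like those returned by `format( date.vec, format="%H" )`. The data below are invented, and `dat.toy` is a hypothetical stand-in for the lab dataset.

```r
library( dplyr )

# toy data: invented 24-hour values stored as character strings
dat.toy <- data.frame( hour = c( "00", "07", "07", "08", "17", "17", "17" ) )

d2 <-
  dat.toy %>%
  mutate( hour = as.numeric( hour ) ) %>%   # "07" becomes 7
  count( hour )                             # one row per hour, count in n

# plot the summary table, not the full dataset
plot( d2$hour, d2$n, pch=19, type="b", bty="n",
      xlab="Hour of the Day (0-23)", ylab="Number of Accidents" )
```

The key point is that both vectors passed to `plot()` come from the summary table `d2`, so there is one point per hour.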
dat %>% group_by( day ) %>% summarize( mean = mean(day, na.rm = T) )
This is my code. I guess I am not understanding what I should be putting in the chunk because I keep getting NA for the answers
Since I am trying to keep going I tried the next question and still did not understand how to store a value for harm.rate.
grouped.dat <- group_by (dat, day)
injuries.fatalities <- sum(dat$Totalinjuries & dat$Totalfatalities)/dat$day
summarize(grouped.dat, n=n(),
injuries= sum(Totalinjuries),
fatalities= sum(Totalfatalities),
harm.rate=mean(injuries.fatalities) )
I'm not sure what you are trying to do here:
mean = mean( day )
The average of a categorical variable is not defined. It would be like asking for the average of a word. That is why R is giving you NAs as your answer - it is not defined.
See the notes above for Case 1 and Case 2 - you can count categorical variables (observations per group) or take proportions (group 1 is y proportion of the total).
You can apply mathematical operators to numeric vectors:
dat %>%
group_by( day ) %>%
summarize( n.harmful.accidents = sum( Totalinjuries + Totalfatalities ) )
I'm not sure what you are doing here. How are you trying to define your logical vector for accidents that involve either injuries or fatalities?
injuries.fatalities <- sum( dat$Totalinjuries & dat$Totalfatalities )/ dat$day
Note you are not including any criteria here, like:
no.harm <- dat$Totalinjuries == 0 & dat$Totalfatalities == 0
And also you cannot divide by a categorical variable (dat$day). You would want to group_by then summarize.
To find harmful accidents, would you want an AND statement or OR statement above?
Since I am trying to keep going I tried the next question and still did not understand how to store a value for harm.rate.
Try:
grouped.dat <- group_by (dat, day)
d.summary <-
summarize( grouped.dat, n=n(),
injuries= sum(Totalinjuries),
fatalities= sum(Totalfatalities),
harm.rate=mean(injuries.fatalities) )
d.summary
Or more efficiently:
# print in RMD
dat %>%
group_by( day ) %>%
summarize( n=n(),
injuries= sum(Totalinjuries),
fatalities= sum(Totalfatalities),
harm.rate=mean(injuries.fatalities) ) %>%
pander()
# save
d.summary <-
dat %>%
group_by( day ) %>%
summarize(
n=n(),
injuries= sum(Totalinjuries),
fatalities= sum(Totalfatalities),
harm.rate=mean(injuries.fatalities) )
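For the harm rate itself, one option is to write the logical statement directly inside `summarize()` so it is evaluated within each day. The sketch below uses invented data and assumes "harm" means at least one injury or at least one fatality, which is the OR version of the compound statement; `dat.toy` is a hypothetical stand-in for the lab dataset.

```r
library( dplyr )

# toy data with invented values
dat.toy <- data.frame(
  day             = c( "Mon", "Mon", "Mon", "Tue", "Tue" ),
  Totalinjuries   = c( 0, 2, 0, 1, 0 ),
  Totalfatalities = c( 0, 0, 1, 0, 0 ) )

d.summary <-
  dat.toy %>%
  group_by( day ) %>%
  summarize(
    n          = n(),
    injuries   = sum( Totalinjuries ),
    fatalities = sum( Totalfatalities ),
    harm.rate  = mean( Totalinjuries > 0 | Totalfatalities > 0 ) )   # share of accidents with any harm

d.summary
```

Because the logical statement sits inside `summarize()`, no separate `injuries.fatalities` vector has to be stored beforehand.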
Ok, going on to part 2 question 1, I found that the table is doing this:
------------------------------------------------
Mon Tue Wed Thu Fri Sat Sun
------ ------ ------ ------ ------ ------ ------
4094 4656 4711 4814 5006 3044 2145
------------------------------------------------
------------------------------------------------------------------------------
12 AM 1 AM 2 AM 3 AM 4 AM 5 AM 6 AM 7 AM 8 AM 9 AM 10 AM
------- ------ ------ ------ ------ ------ ------ ------ ------ ------ -------
0 0 0 0 0 0 0 0 0 0 0
------------------------------------------------------------------------------
Table: Table continues below
------------------------------------------------------------------------------
11 AM 12 PM 1 PM 2 PM 3 PM 4 PM 5 PM 6 PM 7 PM 8 PM 9 PM
------- ------- ------ ------ ------ ------ ------ ------ ------ ------ ------
0 0 0 0 0 0 0 0 0 0 0
------------------------------------------------------------------------------
Table: Table continues below
---------------
10 PM 11 PM
------- -------
0 0
---------------
When my code is this:
date.vec <- strptime( dat$DateTime, format="%m/%d/%y %H:%M" )
dat$hour <- format( date.vec, format="%H" )
dat$month <- format( date.vec, format="%b" )
dat$day <- format( date.vec, format="%a" )
dat$day365 <- format( date.vec, format="%j" )
dat$week <- format( date.vec, format="%V" )
dat$day <- factor( dat$day, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))
table( dat$day ) %>% pander()
dat$hour12 <- format( date.vec, format="%1 %p" )
time.levels <- c( "12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM",
" 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM",
"12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM",
" 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" )
dat$hour12 <- factor( dat$hour12, levels=time.levels )
table( dat$hour12 ) %>% pander()
This should be a lower-case "L" not a one:
format="%1 %p"
@jmacost5 please be sure to put fences ``` around your code and output in the questions so it is easier to read.
Ok sorry I will try for this question. For part 2 number 2 I looked at your example code and I tried my best to do it with the other variables with the table.
dat %>%
count (age,hour12)%>%
group_by (age) %>%
mutate (n.age=sum(age))%>%
mutate (n.hour=sum(hour12))%>%
mutate(p.age=( n / n.age))
mutate(p.hour= (n/n.hour))
mutate(mean(age+hour12))
summarize(age,hour12,n=n()) %>%
pander()
My next question: are we supposed to make graphs similar to the ones in the last part, or are we supposed to make our own graph?
The sum() function is a math function, so it needs a numeric vector; age and hour12 are categorical. After count(), sum the cell counts instead, n.age=sum(n), to get the total for each age group.
dat %>%
count( age, hour12 ) %>% # this one creates n, the count for each cell
group_by( age ) %>%
mutate( n.age=sum(n) ) %>% # total accidents for each age group
mutate( p.age=( n / n.age ) ) %>%
##############
# data is currently grouped by age, so summing n here gives age totals, not hour totals
# regroup with group_by( hour12 ) before computing n.hour and p.hour
# mutate( n.hour=sum(n) ) %>%
# mutate( p.hour=( n / n.hour ) ) %>%
#############
# mutate( mean( age+hour12 ) ) # no idea what this is supposed to do?
# summarize( age, hour12, n=n() ) %>%
pander()
Stuck on very very first question.
1 How many accidents happen on Mondays?
I can use basic R to answer the question but it's hard to think in terms of dplyr.