DS4PS / cpp-526-fall-2019

Course material for CPP 526 Foundations of Data Science I
http://ds4ps.org/cpp-526-fall-2019

Lab 05 #20

Open sunaynagoel opened 5 years ago

sunaynagoel commented 5 years ago

Stuck on the very first question.

1 How many accidents happen on Mondays?

I can use basic R to answer the question but it's hard to think in terms of dplyr.

lecy commented 5 years ago

The first two questions require logical statements, similar to past labs. You can do this question solely with a logical statement (define group as mondays, count TRUEs).

When you start to get into two dimensional questions (3-5) then dplyr will be necessary.

sunaynagoel commented 5 years ago

Something like this?

sum (dat$day == "Mon", na.rm = T)

sunaynagoel commented 5 years ago

How do I use the function ceiling(x) to round up a number, or is there another way?

lecy commented 5 years ago

That is correct.

Can you please refresh the lab. I updated Part 1 to make this progression a little more clear.

Questions 1-2 are simple logical statements. Question 3 is a compound logical statement. Question 4 is a table (but using count() in dplyr instead of table()). Questions 5-6 are summary statistics over groups.

Answers to questions 5-6 will take the following form.

dat %>% group_by( factor ) %>% summarize( my.stat = formula or logical statement )

1) How many accidents happen on Mondays?
   Sum over a logical statement
2) What proportion of accidents each week occur on Monday?
   Mean of a logical statement
3) What proportion of accidents on Mondays result in harm?
   Compound logical statement
4) What is the most common type of accident (Collisionmanner) that occurs on Mondays?
   Use dplyr's count() function.
5) Are there differences in the proportion of accidents that result in harm each day of the week?
   - Create a table of the proportion of accidents that result in harm each day of the week
   - Use group_by() and summarize()
   - Note you can define custom summary statistics in summarize() using logical statements from above
6) Create a table that reports the following for each day of the week:
   - Number of accidents
   - Number of people hurt in accidents (total injuries)
   - Number of people killed in accidents (total fatalities)
   - Proportion of accidents resulting in harm (injuries + fatalities)

lecy commented 5 years ago

Like this?

> round( 1.5732, 2 )
[1] 1.57
> ceiling( 1.5732 )
[1] 2
> floor( 1.5732 )
[1] 1
sunaynagoel commented 5 years ago

Yes. But it did not work for me; maybe I have a mismatch in the type of data. It just returns 1. I was working on the proportion question. Why do we need to use "mean"? Can't we do it like accidents on Monday / total accidents? It gives a very small number like 0.00000485, and I was trying to present it in a better way.

lecy commented 5 years ago

Are you applying the ceiling to a logical vector?

Yes, the proportion can be calculated as:

these.mondays <- dat$day == "Mon"
sum( these.mondays ) / length( these.mondays )
#OR 
mean( these.mondays)
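
To see why the two forms agree, and where sum and length differ, here is a minimal sketch on made-up data. The vector day below is hypothetical and only stands in for dat$day.

```r
# Hypothetical toy data standing in for dat$day (not the lab dataset)
day <- c("Mon", "Tue", "Mon", "Wed", "Mon", "Fri", "Sat", "Tue")

these.mondays <- day == "Mon"          # logical vector: TRUE for each Monday row

n.mondays <- sum( these.mondays )      # sum counts only the TRUEs
n.total   <- length( these.mondays )   # length counts every row, TRUE or FALSE

prop.1 <- n.mondays / n.total          # proportion the long way
prop.2 <- mean( these.mondays )        # same proportion in one step

round( prop.1, 2 )                     # round for reporting
```

If the column has missing values, add na.rm = TRUE to sum() and mean(). Note that length() still counts the NA rows, so the two forms can then disagree.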
sunaynagoel commented 5 years ago

I think I see my mistake. I always get confused about when to use sum and when to use length.

lecy commented 5 years ago

It's like most things in programming. It's not obvious until it's obvious.

sunaynagoel commented 5 years ago

I can't make my arrange (desc (Collisionmanner)) %>% to work . It does not give any errors but it does not change anything in the output as well.

lecy commented 5 years ago

You don't need a pipe after the function unless you are sending the results to another function. Is that it?

sunaynagoel commented 5 years ago

I have pander () right after that.

lecy commented 5 years ago

Ah, you should arrange on the count, not the name. The names are already in order.

http://ds4ps.org/dp4ss-textbook/p-070-data-verbs.html#arrange-sorts-data

sunaynagoel commented 5 years ago

Still could not get it to work.

dat %>% 
  filter( day =="Mon" ) %>%
  group_by( Collisionmanner ) %>%
  count (Collisionmanner, name = " On Mondays")%>%
  **#arrange (desc()) %>%**
  pander()

What can go in the bold line ?

lecy commented 5 years ago

dplyr names the count variable n so you just need to reference that:

dat %>%
  count( day, Collisionmanner ) %>%
  filter( day == "Mon" ) %>%
  arrange( desc(n) ) %>%
  pander()

Note that in core R the table() function returns a table object rather than a data frame, so you can't filter or sort its cells with dplyr verbs without converting it first.

table( dat$day, dat$Collisionmanner )

This is what makes dplyr quite powerful, also much easier to report specific statistics in small and nicely-formatted tables.
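
As a minimal sketch of sorting on the count column, here is the same pattern on a small invented data frame. The name dat.toy and its values are assumptions for illustration; only the column names mirror the lab data.

```r
library( dplyr )

# Hypothetical toy data; values are invented
dat.toy <- data.frame(
  day = c("Mon", "Mon", "Mon", "Mon", "Tue", "Tue"),
  Collisionmanner = c("Rear End", "Left Turn", "Rear End",
                      "Rear End", "Angle", "Rear End"),
  stringsAsFactors = FALSE )

mon.counts <-
  dat.toy %>%
  filter( day == "Mon" ) %>%
  count( Collisionmanner ) %>%   # adds a count column called n
  arrange( desc(n) )             # sort on the count, largest first

mon.counts
```

If you rename the count column with the name argument of count(), sort on that new name instead of n.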

sunaynagoel commented 5 years ago

Part 2, Question #1: I am able to generate a table with age group and time, but to make it more user-friendly I want to put it in grid form. I tried the spread function and tapply. It's not working out. I am sure I am missing something.

lecy commented 5 years ago

It would be something like this:

dat %>%
    count( day, Collisionmanner ) %>%
    spread( key = Collisionmanner, value = n ) %>% 
  pander()

Or:

dat %>% 
  group_by( f1, f2 ) %>% 
  summarize( my.stat = formula or logical statement ) %>%
  spread( f1, my.stat ) 

In core R this is something like:

tapply( X=these.mondays, INDEX=list(dat$day, dat$Collisionmanner), FUN=mean )
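
Here is a hedged, self-contained sketch of both routes to a grid on invented data (dat.toy is a hypothetical stand-in). The fill = 0 argument is an addition so that empty cells show zero instead of NA.

```r
library( dplyr )
library( tidyr )

# Hypothetical toy data; values are invented
dat.toy <- data.frame(
  day = c("Mon", "Mon", "Tue", "Tue", "Tue", "Wed"),
  Collisionmanner = c("Rear End", "Angle", "Rear End",
                      "Rear End", "Angle", "Angle"),
  stringsAsFactors = FALSE )

# long format: one row per day-by-type combination
long <- dat.toy %>% count( day, Collisionmanner )

# grid format: one row per day, one column per collision type
grid <- long %>% spread( key = Collisionmanner, value = n, fill = 0 )

# core R version: add up a vector of ones within each day-by-type cell
grid.base <- tapply( X = rep( 1, nrow(dat.toy) ),
                     INDEX = list( dat.toy$day, dat.toy$Collisionmanner ),
                     FUN = sum )

grid
grid.base   # empty cells are NA here rather than 0
```

In newer versions of tidyr, spread() is superseded by pivot_wider(), which does the same reshaping with names_from and values_from arguments.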
sunaynagoel commented 5 years ago

It says it can't find the function spread. Do I need to load something for that?

etbartell commented 5 years ago

For part 2 I copied the code provided to format the date and time, and it seems the times are not populating correctly. It is showing all crash events happening at either 1 AM or 1 PM. Here is the code I used:

date.vec <- strptime( dat$DateTime, format="%m/%d/%y %H:%M" )

dat$hour   <- format( date.vec, format="%H" )
dat$month  <- format( date.vec, format="%b" )
dat$day    <- format( date.vec, format="%a" )
dat$day365 <- format( date.vec, format="%j" )
dat$week   <- format( date.vec, format="%V" )

dat$day <- factor( dat$day, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))
table( dat$day ) %>% pander()

dat$hour12 <- format( date.vec, format="%1 %p" )
time.levels <- c("12 AM", "1 AM", "2 AM", "3 AM", "4 AM", "5 AM", "6 AM", "7 AM", "8 AM", "9 AM", "10 AM", "11 AM", "12 PM", "1 PM", "2 PM", "3 PM", "4 PM", "5 PM", "6 PM", "7 PM", "8 PM", "9 PM", "10 PM", "11 PM")
dat$hour12 <- factor( dat$hour12, levels=time.levels )
table( dat$hour12 ) %>% pander()

and here is the table it spits out.

------------------------------------------------------------------------------
 12 AM   1 AM   2 AM   3 AM   4 AM   5 AM   6 AM   7 AM   8 AM   9 AM   10 AM 
------- ------ ------ ------ ------ ------ ------ ------ ------ ------ -------
   0     8989    0      0      0      0      0      0      0      0       0   
------------------------------------------------------------------------------

Table: Table continues below

-------------------------------------------------------------------------------
 11 AM   12 PM   1 PM    2 PM   3 PM   4 PM   5 PM   6 PM   7 PM   8 PM   9 PM 
------- ------- ------- ------ ------ ------ ------ ------ ------ ------ ------
   0       0     19481    0      0      0      0      0      0      0      0   
-------------------------------------------------------------------------------

Is there another setting I forgot to include?

lecy commented 5 years ago

library( tidyr )

lecy commented 5 years ago

@etbartell Did you include:

date.vec <- strptime( dat$DateTime, format="%m/%d/%y %H:%M" )

dat$hour   <- format( date.vec, format="%H" )
dat$month  <- format( date.vec, format="%b" )
dat$day    <- format( date.vec, format="%a" )
dat$day365 <- format( date.vec, format="%j" )
dat$week   <- format( date.vec, format="%V" )

PS - please format your questions correctly using markdown.

etbartell commented 5 years ago

@lecy Yes, I included that code as well. I added all the code I used to the original post above. And sorry, I'll make sure to format next time.

lecy commented 5 years ago

@etbartell i'm not sure, can you send your RMD file via email so I can see all of your steps?

lecy commented 5 years ago

@etbartell Ok, this is a subtle but important difference. 

Here is the code from the lab instructions:

levels=c( "12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM", 
" 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM", 
"12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM", 
" 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" ) 

Here is your code:

time.levels <- c( "12 AM", "1 AM", "2 AM", "3 AM", "4 AM", "5 AM", "6 AM", "7 AM", "8 AM", 
"9 AM", "10 AM", "11 AM", "12 PM", "1 PM", "2 PM", "3 PM", "4 PM", "5 PM", "6 PM", 
"7 PM", "8 PM", "9 PM", "10 PM", "11 PM" )

Do you see the difference? 

R interprets strings literally, so the words "ok" and "OK" and "Ok" are considered distinct. Similarly, " 1 AM" is different than "1 AM". 

The line with the factor() function was turning a character vector into a factor with a specified level order, so that tables and graphs print the times in the correct order. The levels=time.levels argument assigns that order. Any value that does not exactly match one of the levels in time.levels is converted to NA and dropped from the table, including all of the misspellings.

Does that make sense? I am guessing you typed these in by hand?
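
A small sketch of what happens when the levels do not match the values exactly. The vector x below is invented and only imitates the padded labels.

```r
# Invented values, padded with a leading space the way the lab labels are
x <- c(" 7 AM", "10 AM", " 1 PM", "10 AM")

good.levels <- c(" 7 AM", "10 AM", " 1 PM")
bad.levels  <- c("7 AM", "10 AM", "1 PM")    # leading spaces missing

f.good <- factor( x, levels = good.levels )
f.bad  <- factor( x, levels = bad.levels )

table( f.good )          # every value is counted
table( f.bad )           # zero counts for the labels that never matched
sum( is.na( f.bad ) )    # the unmatched values were silently turned into NA
```

Checking sum( is.na() ) right after creating a factor is a quick way to catch this kind of mismatch.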

sunaynagoel commented 5 years ago

I am still trying to make sense of Part 2, #2. Anyone else? I know it is a multi-step calculation, but I have tried various permutations and combinations and nothing seems to be working. This is as close as I could get. This one gives an error in the summarize function, so I am not sure if it will generate the desired output or not.

dat %>%
group_by (age) %>%
mutate (agewise.total.accidents = length(Totalfatalities))%>%
group_by (hour12) %>%
agewise.acciednts <- sum (dat$Totalinjuries>0,na.rm = TRUE) + sum(dat$Totalfatalities > 0,na.rm = TRUE) %>%
summarize( age.prop = mean (agewise.acciednts | agewise.total.accidents )) %>% 
spread(key = age, value = age.prop) %>%
pander()

Error Message - Error in UseMethod("summarise") : no applicable method for 'summarise' applied to an object of class "c('integer', 'numeric')"

lecy commented 5 years ago

OK, remember to start by writing down your data recipe, then translate to code.

Let's work with this sample dataset:

f1 f2 x
A L-01 1
A L-01 1
A L-02 1
A L-02 1
A L-02 1
A L-03 1
B L-01 1
B L-01 1
B L-03 1
B L-03 1

Instead of age and hour12, we will use factors f1 and f2. We want to know the proportion of each f2 within f1.

My data recipe, then, will be:

1) Calculate the frequency of each f1-f2 combination (each cell in our table). 2) Calculate the frequency of each f1 group. 3) Combine steps 1-2 into a single dataset 4) Divide each f1-f2 group count by the f1 group count.

Step 1: count f1-f2 groups

d %>% count( f1, f2 )
# or
d %>% group_by( f1, f2 ) %>% summarize( n=n() )
f1 f2 n
A L-01 2
A L-02 3
A L-03 1
B L-01 2
B L-03 2

Step 2: count f1 groups

d %>% count( f1 )
# or 
d %>% group_by( f1 ) %>% summarize( n.f1 = n() )
f1 n
A 6
B 4

Step 3: Combine steps 1-2 into a single dataset

The problem with Step 2 is that count() and summarize() drop all columns except the grouping variables and the summary stats. We know, however, that mutate() creates new variables and keeps the rest of the dataset. So the trick is to replace summarize with mutate.

d %>% count( f1, f2 ) %>% group_by( f1 ) %>% mutate( n.f1 = n() )
f1 f2 n n.f1
A L-01 2 3
A L-02 3 3
A L-03 1 3
B L-01 2 2
B L-03 2 2

That's not quite right because n() is counting the rows of the summary table, not the original data. We know that there are 6 rows with f1 = A, not 3. Each row of the summary table represents n rows in the original dataset. So we need to sum over n, not count rows.

d %>% count( f1, f2 ) %>% group_by( f1 ) %>% mutate( n.f1 = sum(n) )
f1 f2 n n.f1
A L-01 2 6
A L-02 3 6
A L-03 1 6
B L-01 2 4
B L-03 2 4

That looks better!

Step 4: Divide each f1-f2 group count by the f1 group count.

The last step is then to divide (f1-f2) / f1.

d %>% 
  count( f1, f2 ) %>% 
  group_by( f1 ) %>% 
  mutate( n.f1 = sum(n) ) %>% 
  mutate( prop= n / n.f1 )
f1 f2 n n.f1 prop
A L-01 2 6 0.33
A L-02 3 6 0.50
A L-03 1 6 0.17
B L-01 2 4 0.50
B L-03 2 4 0.50

We can then filter by a specific factor to answer a question like what age of driver is most dangerous at 7am?

d %>% 
  count( f1, f2 ) %>% 
  group_by( f1 ) %>% 
  mutate( n.f1 = sum(n) ) %>% 
  mutate( prop= round( n / n.f1, 2 ) ) %>% 
  filter( f2 == "L-01" )
f1 f2 n n.f1 prop
A L-01 2 6 0.33
B L-01 2 4 0.50
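
To run the recipe end to end, here is a self-contained sketch that rebuilds the sample dataset d from the listing above. The final check (proportions summing to 1 within each f1) is an addition for verification, not part of the recipe.

```r
library( dplyr )

# the sample dataset from above, typed in by hand
d <- data.frame(
  f1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
  f2 = c("L-01", "L-01", "L-02", "L-02", "L-02", "L-03",
         "L-01", "L-01", "L-03", "L-03"),
  x  = 1,
  stringsAsFactors = FALSE )

props <-
  d %>%
  count( f1, f2 ) %>%               # step 1: count each f1-f2 cell
  group_by( f1 ) %>%
  mutate( n.f1 = sum(n) ) %>%       # steps 2-3: f1 totals kept next to the cell counts
  mutate( prop = n / n.f1 ) %>%     # step 4: within-f1 proportion
  ungroup()

# sanity check: proportions should sum to 1 within each f1 group
props %>% group_by( f1 ) %>% summarize( total = sum(prop) )
```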
lecy commented 5 years ago

Your results should look something like this, with p.age representing the within-age proportions.

The p.age answers the question, who are the most dangerous drivers in the morning?

The p.hour answers the question, who is causing the most accidents at 7am (of all accidents at 7am, what proportion is each age group responsible for)?

age hour12 n n.age n.hour p p.age p.hour
16-18 7 AM 74 1391 1529 0.05 0.05 0.05
18-25 7 AM 397 8400 1529 0.26 0.05 0.26
25-35 7 AM 353 5194 1529 0.23 0.07 0.23
35-45 7 AM 238 3085 1529 0.16 0.08 0.16
45-55 7 AM 169 2528 1529 0.11 0.07 0.11
55-65 7 AM 109 1777 1529 0.07 0.06 0.07
65-75 7 AM 38 918 1529 0.02 0.04 0.02
75-100 7 AM 15 538 1529 0.01 0.03 0.01

image

Maybe this is not surprising when we consider:

sunaynagoel commented 5 years ago

For some reason filter ( hour12 == "7 AM" ) %>% is not working. I tried different times and noticed that it works for all the times with two digits (e.g., 10 AM, 11 AM, 10 PM) and not for any time with a single digit. Not sure what the reason behind this is.

lecy commented 5 years ago

You need to be really careful with strings. It is:

filter ( hour12 == " 7 AM" ) %>% 
sunaynagoel commented 5 years ago

I see the difference now. I feel so silly for overlooking this.

lecy commented 5 years ago

It's subtle!

JaesaR commented 5 years ago

I have a question on Part 1: Question 6. The code I am using is this:

grouped.dat <- group_by (dat, day)
summarize(grouped.dat, n=n(), 
    injuries= sum(dat$Totalinjuries), 
    fatalities=sum(dat$Totalfatalities), 
    harm.rate=mean(injuries.fatalities) )

The table that it's returning had the correct number for n, grouped by days, but the other columns seem to be sums of all the days. So like each day has a total of 69 fatalities. Why aren't the last three columns breaking down by each day individually?

lecy commented 5 years ago

Note that dplyr functions have you reference the data set first, then variable names directly.

summarize( dat, total.x=sum( x ) )

You are using core R conventions by referencing your vectors with dat$, which means it's using the original dataset dat instead of your grouped dataset grouped.dat inside of the sum functions.

summarize( dat, total.x=sum( dat$x ) )

See the difference?
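
A minimal sketch of the difference on invented data (dat.toy and grouped.toy are hypothetical names):

```r
library( dplyr )

# Hypothetical toy data
dat.toy <- data.frame( day = c("Mon", "Mon", "Tue"),
                       x   = c(1, 2, 10),
                       stringsAsFactors = FALSE )

grouped.toy <- group_by( dat.toy, day )

# bare variable name: sum() sees only the rows in each group
by.group <- summarize( grouped.toy, total.x = sum( x ) )

# dat.toy$x: sum() sees the full ungrouped column every time
whole <- summarize( grouped.toy, total.x = sum( dat.toy$x ) )

by.group   # a different total for each day
whole      # the grand total repeated for each day
```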

jmacost5 commented 5 years ago

I have a question, am I on the right track with this code or am I not understanding what is happening? This is for question 5 part 1 Code:

dat$day <- factor( dat$day, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun") )
table( dat$day ) %>% pander()
dat %>% group_by( factor ) %>% summarize( my.stat = mean() )
jmacost5 commented 5 years ago

This is also another code I tried doing for question 5 part 1, am I not getting this?

dat$day <- factor( dat$day, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun") )
table( dat$day ) %>% pander()
my.stat<- mean(dat$day)
dat %>% group_by( my.stat ) %>% summarize( my.stat )
lecy commented 5 years ago

Let me say it this way. The exercise is to help you think about what type of table you are creating:

  1. count of things by group
  2. proportion of things by group
  3. stat calculated from variable x for each level of group f1

Cases 1 and 2 are both variations of counts of rows that occur within each group (a proportion is a count of a group divided by a total count - but still just counting rows).

Case 3 is a little distinct because you first create a group, then calculate a statistic for the subset of data that belongs to each level in the group.

weekends <- dat$day == "Sat" | dat$day == "Sun"

# CASE 1 - COUNTS

sum( weekends )  # number of accidents (rows) in dataset falling on weekends 
# counts of all days at once, core R and dplyr 
table( dat$day )
count( dat, day )

# CASE 2 - PROPORTIONS 

mean( weekends ) # proportion of dataset that falls on the weekend 
# proportions of days
table( dat$day ) / nrow(dat)
dat %>% group_by( day ) %>% summarize( prop.day = n() / nrow(dat) )

# CASE 3 - SUMMARY STATS OVER GROUPS

dat %>% group_by( day ) %>% summarize( mean.x = mean(x) )

# over user-defined group
dat %>%
  mutate( weekend = ( day == "Sat" | day == "Sun" ) ) %>%
  group_by( weekend ) %>%
  summarize( mean.x = mean(x) )

Note that Case 1 and Case 2 are only working with factors (days, weekends). If that's the case you will tend to use table() or count().

Case 3 is working with both a factor and a numeric vector. You group by the factor, and calculate the statistic with your vector x. This is where group_by() and summarize() become really powerful.

lecy commented 5 years ago

@jmacost5 See the explanation on Case 1, 2, and 3 above.

I think you are conflating Case 2 and Case 3:

my.stat<- mean(dat$day)

Day is a categorical variable, so you would either count the number of cases in each group, or look at proportions of cases within each group.

You cannot calculate the mean of a group. You can look at a proportion if you first translate the group to a logical vector, though.

mean( dat$day == "Wed" )
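
A quick sketch of the two cases on an invented factor. suppressWarnings() is used only to keep the output quiet; mean() on a factor warns and returns NA.

```r
# Invented factor standing in for dat$day
day <- factor( c("Mon", "Wed", "Wed", "Fri"), levels = c("Mon", "Wed", "Fri") )

m.bad <- suppressWarnings( mean( day ) )   # NA: the mean of a category is not defined
p.wed <- mean( day == "Wed" )              # proportion of rows that are Wednesdays

table( day ) / length( day )               # proportions for all levels at once
```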
etbartell commented 5 years ago

I understand what you did for Part 2, but now I'm having trouble putting that together with the plot function since it is not part of dplyr. If I try to create a dataset with just the counts from each time, then feed it into the plot function it gives me a whole lot of points (I'm guessing every point in the original dataset). I don't get how the summary dataset from dplyr gets converted into a graphic.

dat %>%
  group_by( hour12 ) %>%
  summarize( n.time = n() )

plot( as.numeric(dat$hour12), dat$n.time, 
      pch=19, type="p", cex=1, bty="n",
      xlab="Hour of the Day", 
      ylab="Total Injuries or Fatalities",
      main="Total Injuries or Fatalities by Hour of the Day")
jmacost5 commented 5 years ago

I guess I am confused because I am not getting numbers

Mon | NA |   |   |  
Tue | NA |   |   |  
Wed | NA |   |   |  
Thu | NA |   |   |  
Fri | NA |   |   |  
Sat | NA |   |   |  
Sun | NA

^ the variable x is just a placeholder. You need to use a real variable in your dataset.

lecy commented 5 years ago

The plot() function requires two numeric vectors. I see you are trying to convert hour12 to numeric, which is fine since the factor is ordered, so 12AM will be 1, 1AM will be 2, 2AM will be 3, etc.

Your problem is that you are trying to plot the full dataset instead of the summary table. Try something like:

d2 <- 
dat %>%
  count( hour12 )

plot( as.numeric(d2$hour12), d2$n )

Or you can use the 24-hour version of time (dat$hour, converted with as.numeric()) so that each group is a number from 0 to 23; then you don't have to worry about the factor-to-numeric conversion going wrong.
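
A self-contained sketch of plotting the summary table rather than the full dataset. The data frame dat.toy and its three hour labels are invented; the axis() call is an optional addition that puts the hour labels back on the x-axis.

```r
library( dplyr )

# Hypothetical toy data with three hour labels, ordered as factor levels
hours   <- c("12 AM", " 1 AM", " 2 AM")
dat.toy <- data.frame(
  hour12 = factor( c(" 1 AM", " 1 AM", " 2 AM", "12 AM", " 2 AM", " 2 AM"),
                   levels = hours ) )

d2 <- dat.toy %>% count( hour12 )   # summary table: one row per hour

# plot the summary table, not the original rows
plot( as.numeric( d2$hour12 ), d2$n,
      type = "b", pch = 19, bty = "n", xaxt = "n",
      xlab = "Hour of the Day", ylab = "Number of Accidents" )
axis( side = 1, at = as.numeric( d2$hour12 ), labels = as.character( d2$hour12 ) )
```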

jmacost5 commented 5 years ago
dat %>% group_by( day ) %>% summarize( mean = mean(day, na.rm = T) ) 

This is my code. I guess I am not understanding what I should be putting in the chunk because I keep getting NA for the answers

jmacost5 commented 5 years ago

Since I am trying to keep going I tried the next question and still did not understand how to store a value for harm.rate.

grouped.dat <- group_by (dat, day)
injuries.fatalities <- sum(dat$Totalinjuries & dat$Totalfatalities)/dat$day
summarize(grouped.dat, n=n(), 
    injuries= sum(Totalinjuries), 
    fatalities= sum(Totalfatalities),
    harm.rate=mean(injuries.fatalities) )
lecy commented 5 years ago

I'm not sure what you are trying to do here:

mean = mean( day )

The average of a categorical variable is not defined. It would be like asking for the average of a word. That is why R is giving you NAs as your answer - it is not defined.

See the notes above for Case 1 and Case 2 - you can count categorical variables (observations per group) or take proportions (group 1 is y proportion of the total).

You can apply mathematical operators to numeric vectors:

dat %>% 
  group_by( day ) %>% 
  summarize( n.people.harmed = sum( Totalinjuries + Totalfatalities ) ) 

I'm not sure what you are doing here. How are you trying to define your logical vector for accidents that involve either injuries or fatalities?

injuries.fatalities <- sum( dat$Totalinjuries & dat$Totalfatalities )/ dat$day

Note you are not including any criteria here, like:

no.harm <- dat$Totalinjuries == 0 & dat$Totalfatalities == 0

And also you cannot divide by a categorical variable (dat$day). You would want to group_by then summarize.

To find harmful accidents, would you want an AND statement or OR statement above?

lecy commented 5 years ago

> Since I am trying to keep going I tried the next question and still did not understand how to store a value for harm.rate.

Try:

grouped.dat <- group_by (dat, day)

d.summary <-
summarize( grouped.dat, n=n(), 
    injuries= sum(Totalinjuries), 
    fatalities= sum(Totalfatalities),
    harm.rate=mean( Totalinjuries > 0 | Totalfatalities > 0 ) )

d.summary

Or more efficiently:

# print in RMD
dat %>%  
  group_by( day ) %>%
  summarize(  n=n(), 
    injuries= sum(Totalinjuries), 
    fatalities= sum(Totalfatalities),
    harm.rate=mean( Totalinjuries > 0 | Totalfatalities > 0 ) ) %>%
  pander()

# save
d.summary <- 
dat %>%  
  group_by( day ) %>%
  summarize(  
    n=n(), 
    injuries= sum(Totalinjuries), 
    fatalities= sum(Totalfatalities),
    harm.rate=mean( Totalinjuries > 0 | Totalfatalities > 0 )  )
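
Here is a hedged sketch of the saved summary on invented data. It assumes "harm" means at least one injury or at least one fatality, which is one reading of the question. The helper column harm and the data frame dat.toy are hypothetical.

```r
library( dplyr )

# Hypothetical toy data; values are invented
dat.toy <- data.frame(
  day             = c("Mon", "Mon", "Mon", "Tue", "Tue"),
  Totalinjuries   = c(0, 2, 0, 1, 0),
  Totalfatalities = c(0, 0, 1, 0, 0),
  stringsAsFactors = FALSE )

d.summary <-
  dat.toy %>%
  mutate( harm = Totalinjuries > 0 | Totalfatalities > 0 ) %>%  # assumed definition of harm
  group_by( day ) %>%
  summarize( n          = n(),
             injuries   = sum( Totalinjuries ),
             fatalities = sum( Totalfatalities ),
             harm.rate  = mean( harm ) )   # proportion of accidents with any harm

d.summary
```

Because harm is a column of the data frame, mean( harm ) is evaluated separately within each day.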
jmacost5 commented 5 years ago

Ok, going on to part 2 question 1, I found that the table is doing this:

------------------------------------------------
 Mon    Tue    Wed    Thu    Fri    Sat    Sun  
------ ------ ------ ------ ------ ------ ------
 4094   4656   4711   4814   5006   3044   2145 
------------------------------------------------

------------------------------------------------------------------------------
 12 AM   1 AM   2 AM   3 AM   4 AM   5 AM   6 AM   7 AM   8 AM   9 AM   10 AM 
------- ------ ------ ------ ------ ------ ------ ------ ------ ------ -------
   0      0      0      0      0      0      0      0      0      0       0   
------------------------------------------------------------------------------

Table: Table continues below

------------------------------------------------------------------------------
 11 AM   12 PM   1 PM   2 PM   3 PM   4 PM   5 PM   6 PM   7 PM   8 PM   9 PM 
------- ------- ------ ------ ------ ------ ------ ------ ------ ------ ------
   0       0      0      0      0      0      0      0      0      0      0   
------------------------------------------------------------------------------

Table: Table continues below

---------------
 10 PM   11 PM 
------- -------
   0       0   
---------------

When my code is this:

date.vec <- strptime( dat$DateTime, format="%m/%d/%y %H:%M" )

dat$hour   <- format( date.vec, format="%H" )
dat$month  <- format( date.vec, format="%b" )
dat$day    <- format( date.vec, format="%a" )
dat$day365 <- format( date.vec, format="%j" )
dat$week   <- format( date.vec, format="%V" )

dat$day <- factor( dat$day, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))
table( dat$day ) %>% pander()

dat$hour12 <- format( date.vec, format="%1 %p" )
time.levels <- c( "12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM", 
" 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM", 
"12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM", 
" 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" ) 
dat$hour12 <- factor( dat$hour12, levels=time.levels )
table( dat$hour12 ) %>% pander()
lecy commented 5 years ago

In your format string this should be a lower-case "L" (%l), not the digit one (%1):

 format="%l %p"
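
Support for some conversion specifications can vary by platform (see ?strptime), and %p depends on the locale. If either gives you trouble, one hedged alternative is to build the same padded labels from the 24-hour value. The sample timestamps below are invented.

```r
# Invented timestamps in the same format as dat$DateTime
date.vec <- strptime( c("1/7/19 7:15", "1/7/19 13:40", "1/7/19 0:05"),
                      format = "%m/%d/%y %H:%M", tz = "UTC" )

h24 <- as.integer( format( date.vec, format = "%H" ) )   # 0 to 23

# convert to a 12-hour clock: 0 and 12 both display as 12
h12 <- ifelse( h24 %% 12L == 0L, 12L, h24 %% 12L )

# "%2d" pads single digits with a leading space, matching the lab's level labels
hour12 <- sprintf( "%2d %s", h12, ifelse( h24 < 12L, "AM", "PM" ) )
hour12
```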
lecy commented 5 years ago

@jmacost5 please be sure to put fences ``` around your code and output in the questions so it is easier to read.

jmacost5 commented 5 years ago

Ok sorry I will try for this question. For part 2 number 2 I looked at your example code and I tried my best to do it with the other variables with the table.

dat %>%
count (age,hour12)%>%
group_by (age) %>%
mutate (n.age=sum(age))%>%
mutate (n.hour=sum(hour12))%>%
mutate(p.age=( n / n.age))
mutate(p.hour= (n/n.hour))
mutate(mean(age+hour12))
summarize(age,hour12,n=n()) %>% 
pander()
jmacost5 commented 5 years ago

My next question is: are we supposed to make graphs similar to the ones in the last part, or are we supposed to make our own graph?

lecy commented 5 years ago

The sum() function is a math function so it is used with a numeric vector, and age and hour12 are factors. As in the f1-f2 example above, I think you want n.age=sum(n) to add up the cell counts within each age group.

dat %>%
count( age, hour12 ) %>%  # this one creates n, the count for each cell
group_by( age ) %>%
mutate( n.age=sum(n) ) %>%
mutate( p.age=( n / n.age ) ) %>% 
##############
# data is currently grouped by age, so regroup by hour12 before computing n.hour
group_by( hour12 ) %>%
mutate( n.hour=sum(n) ) %>%
mutate( p.hour=( n / n.hour ) ) %>% 
#############
# mutate( mean( age+hour12 ) )   # no idea what this is supposed to do? 
# summarize( age, hour12, n=n() ) %>% 
ungroup() %>%
pander()
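
To check the logic on something small, here is a sketch that computes both proportions on invented data. The data frame dat.toy and its values are hypothetical; only the column names mirror the lab.

```r
library( dplyr )

# Hypothetical toy data; values are invented
dat.toy <- data.frame(
  age    = c("16-18", "16-18", "16-18", "18-25", "18-25"),
  hour12 = c(" 7 AM", " 7 AM", " 8 AM", " 7 AM", " 8 AM"),
  stringsAsFactors = FALSE )

tab <-
  dat.toy %>%
  count( age, hour12 ) %>%                          # n: count for each age-hour cell
  group_by( age ) %>%
  mutate( n.age = sum(n), p.age = n / n.age ) %>%   # within-age proportions
  group_by( hour12 ) %>%                            # regroup before the hour totals
  mutate( n.hour = sum(n), p.hour = n / n.hour ) %>%
  ungroup()

tab
```

A quick check is that p.age sums to 1 within each age group and p.hour sums to 1 within each hour.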