Open jholwegn opened 3 years ago
I'm getting the same values @jholwegn reports for 7AM with this code:
dat %>% count(hour12, age)
Here are the results for 7AM:
But in the template for this week, it says:
Is our code really meant to produce those values for 7AM, or is this just an example of what the table should look like?
Here's what I'm getting when importing dat using the reformatted .rmd for Lab 05. You should get the same output:
library(dplyr) # provides %>%, filter(), and count() used below
url <- paste0("https://github.com/DS4PS/Data-Science-Class/blob",
"/master/DATA/TempeTrafficAccidents.rds?raw=true")
dat <- readRDS(gzcon(url(url)))
date.vec <- strptime(dat$DateTime,
format = "%m/%d/%y %H:%M")
dat$hour12 <- format(date.vec,
format="%l %p")
time.levels <- c("12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM",
" 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM",
"12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM",
" 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" )
dat$hour12 <- factor(dat$hour12,
levels = time.levels) # Order time intervals
age.labels <- paste0("Age ",
c(16,18,25,35,45,55,65,75), "-",
c(18,25,35,45,55,65,75,100) )
dat$age <- cut(dat$Age_Drv1,
breaks = c(16,18,25,
35,45,55,
65,75,100),
labels = age.labels)
dat %>%
filter(hour12 == " 7 AM") %>%
count(hour12, age)
  hour12        age   n
1   7 AM  Age 16-18  77
2   7 AM  Age 18-25 408
3   7 AM  Age 25-35 371
4   7 AM  Age 35-45 243
5   7 AM  Age 45-55 175
6   7 AM  Age 55-65 116
7   7 AM  Age 65-75  39
8   7 AM Age 75-100  17
9   7 AM       <NA> 160
Is dat being changed somewhere in your work?
P.S. all the preprocessing is done in the code chunk on Line 87!
@jamisoncrawford, I think I've figured out the discrepancy. In your code above, you use factor on dat$hour12 this way:
dat$hour12 <- factor(dat$hour12,
levels = time.levels)
But in the template, it is used with "labels" instead of "levels":
dat$hour12 <- factor(dat$hour12,
labels = time.levels)
When I run it as you indicate above, with "levels," I get the numbers you provide above. But when I run the template as provided, with "labels," I get the numbers jholwegn and I report above.
So the question is: which one is producing the correct output?
Here's the code I used to test this:
library(dplyr) # provides %>%, filter(), and count() used below
# READ IN DATA
url <- paste0("https://github.com/DS4PS/Data-Science-Class/blob",
"/master/DATA/TempeTrafficAccidents.rds?raw=true")
dat <- readRDS(gzcon(url(url))) # Method per instructions
date.vec <- strptime(dat$DateTime,
format = "%m/%d/%y %H:%M")
dat$hour12 <- format(date.vec,
format="%l %p")
# Code from template -- produces no rows here; do not use
# time.levels <- c("12 AM", paste(1:11, "AM"),
# "12 PM", paste(1:11, "PM"))
# Code from @jamisoncrawford
time.levels <- c("12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM",
" 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM",
"12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM",
" 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" )
#
# Code from template
dat$hour12 <- factor(dat$hour12,
labels = time.levels)
# Code from @jamisoncrawford
# dat$hour12 <- factor(dat$hour12,
# levels = time.levels) # Order time intervals
age.labels <- paste0("Age ",
c(16,18,25,35,45,55,65,75), "-",
c(18,25,35,45,55,65,75,100) )
dat$age <- cut(dat$Age_Drv1,
breaks = c(16,18,25,
35,45,55,
65,75,100),
labels = age.labels)
dat %>%
filter(hour12 == " 7 AM") %>%
count(hour12, age)
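For anyone following along, here is a minimal sketch of why the two factor() calls can disagree. It uses a made-up toy vector rather than the Tempe data, with zero-padded hour strings (an assumption chosen only so that the default sort order is the same on every system). With levels =, each value is matched to the level with the same text. With labels = and no levels =, R takes the sorted unique values as the levels and then renames them by position, which likely explains the shifted counts above.

```r
# Toy stand-in for dat$hour12 (hypothetical values, not the real data)
x <- c("07 AM", "07 AM", "08 AM", "12 AM")

# Desired chronological order of the hours
chrono <- c("12 AM", "07 AM", "08 AM")

# levels = : values keep their own text; only the ordering changes
by.levels <- factor(x, levels = chrono)

# labels = : the default levels are sort(unique(x)), i.e.
# "07 AM", "08 AM", "12 AM", and the labels are pasted on by position,
# so every "07 AM" value is renamed "12 AM", and so on
by.labels <- factor(x, labels = chrono)

table(by.levels) # "07 AM" is counted twice, as in x
table(by.labels) # "07 AM" is counted once: those rows were really "08 AM"
```

In this toy case the count reported for "07 AM" under labels = actually belongs to a different hour, which is the same kind of mismatch as the two sets of 7 AM numbers in this thread.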
Hm, apologies, as I was under the impression the preprocessing code I used was the same as the .rmd template y'all are using.
Here's a quicker version to process age 18-25 for 7 AM:
library(dplyr)
library(stringr)
library(lubridate)
url <- paste0("https://github.com/DS4PS/Data-Science-Class/blob",
"/master/DATA/TempeTrafficAccidents.rds?raw=true")
dat <- readRDS(gzcon(url(url)))
dat %>%
mutate(dt = mdy_hm(DateTime),
hour = hour(dt)) %>%
filter(Age_Drv1 > 18,
Age_Drv1 <= 25,
hour == 7) %>%
count()
Whichever method reproduces this figure should be the one to go with!
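The same cross-check can be sketched in base R without any factor recoding, so there is nothing to mislabel. The toy data frame below is invented for illustration (only the column names DateTime and Age_Drv1 and the date format come from the code above), and tz = "UTC" is an assumption added to keep the parsing independent of the local time zone.

```r
# Hypothetical rows in the same "%m/%d/%y %H:%M" format as dat$DateTime
toy <- data.frame(
  DateTime = c("1/10/12 7:15", "1/10/12 7:45", "1/11/12 19:05", "1/12/12 7:30"),
  Age_Drv1 = c(19, 25, 22, 40)
)

# strptime() returns POSIXlt, whose $hour component is the 24-hour clock hour
dt <- strptime(toy$DateTime, format = "%m/%d/%y %H:%M", tz = "UTC")
toy$hour <- dt$hour

# Count drivers older than 18 and up to 25 in the 7 AM hour,
# matching the (18, 25] interval that cut() produces by default
n.check <- sum(toy$hour == 7 & toy$Age_Drv1 > 18 & toy$Age_Drv1 <= 25)
n.check
```

Running the same three steps on dat instead of toy should give a number to compare against the "Age 18-25" row for 7 AM.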
@jamisoncrawford, I changed the preprocessing code to the one that you provided and that brought up the correct n values!
@mtwelker Thanks for identifying the discrepancy & testing it :)
@mtwelker has been a treasure 🤣 🔥
I've run into this same problem, and tried a million different combinations of everything. So for the good of the order, if you try this and it doesn't work, this is what worked for me:
I'm not getting the expected n values (77, 408, 371, etc.) for Lab 05 Part 2 Question 2.
Here is my current code:
dat %>%
  group_by(age, hour12) %>%
  summarize(n = n(), n.hour = 1606, p = round(n / n.hour, 2)) %>%
  mutate(n.age = sum(n)) %>%
  group_by(age) %>%
  mutate(p.hour = round(n / n.hour, 2)) %>%
  mutate(p.age = round(n / n.age, 2)) %>%
  filter(hour12 == "7 AM")
The n values I'm currently getting when I run this code are 157, 835, 563, 314, 280, 184, 83, 46, 296. Any help is appreciated.