DS4PS / cpp-526-sum-2021

Course shell for CPP 526.
https://ds4ps.org/cpp-526-sum-2021/
MIT License

Lab 05 Part 2 Question 2 - n values #33

Open jholwegn opened 3 years ago

jholwegn commented 3 years ago

I'm not getting the expected n values (77, 408, 371, etc.) for Lab 05 Part 2 Question 2.

Here is my current code:

dat %>% 
  group_by(age, hour12) %>% 
  summarize(n = n(), n.hour = 1606, p = round(n / n.hour, 2)) %>% 
  mutate(n.age = sum(n)) %>% 
  group_by(age) %>% 
  mutate(p.hour = round(n / n.hour, 2)) %>% 
  mutate(p.age = round(n / n.age, 2)) %>% 
  filter(hour12 == "7 AM")

The n values I'm currently getting when I run this code are 157, 835, 563, 314, 280, 184, 83, 46, 296. Any help is appreciated.

mtwelker commented 3 years ago

I'm getting the same values @jholwegn reports for 7 AM with this code:

dat %>% count(hour12, age)

Here are the results for 7 AM: [image]

But the template for this week shows: [image]

Is our code really meant to produce those values for 7AM, or is this just an example of what the table should look like?

jamisoncrawford commented 3 years ago

Here's what I'm getting when importing dat using the reformatted .rmd for Lab 05. You should get the same output:

url <- paste0("https://github.com/DS4PS/Data-Science-Class/blob",
              "/master/DATA/TempeTrafficAccidents.rds?raw=true")

dat <- readRDS(gzcon(url(url)))

date.vec <- strptime(dat$DateTime, 
                     format = "%m/%d/%y %H:%M")

dat$hour12 <- format(date.vec, 
                     format="%l %p")

time.levels <- c("12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM", 
                 " 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM", 
                 "12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM", 
                 " 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" )

dat$hour12 <- factor(dat$hour12, 
                     levels = time.levels) # Order time intervals

age.labels <- paste0("Age ", 
                     c(16,18,25,35,45,55,65,75), "-", 
                     c(18,25,35,45,55,65,75,100) )

dat$age <- cut(dat$Age_Drv1, 
               breaks = c(16,18,25,
                          35,45,55,
                          65,75,100), 
               labels = age.labels)

dat %>% 
  filter(hour12 == " 7 AM") %>% 
  count(hour12, age)

  hour12        age   n
1   7 AM  Age 16-18  77
2   7 AM  Age 18-25 408
3   7 AM  Age 25-35 371
4   7 AM  Age 35-45 243
5   7 AM  Age 45-55 175
6   7 AM  Age 55-65 116
7   7 AM  Age 65-75  39
8   7 AM Age 75-100  17
9   7 AM       <NA> 160

Is dat being changed somewhere in your work?

jamisoncrawford commented 3 years ago

P.S. all the preprocessing is done in the code chunk on Line 87!

mtwelker commented 3 years ago

@jamisoncrawford, I think I've figured out the discrepancy. In your code above, you use factor on dat$hour12 this way:

dat$hour12 <- factor(dat$hour12, 
                     levels = time.levels)

But in the template, it is used with "labels" instead of "levels":

dat$hour12 <- factor(dat$hour12, 
                      labels = time.levels)  

When I run it as you indicate above, with "levels," I get the numbers you provide. But when I run the template as provided, with "labels," I get the numbers jholwegn and I reported earlier.

So the question is: which one is producing the correct output?
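A likely explanation, sketched on toy data (the vectors x and lv below are made up for illustration, not taken from the lab): when levels is omitted, factor() uses the sorted unique values of its input as the levels and then applies labels to them by position. Passing chronological names through labels therefore renames the sorted hours, while levels matches each value to its own name.

```r
# Toy hour labels (hypothetical), padded like the output of format(..., "%l %p")
x  <- c(" 7 AM", " 7 AM", " 7 AM", " 8 AM", " 8 AM", " 1 PM")
lv <- c(" 7 AM", " 8 AM", " 1 PM")     # intended chronological order

by.levels <- factor(x, levels = lv)    # each value keeps its own name
by.labels <- factor(x, labels = lv)    # sort(unique(x)) is renamed by position

sort(unique(x))                        # " 1 PM" " 7 AM" " 8 AM"
sum(by.levels == " 7 AM")              # 3, the true 7 AM count
sum(by.labels == " 7 AM")              # 1, actually the " 1 PM" observations
```

If the same thing is happening with the template, the rows labelled 7 AM there would hold counts from a different hour, which would account for the mismatch in n.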

mtwelker commented 3 years ago

Here's the code I used to test this:

# READ IN DATA

url <- paste0("https://github.com/DS4PS/Data-Science-Class/blob",
              "/master/DATA/TempeTrafficAccidents.rds?raw=true")

dat <- readRDS(gzcon(url(url)))     # Method per instructions

date.vec <- strptime(dat$DateTime, 
                     format = "%m/%d/%y %H:%M") 

dat$hour12 <- format(date.vec, 
                     format="%l %p")

# Code from template -- produces no rows here; do not use
# time.levels <- c("12 AM", paste(1:11, "AM"), 
#                  "12 PM", paste(1:11, "PM"))

# Code from @jamisoncrawford
time.levels <- c("12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM", 
                " 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM", 
                "12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM", 
                " 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" )
# 

# Code from template
dat$hour12 <- factor(dat$hour12, 
                    labels = time.levels)   

# Code from @jamisoncrawford
# dat$hour12 <- factor(dat$hour12, 
#                    levels = time.levels) # Order time intervals

age.labels <- paste0("Age ", 
                     c(16,18,25,35,45,55,65,75), "-", 
                     c(18,25,35,45,55,65,75,100) )

dat$age <- cut(dat$Age_Drv1, 
               breaks = c(16,18,25,
                          35,45,55,
                          65,75,100), 
               labels = age.labels)

dat %>% 
  filter(hour12 == " 7 AM") %>% 
  count(hour12, age)

jamisoncrawford commented 3 years ago

Hm, apologies; I was under the impression that the preprocessing code I used was the same as in the .rmd template y'all are using.

Here's a quicker way to check the count for ages 18-25 at 7 AM:

library(dplyr)
library(lubridate)  # mdy_hm(), hour()

url <- paste0("https://github.com/DS4PS/Data-Science-Class/blob",
              "/master/DATA/TempeTrafficAccidents.rds?raw=true")

dat <- readRDS(gzcon(url(url)))

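dat %>% 
  mutate(dt = mdy_hm(DateTime),    # parse "month/day/year hour:minute"
         hour = hour(dt)) %>%      # hour of day, 0-23
  filter(Age_Drv1 > 18,            # same bounds as the (18, 25] bin from cut()
         Age_Drv1 <= 25,
         hour == 7) %>%            # 7:00-7:59 AM
  count()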

Whichever method reproduces this figure should be the one to go with!
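One way to run that comparison, sketched in base R on a made-up data frame (toy and its values are hypothetical; the breaks and labels are the ones used earlier in this thread). It counts drivers older than 18 and up to 25 at hour 7 in two ways: directly from the raw columns, and through the cut() bins, which are right-closed by default.

```r
# Hypothetical stand-in for dat; only the two columns used here
toy <- data.frame(
  DateTime = c("1/10/12 7:05", "1/10/12 7:40", "1/11/12 19:15",
               "1/12/12 7:59", "1/13/12 7:30"),
  Age_Drv1 = c(19, 25, 22, 40, 18),
  stringsAsFactors = FALSE
)

dt <- strptime(toy$DateTime, format = "%m/%d/%y %H:%M", tz = "UTC")
hour24 <- dt$hour                       # 0-23, no factor involved

age.labels <- paste0("Age ",
                     c(16,18,25,35,45,55,65,75), "-",
                     c(18,25,35,45,55,65,75,100))
age <- cut(toy$Age_Drv1,
           breaks = c(16,18,25,35,45,55,65,75,100),
           labels = age.labels)

# Count straight from the raw values, then via the binned factor
n.direct <- sum(hour24 == 7 & toy$Age_Drv1 > 18 & toy$Age_Drv1 <= 25)
n.binned <- sum(hour24 == 7 & age == "Age 18-25", na.rm = TRUE)

stopifnot(n.direct == n.binned)         # the two routes should agree
n.direct
# [1] 2
```

The driver aged exactly 18 falls in "Age 16-18" rather than "Age 18-25", which is why the direct filter uses > 18 and <= 25.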

jholwegn commented 3 years ago

@jamisoncrawford, I changed the preprocessing code to the one that you provided and that brought up the correct n values!

@mtwelker Thanks for identifying the discrepancy & testing it :)

jamisoncrawford commented 3 years ago

@mtwelker has been a treasure 🤣 🔥

dholford commented 3 years ago

I've run into this same problem, and tried a million different combinations of everything. So for the good of the order, if you try this and it doesn't work, this is what worked for me:

[image: Capture]