DS4PS / cpp-529-master

Course files for CPP 529 Data Analytics Practicum focused on models of neighborhood change.
https://ds4ps.org/cpp-529-master/

Lab 05 #17

Open etbartell opened 4 years ago

etbartell commented 4 years ago

Is anyone else having an issue in Part 3 where the acs and acs12 dataframes get collapsed to a single number when you use the summarise function? It seems to have just disregarded every command above the last one. Here's the code I used:

acs <- 
  acs %>%
  mutate( id = str_extract(variable, "[0-9]{3}$") %>% as.integer ) %>%
  # variable 1 is the "total", which is just the sum of the others
  filter(id > 1) %>%
  mutate(education = case_when(
    id %>% between(2, 16) ~ "No HS diploma",
    id %>% between(17, 21) ~ "HS, no Bachelors",
    id > 21 ~ "At Least a Bachelors"
  )) %>% 
  group_by(GEOID, education) %>% 
  summarise(estimate = sum(estimate))

Then here's what it did to the dataframe: image

lecy commented 4 years ago

I'm mentioning @Anthony-Howell-PhD to make sure he is getting the notifications.

That code looks fine, it could be something you are doing earlier. Can you include a reproducible example?

etbartell commented 4 years ago

Here's everything I'm running up to that point:

Step 1: Load packages

library(sf)
library(tidyverse)
library(tigris)
library(tidycensus)
library(ggrepel)
library(dplyr)
options(tigris_use_cache=TRUE)
options(tigris_class="sf")

Step 2: Input Census API Key

census_api_key("my_census_key")

Step 3: Get education data and create a data frame (see screenshot below for acs at this point)

acs <- get_acs("tract", table = "B15003", cache_table = TRUE,
               geometry = TRUE, state = "AZ", county = "Maricopa County",
               year = 2017, output = "tidy")

image

Step 4: Transform data to grouped factor levels (see screenshot for transformed acs after this step)

acs <- 
  acs %>%
  mutate( id = str_extract(variable, "[0-9]{3}$") %>% as.integer ) %>%
  # variable 1 is the "total", which is just the sum of the others
  filter(id > 1) %>%
  mutate(education = case_when(
    id %>% between(2, 16) ~ "No HS diploma",
    id %>% between(17, 21) ~ "HS, no Bachelors",
    id > 21 ~ "At Least a Bachelors"
  )) %>% 
  group_by(GEOID, education) %>% 
  summarise(estimate = sum(estimate))

image

AntJam-Howell commented 4 years ago

This code works for me.

# load libraries
library(sf)
library(tidyverse)
library(tigris)
library(tidycensus)
library(ggrepel)
options(tigris_use_cache=TRUE)
options(tigris_class="sf")

census_api_key("my_census_key")

#Bring in 2017 variable related to educational attainment
acs <- get_acs("tract", table = "B15003", cache_table = TRUE,
               geometry = TRUE, state = "AZ", county = "Maricopa County",
               year = 2017, output = "tidy")
acs

#The educational attainment splits things out to quite a few levels (with one for “finished 4th grade” and another for “finished 5th grade” and so on), so I’ll collapse them down to a handful of categories.

acs <- acs %>%
  mutate(
    id = str_extract(variable, "[0-9]{3}$") %>% as.integer
  ) %>%
  # variable 1 is the "total", which is just the sum of the others
  filter(id > 1) %>%
  mutate(education =case_when(
    id %>% between(2, 16) ~ "No HS diploma",
    id %>% between(17, 21) ~ "HS, no Bachelors",
    id > 21 ~ "At Least a Bachelors"
  )) %>% 
  group_by(GEOID, education) %>% 
  summarise(estimate = sum(estimate))

acs
[Screenshot 2019-11-17 16 45 47]

etbartell commented 4 years ago

Yeah I copy-pasted your code just to make sure I didn't miss something and it's still giving me the same result. I know conceptually what each step is supposed to do to the data but for some reason it's ignoring everything except the "summarise" step.

AntJam-Howell commented 4 years ago

That is strange. I'll see if anyone else has this issue. In the meantime, here are some options: 1) make sure all the loaded packages are up to date; 2) run the code in R instead of R Markdown; and 3) break the code down into its individual steps without the piping. You can trace what is going on that way.

sunaynagoel commented 4 years ago

I am having trouble running the Generating Dots chunk of the Part III .rmd file. I have not changed anything; I was just trying to run the sample file.

acs_split <- acs %>%
  filter(estimate > 50) %>%
  split(.$education)

generate_samples <- function(data) 
  suppressMessages(st_sample(data, size = round(data$estimate / 100)))

points <- map(acs_split, generate_samples)
points <- imap(points, 
               ~st_sf(data_frame(education = rep(.y, length(.x))),
                      geometry = .x))
points <- do.call(rbind, points)

Error in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : group length is 0 but data length > 0

katiegentry07 commented 4 years ago

@Anthony-Howell-PhD @sunaynagoel I am having the same issue

AntJam-Howell commented 4 years ago

If you run the Part 3 code in the same .rmd file as the code from Parts 1 and 2, it will not compile correctly. To avoid this error, put the Part 3 code in a new .rmd file, separate from Parts 1 and 2. Let me know if that helps.

cjbecerr commented 4 years ago

Having the same issue. Has this been solved yet?

AntJam-Howell commented 4 years ago

@cjbecerr have you tried any of the following?

1) Make sure all the loaded packages are up to date.

2) Run the code in R instead of R Markdown.

3) Break the code down into its individual steps without the piping.

If the issue is still not resolved after (1) and (2), step (3) will help you trace what is going on and why the data is collapsing to a single cell.
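Step (3) can be sketched as follows. This is a minimal illustration, not the lab's code: `acs_toy` and its values are hypothetical stand-ins for the tidy `get_acs()` output, so no Census API call is needed. Each step is stored in its own object so it can be inspected before moving on.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for the tidy get_acs() output
acs_toy <- tibble(
  GEOID    = rep(c("04013000100", "04013000200"), each = 4),
  variable = rep(c("B15003_001", "B15003_010", "B15003_017", "B15003_022"),
                 times = 2),
  estimate = c(100, 20, 30, 50, 200, 40, 60, 100)
)

# Step 1: pull the three-digit suffix out of the variable code
step1 <- mutate(acs_toy, id = as.integer(str_extract(variable, "[0-9]{3}$")))

# Step 2: drop variable 1, the "total" row
step2 <- filter(step1, id > 1)

# Step 3: collapse the detailed levels into three categories
step3 <- mutate(step2, education = case_when(
  between(id, 2, 16)  ~ "No HS diploma",
  between(id, 17, 21) ~ "HS, no Bachelors",
  id > 21             ~ "At Least a Bachelors"
))

# Step 4: group, then confirm the grouping was actually recorded
step4 <- group_by(step3, GEOID, education)
group_vars(step4)

# Step 5: summarise; expect one row per GEOID/education pair, not a single row
step5 <- summarise(step4, estimate = sum(estimate))
step5
```

If `group_vars(step4)` lists both columns but `step5` still has a single row, the problem is in the `summarise()` call itself rather than in the earlier steps.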

cjbecerr commented 4 years ago

@Anthony-Howell-PhD After troubleshooting, I'm seeing that everything works fine in both R and R Markdown until it reaches summarise(). It's as if it ignores the group_by() and takes the sum with no groupings.

AntJam-Howell commented 4 years ago

Ok, sorry to hear that. Can you attach your .R file (not .rmd file) and I will check it out.

JaesaR commented 4 years ago

I am also getting the split() error that @sunaynagoel reported above in the Generating Dots chunk (group length is 0 but data length > 0), even though my code from Parts 1 and 2 is in a separate .rmd file from Part 3.

AntJam-Howell commented 4 years ago

Ok, sorry to hear that. Can you attach your .R file (not .rmd file) and I will check it out.

cjbecerr commented 4 years ago

Figured it out. Actually turned out to be my packages, needed to restart R. Thank you for the help!
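The thread does not say which package was at fault, so the following is only a sketch of one check worth trying when `summarise()` appears to ignore `group_by()`. It assumes the cause is another attached package that exports its own `summarise()` (plyr is one such package) and masks dplyr's version. The toy data frame is hypothetical.

```r
library(dplyr)

# Which package does the bare name `summarise` resolve to right now?
# If this is not "dplyr", another attached package is masking it.
environmentName(environment(summarise))

# Hypothetical toy data
toy <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))

# Namespace-qualified calls always use dplyr's versions,
# regardless of the order in which packages were attached
out <- toy %>%
  dplyr::group_by(g) %>%
  dplyr::summarise(total = sum(x))
out
```

Restarting R and attaching only the packages the lab needs, as described above, clears any stale masking from earlier in the session.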

AntJam-Howell commented 4 years ago

@sunaynagoel @JaesaR Thanks, Jaesa, for sending your .R file to me. I've run through each line of code and I cannot reproduce the error you are both getting. Everything works fine on my end. Please do the following: (1) update all of the packages in RStudio and re-run the code; (2) if the problem persists, replace ~st_sf(data_frame(education = rep(.y, length(.x))), with ~st_sf(tibble(education = rep(.y, length(.x))),

On my end, I get a warning that data_frame is deprecated and that tibble should be used instead. That may be the problem, depending on which package versions you are running.
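A minimal sketch of the suggested replacement, run on hypothetical hand-made points rather than the lab's sampled dots (the coordinates and the two-element list are assumptions for illustration only):

```r
library(sf)
library(purrr)
library(tibble)

# Hypothetical stand-in for the output of map(acs_split, generate_samples):
# a named list of point geometries, one element per education level
points <- list(
  "No HS diploma"        = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), crs = 4326),
  "At Least a Bachelors" = st_sfc(st_point(c(2, 2)), crs = 4326)
)

# Same imap() call as in the lab, with tibble() in place of data_frame()
points <- imap(points,
               ~ st_sf(tibble(education = rep(.y, length(.x))),
                       geometry = .x))

# Stack the per-level sf objects into one
points <- do.call(rbind, points)
points
```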

sunaynagoel commented 4 years ago

Thank you for your help. I was able to make mine work.

sunaynagoel commented 4 years ago

@Anthony-Howell-PhD Is it OK to submit just the .rmd file? Everything works when I run the chunks separately in the R console, but when I try to produce an HTML file, it gives errors. Thanks.

AntJam-Howell commented 4 years ago

In this case you can upload both the .rmd and .R files. It is easier for me to check the .R file to make sure everything is working correctly.

Jigarci3 commented 4 years ago

Hello @sunaynagoel, How were you able to make it work? I have run into the same issue.

sunaynagoel commented 4 years ago

@Jigarci3 Replacing ~st_sf(data_frame(education = rep(.y, length(.x))), with ~st_sf(tibble(education = rep(.y, length(.x))), as suggested by @Anthony-Howell-PhD, did the trick for Part 3. For Parts 1 and 2, for some reason my dplyr package was masking gridExtra. I also had to restart R. Let me know if this helps.

castower commented 4 years ago

@Anthony-Howell-PhD I'm a bit confused how to find the .r files for submission. Is there a specific place I should look in RStudio? Thanks!

AntJam-Howell commented 4 years ago

[Screenshot 2019-11-20 18 01 18]

If you are not familiar with .R scripts, sending only the .rmd file is fine. For your reference, though, an R script (.R) can be opened as shown in the screenshot.

castower commented 4 years ago

Thank you!

etbartell commented 4 years ago

I was just wondering, are we supposed to submit both rmd files? I know it says just to submit the files for part 3 but since it wouldn't run with the code from part 2, the plots for question 3 wouldn't be included in the submission.

AntJam-Howell commented 4 years ago

Whichever is easiest for you is ok by me.

castower commented 4 years ago

Hello all, I'm curious if I'm doing something wrong. In part 3, I'm trying to create my plots with the following code:

census_api_key("my_census_key")

Var<-c("B19013_001", "B25077_001")
## c(Median household Income, Median Housing Value)
  CenDF <- get_acs(geography = "county",
                   variables = Var,
                   year = 2017,
                   survey = "acs5",
                   geometry = TRUE,
                   shift_geo = TRUE) 

CenDF<-CenDF %>% 
    mutate(variable=case_when( 
      variable=="B19013_001" ~ "HHIncome",
      variable=="B25077_001" ~ "HouseValue")) %>%
    select(-moe) %>%  
    spread(variable, estimate) %>%  #Spread moves rows into columns
    mutate(HHInc_HousePrice_Ratio=round(HouseValue/HHIncome,2)) 

Var<-c("B19013_001","B25077_001")

# Download 2008-2012 df
  CenDF2012 <- get_acs(geography = "county",
                   variables = Var,
                   year = 2012,
                   survey = "acs5",
                   geometry = FALSE)

  #Create new variable for the housing price to income ratio. 
CenDF2012<-CenDF2012 %>% 
  mutate(variable=case_when( 
    variable=="B19013_001" ~ "HHIncome2012",
    variable=="B25077_001" ~ "HouseValue2012")) %>%
  select(-moe,-NAME) %>%  
  spread(variable, estimate) %>%  #Spread moves rows into columns
  mutate(HHInc_HousePrice_Ratio2012=round(HouseValue2012/HHIncome2012,2)) 

CenDF<-merge(CenDF,CenDF2012,by="GEOID", all.x=TRUE)

CenDF<-CenDF %>%
mutate(pct_change = 100 * (`HHInc_HousePrice_Ratio` - `HHInc_HousePrice_Ratio2012`) / `HHInc_HousePrice_Ratio2012`)

library(viridis)
library(gtools)

upper_limit <- round(max(CenDF$pct_change,na.rm=TRUE) + 10, -1)
lower_limit <- round(min(CenDF$pct_change,na.rm=TRUE) - 10, -1)

CenDF$fill_factor <- quantcut(CenDF$HHInc_HousePrice_Ratio, q = c(0,.1,.25,.5,.75,.9,1))

col.ramp <- viridis(n = 6) 

Plot11<-  ggplot(CenDF,aes(fill = pct_change)) +
  geom_sf(size = 0) +
  #geom_sf(data = major_roads_geo, color = "white", size = 0.8, fill = NA) +
  #geom_sf(data = minor_roads_geo, color = "white", size = 0.4, fill = NA) +
  scale_fill_manual("Price-Income Ratio",values =  col.ramp)+
  labs(title="Changes in House Price to Income Ratio",
       subtitle = "2017 5-Year Estimates vs. 2012 5-Year Estimates for Census Tracts",
       caption = paste0(
         "Data sources:",
         "\n  U.S. Census Bureau, 2012 and 2017 American Community Survey 5-Year Estimates"
       )
  ) +
  theme(plot.caption = element_text(hjust = 0, margin = margin(t = 15))) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())
Plot11

My plot keeps giving me the following error:

Error: Continuous value supplied to discrete scale

It was my understanding that this line of code made the scale discrete:

CenDF$fill_factor <- quantcut(CenDF$HHInc_HousePrice_Ratio, q = c(0,.1,.25,.5,.75,.9,1))

Is this incorrect?

castower commented 4 years ago

I figured it out after staring at my code for a while...I forgot to change the variable I was plotting.
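For anyone hitting the same message: `scale_fill_manual()` is a discrete scale, so the variable mapped to `fill` has to be a factor (such as the `quantcut()` result), not the numeric column. The sketch below uses a hypothetical toy data frame and `geom_tile()` instead of the county map so it runs without a Census download. It presumably matches the fix described above, but the post does not show the corrected code.

```r
library(ggplot2)
library(gtools)
library(viridis)

# Hypothetical toy data: 60 evenly spaced ratio values
toy <- data.frame(x = 1:60, ratio = seq(1, 10, length.out = 60))

# quantcut() returns a factor, so it can be used with a discrete scale
toy$fill_factor <- quantcut(toy$ratio, q = c(0, .1, .25, .5, .75, .9, 1))
col.ramp <- viridis(n = 6)

# Discrete scale: map fill to the factor, not to the numeric column
p_discrete <- ggplot(toy, aes(x = x, y = 1, fill = fill_factor)) +
  geom_tile() +
  scale_fill_manual("Price-Income Ratio", values = col.ramp)

# Alternative: keep the numeric column and use a continuous scale instead
p_continuous <- ggplot(toy, aes(x = x, y = 1, fill = ratio)) +
  geom_tile() +
  scale_fill_viridis_c("Price-Income Ratio")

p_discrete
```

Mapping `fill = ratio` together with `scale_fill_manual()` would reproduce the "Continuous value supplied to discrete scale" error.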

Niagara1000 commented 4 years ago

Hi, I'm getting an error when I run this code.

acs_split <- acs %>%
  filter(estimate > 50) %>%
  split(.$education)

This is from Lab 5 Part 2 instructions for CPP 529. It can be found here : https://ds4ps.org/cpp-529-spr-2020/LABS/Lab5b-MapVis2.html under 'Generating Dots' heading.

Error in split.default(x=seq_len(nrow(x)), f=f, drop=drop, ...) : group length is 0 but data length > 0

But oddly enough, when I knit the entire document, the error doesn't appear and the document knits fully. So I don't know what is going on.
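The thread does not confirm the cause here, but one situation that produces exactly this message is calling `split()` on a data frame that has no `education` column, so that `.$education` is `NULL`. That can happen interactively if the `acs` object in the session is not the one created by the `case_when()` step, whereas knitting runs every chunk in order in a fresh session. A minimal base R sketch with hypothetical data:

```r
# Hypothetical stand-in for an `acs` object that never went through the
# case_when() step, so it has no `education` column
acs_raw <- data.frame(GEOID = c("a", "b", "c"), estimate = c(120, 80, 300))

# `$` returns NULL for a column that does not exist
is.null(acs_raw$education)

# Splitting by NULL triggers the error; capture its message instead of stopping
msg <- tryCatch(
  split(acs_raw, acs_raw$education),
  error = function(e) conditionMessage(e)
)
msg

# With the column present, the same split() call works
acs_ok <- transform(
  acs_raw,
  education = c("No HS diploma", "HS, no Bachelors", "No HS diploma")
)
stopifnot("education" %in% names(acs_ok))
acs_split <- split(acs_ok, acs_ok$education)
names(acs_split)
```

Running `names(acs)` just before the `split()` line shows whether `education` is present in the object the console is actually using.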