DS4PS / cpp-529-spr-2021

Course shell for CPP 529 Community Analytics
http://ds4ps.org/cpp-529-spr-2021/

Final Dashboard - adding variable names #35

Open swest235 opened 4 months ago

swest235 commented 4 months ago

I cannot for the life of me understand what this is asking us to do.

@AntJam-Howell Can you please clarify this and what exactly it is doing?

Step by step, please.

Adding interpretable variable names from the data dictionary:

# add a name attribute for each variable
value <- c(1,2,3)
dd.name <- c("one","two","three")

x <- dd.name
names(x) <- value

# dd names and values linked
names( x[2] )

# can now get the label using the value
# via the name attributes
x[ "2" ]

# to add labels to the maps
# use the radio button value
# to get the data dictionary label:
x[ input$demographics ]

swest235 commented 4 months ago

@AntJam-Howell I was able to get the names displayed with the help of GPT - but I would still like to get clarification on what the instructions were trying to say in my original post. Can you please also explain what the point of temp.names was and how it was supposed to relate to the #instructions at the end of that code chunk?

Code used with the help of GPT:

these.variables <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

dd.name <- c("Percent white, non-Hispanic", "Percent black, non-Hispanic", 
"Percent Hispanic", "Percent Native American race", "Percent foreign born", 
"Percent speaking other language at home, age 5 plus", 
"Percent with high school degree or less", 
"Percent with 4-year college degree or more", "Percent unemployed", 
"Percent female labor force participation", "Percent professional employees", 
"Percent manufacturing employees", "Percent veteran", "Percent self-employed", 
"Median HH income, total", "Per capita income", "Percent in poverty, total", 
"Percent owner-occupied units", "Percent vacant units", "Percent multi-family units", 
"Median rent", "Median home value", "Percent structures more than 30 years old", 
"Percent HH in neighborhood 10 years or less", "Percent 17 and under, total", 
"Percent 60 and older, total", "Percent 75 and older, total", 
"Percent currently married, not separated", "Percent widowed, divorced and separated", 
"Percent female-headed families with children")

name_mapping <- setNames(dd.name, these.variables)

choice_names = as.vector(name_mapping[these.variables])
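As a quick sanity check, indexing the mapping by a variable code should return its label (output shown as a comment):

# spot-check of the lookup built above
name_mapping[ "pnhwht12" ]
# returns "Percent white, non-Hispanic"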

# replace these with descriptive labels 
# from the data dictionary 
temp.names <- paste0( "Variable", these.variables )
# <<< what is this even for and how does it connect with the names(x) part of the instructions?

radioButtons( inputId="demographics", label = h3("Census Variables"),
              # choices = these.variables,   # omit 'choices' when supplying choiceNames/choiceValues
              choiceNames=choice_names,
              choiceValues=these.variables,
              selected="pnhwht12")
AntJam-Howell commented 4 months ago

FYI: When posting on GitHub, highlight the code and use the <> button above so that the code is easily distinguished from text (this is an important feature of GitHub over, say, Canvas or email).

With respect to your first comment: the provided R code snippet is meant to be instructive. It demonstrates how to create a mapping between numerical values and human-readable names, a common technique for variable labeling or building a data dictionary. It uses named vectors to establish the mapping, allowing easy lookup of a name by its value. Let's annotate the code step by step:

Step 1: Create a Vector with Numerical Values

The value vector contains numerical data, which can represent keys or indices. In this example, it's a simple vector with three numbers.

# Define a vector of numerical values
value <- c(1, 2, 3)

Step 2: Define a Vector with Corresponding Names

The dd.name vector contains human-readable names that correspond to the value vector. It supplies the labels that the name attributes will point to.

# Define a vector of corresponding names (from a data dictionary)
dd.name <- c("one", "two", "three")

Step 3: Create a Named Vector

The x vector is created by assigning the dd.name vector as its data, then setting the names attribute with the value vector. This creates a named vector in which each value serves as the name for its corresponding label.

# Create a named vector with values as names
x <- dd.name

# Assign numerical values as the names attribute
names(x) <- value

Step 4: Retrieve Names Based on Values

This step demonstrates how to retrieve the name attribute for a given element. Using the names() function, you can extract the name corresponding to a specific position in the vector.

# Get the name corresponding to a specific value
# Retrieve the name for the second item in the vector
name_of_second_item <- names(x[2])

Step 5: Retrieve Items Based on Names

Here, you can get the value (data) based on its name, which is helpful when looking up items by their name attribute. Note that the names of x are "1", "2", "3" (set from value), so the lookup key is the value "2", not the label "two".

# Retrieve the item based on its name
# Access the data associated with the name "2"
item_with_name_two <- x["2"]
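Running the toy example end to end, the lookups return the following (output shown as comments):

x
#       1       2       3 
#   "one"   "two" "three"

names( x[2] )   # "2"
x[ "2" ]        # "two"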

Step 6: Using Named Attributes in a Shiny App

This step demonstrates how you might use named attributes in a Shiny app context. You can look up a particular name based on a user's input, such as a radio button's value.

# Retrieve the data dictionary label using user input
# This could be used in a Shiny app to get a label
# for a given demographic based on the input value
demographic_label <- x[input$demographics]
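For instance, inside a render function the same lookup can feed a plot title. This is a minimal sketch, assuming a named vector x built from the data dictionary (labels named by their variable codes) and some reactive get_data() that returns map data with a decile column q; it is illustrative, not part of the assignment:

renderPlot({
  # look up the human-readable label for the selected variable
  plot.title <- x[ input$demographics ]
  ggplot( get_data() ) +
    geom_sf( aes( fill = q ), color=NA ) +
    labs( title = plot.title )
})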
AntJam-Howell commented 4 months ago

With respect to your second comment,

temp.names <- paste0( "Variable", these.variables )  
# what is this even for and how does it connect with the names(x) part of the instructions?

This line of code can be useful when you need to generate descriptive labels or variable names in a dynamic way, especially if you're working with data structures or elements that require consistent naming.

The code snippet creates a new vector temp.names by concatenating a string ("Variable") with the values from these.variables. This is often used to create descriptive names or identifiers for variables in a consistent format. Here's an annotated breakdown:

paste0(): concatenates strings without a separator. If you want a specific delimiter, use paste().

"Variable": a static string that serves as a prefix. This could represent a common label for a set of variables.

these.variables: an existing vector whose elements you want to concatenate with "Variable". It could be a sequence of numbers, characters, or other values representing variable identifiers or indices.

The result, temp.names, is a new vector whose elements are "Variable" concatenated with the corresponding elements of these.variables.

# Create variable names by concatenating a prefix ("Variable")
# with the values from 'these.variables' vector
temp.names <- paste0("Variable", these.variables)
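With these.variables as defined earlier in this thread, the first few elements would be:

head( temp.names, 3 )
# [1] "Variablepnhwht12" "Variablepnhblk12" "Variablephisp12"

Note that paste0() adds no space between the prefix and the code; paste( "Variable", these.variables ) would insert one.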

Now, when you are ready to create the dashboard, you can substitute temp.names for choice_names and you will quickly see the difference when inspecting the dashboard.

radioButtons( inputId="demographics",
label = h3("Census Variables"),
# choices = these.variables,
choiceNames=choice_names, # change to temp.names to compare
choiceValues=these.variables,
selected="pnhwht12")
swest235 commented 4 months ago

@AntJam-Howell Thank you for the reply, but these instructions are still unclear on how this ties back into the radio button function. Even GPT says temp.names has no clear connection to the rest of the instructions. The steps remain vague, at least when it comes to how these all interlink.

Can you please elaborate further and provide more clear instructions?

What is the additional value of using this method when the approach I took already got the readable names onto the dashboard?

I think I have done as the instructions requested, but I do not understand how these should connect to my radio buttons. I've shared screenshots and code. See below:

x: [screenshot]

names(x): [screenshot]

names(x[2]): [screenshot]

x["pnhblk12"]: [screenshot]

This is what the variables look like on the dashboard: [screenshots]

Input Sidebar:

these.variables <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

value <-c( "pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

dd.name <- c("Percent white, non-Hispanic", 
                 "Percent black, non-Hispanic", "Percent Hispanic", 
                 "Percent Native American race", "Percent foreign born", 
                 "Percent speaking other language at home, age 5 plus", 
                 "Percent with high school degree or less", 
                 "Percent with 4-year college degree or more", 
                 "Percent unemployed", "Percent female labor force participation", 
                 "Percent professional employees", 
                 "Percent manufacturing employees", 
                 "Percent veteran", "Percent self-employed", 
                 "Median HH income, total", "Per capita income", 
                 "Percent in poverty, total", "Percent owner-occupied units", 
                 "Percent vacant units", "Percent multi-family units", 
                 "Median rent", "Median home value", 
                 "Percent structures more than 30 years old",
                 "Percent HH in neighborhood 10 years or less", 
                 "Percent 17 and under, total", "Percent 60 and older, total",
                 "Percent 75 and older, total", 
                 "Percent currently married, not separated", 
                 "Percent widowed, divorced and separated", 
                 "Percent female-headed families with children")

x <- dd.name
names(x) <- value
name_of_second_item <- names( x[2] )
item_with_name_two <- x[ "pwds12" ]   # names are the variable codes, so index by code

# replace these with descriptive labels 
# from the data dictionary 
temp.names <- paste0( "Variable ", these.variables )

radioButtons( inputId="demographics", 
              label = h3("Census Variables"),
              # choices = these.variables,   # omit 'choices' when supplying choiceNames/choiceValues
              choiceNames=temp.names,
              choiceValues=these.variables,
              selected="pnhwht12")
swest235 commented 4 months ago

@AntJam-Howell I apologize; in an effort to keep related posts together I have a few posts in different threads. To streamline things, I'm assembling everything here, although some questions and information I won't put here or it will become even more cluttered. I'll reference the originals if needed.

1) I found a post you made to another student with the dorling solutions file. I created an alternative rmd and used that as my dorling; most things are working better now.

2) Unresolved variable labeling procedures from the above post. What I have done works, as far as labels for the variables are concerned, but I wonder if there is an issue with my choice of code as opposed to the instructions. Either way, I still have no clue how to tie the temp.names to the dd.names and names(x) - I fail to see the connection.

3) Community Demographics: a) I am running into the MHV variable producing an "object object" error in my choropleth and variable distribution tabs. All other variables seem to work fine, but this one specifically generates the errors. (code for the dorling is below)

4) NH Change 2000-2010: a) the variables mhv 2000 & mhv 2010 produce an "object object" error for the choropleth, and the distribution tab says undefined columns; b) the variable value change 2000-2010 seems to work on both map and distributions; c) growth in home value produces a map, but gives me Error: missing values and NaN's not allowed if 'na.rm' is FALSE.

5) a) I kept having issues with the xylims, so I deleted them, and the maps that generate show the appropriate area without them. Is that fine, or might it lead to problems in other areas of the dashboard? I also posted a related question about clarifying the EPSGs; it is in this thread (https://github.com/Watts-College/cpp-529-spr-2023/issues/26)

I'll have other issues, I know. But this is a one-stop for all the other ones I posted. I think.

Below is my code and setup for my dorling:


# libraries used to build the file

library( geojsonio )   # read shapefiles
library( sp )          # work with shapefiles
library( sf )          # work with shapefiles - simple features format
library( mclust )      # cluster analysis 
library( tmap )        # theme maps
library( ggplot2 )     # graphing 
library( ggthemes )    # nice formats for ggplots
library( dplyr )       # data wrangling 
library( pander )      # formatting RMD tables
library( tidycensus )
library( cartogram )  # spatial maps w/ tract size bias reduction
#library( maptools )   # spatial object manipulation 

# clear the workspace
rm( list = ls() )

# set the api key
census_api_key( "8f1ce150e65b8cba01951fbcbbe65ebbb9409638" )

Step 1: Create the Dorling Cartogram


# get the crosswalk data
crosswalk <- read.csv( "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/cbsatocountycrosswalk.csv",  stringsAsFactors=FALSE, colClasses="character" )

# selector variable for San Diego
these.msp <- 
  crosswalk$msaname == grep( "^SAN DIEGO", crosswalk$msaname, value=TRUE ) 

# take just San Diego
these.fips <- crosswalk$fipscounty[ these.msp ]
these.fips <- na.omit( these.fips )

# state and county fips
state.fips <- substr( these.fips, 1, 2 )
county.fips <- substr( these.fips, 3, 5 )

# get the population data
sd.pop <- get_acs( 
  geography = "tract", 
  variables = "B01003_001", 
  state = state.fips, 
  county = county.fips,
  geometry = TRUE ) %>% 
  dplyr::select( GEOID, estimate ) %>%
  dplyr::rename( POP=estimate )

# recode the GEOID variable to conform with the census data
# remove the leading zero
sd.pop$GEOID<-sub( ".","", sd.pop$GEOID )

# add the census data
URL <- "https://github.com/DS4PS/cpp-529-master/raw/master/data/ltdb_std_2010_sample.rds"
census.dat <- readRDS( gzcon( url( URL ) ) )

# merge the pop data for San Diego with the census data
sdd <- merge( sd.pop, census.dat, by.x="GEOID", by.y="tractid" )

# make sure there are no empty polygons
sdd <- sdd[ ! st_is_empty( sdd ) , ]

# convert sf map object to an sp version
sdd.sp <- as_Spatial( sdd )

# project map and remove empty tracts
sdd.sp <- spTransform( sdd.sp, CRS( "+init=epsg:3395" ) )
sdd.sp <- sdd.sp[ sdd.sp$POP != 0 & (! is.na( sdd.sp$POP ) ) , ]

# standardizes it to max of 1.12
sdd.sp$pop.w <- sdd.sp$POP / 4000 

# convert census tract polygons to dorling cartogram
sd_dorling <- cartogram_dorling( x=sdd.sp, weight="pop.w", k=0.05 )

Step 2: Add Clusters


# define the variables we want for the cluster analysis
keep.these <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

# reduce the data structure for the cluster analysis
d1 <- sd_dorling@data
d2 <- dplyr::select( d1, keep.these )

# standardize the variables
d3 <- apply( d2, 2, scale )

# estimate the clusters
set.seed( 1234 )
fit <- Mclust( d3 )
sd_dorling$cluster <- as.factor( fit$classification )

Step 3: Add Census Data


URL1 <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-2000.rds"
d1 <- readRDS( gzcon( url( URL1 ) ) )

URL2 <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-2010.rds"
d2 <- readRDS( gzcon( url( URL2 ) ) )

URLmd <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-META-DATA.rds"
md <- readRDS( gzcon( url( URLmd ) ) )

d1 <- dplyr::select( d1, - year )
d2 <- dplyr::select( d2, - year )

d <- merge( d1, d2, by="tractid" )
d <- merge( d, md, by="tractid" )

# filter rural census tracts
d <- filter( d, urban == "urban" )

# keep variables you want for the merge
keep.us <- c( "tractid", "mhmval00", "mhmval12" )
d <- dplyr::select( d, keep.us )

# adjust 2000 home values for inflation 
mhv.00 <- d$mhmval00 * 1.28855  
mhv.10 <- d$mhmval12

# change in MHV in dollars
mhv.change <- mhv.10 - mhv.00

# remove cases that are less than $1000
mhv.00[ mhv.00 < 1000 ] <- NA

# change in MHV in percent
mhv.growth <- 100 * ( mhv.change / mhv.00 )

# omit cases with growth rates above 200%
mhv.growth[ mhv.growth > 200 ] <- NA

# add variables to the dataframe
d$mhv.00 <- mhv.00
d$mhv.10 <- mhv.10
d$mhv.change <- mhv.change
d$mhv.growth <- mhv.growth 

# recode the tract ids to numbers that match the LTDB
x <- d$tractid 
x <- gsub( "fips", "", x )
x <- gsub( "-", "", x )
x <- sub( ".","", x )

# add the recoded tract id
d$tractid2 <- x 

# Merge the plot with the data needed for the plot
sdd.dat <- merge( 
  sd_dorling, d, by.x="GEOID", by.y="tractid2", all.x=TRUE )

Step 4: Saving the Dorling Cartogram to File


# project to standard lat-lon coordinate system
sdd.dat <- spTransform( sdd.dat, CRS("+proj=longlat +datum=WGS84") )

# write to a file
path <- "C:\\Users\\swest\\Desktop\\Grad School\\Spring 2024\\PAF 516\\Module 4 -"

geojson_write( 
  sdd.dat, 
  file=paste( path, "sd_dorling_solutions.geojson", sep="" ), 
  geometry="polygon" )
#load dorling cartogram
# from local file path
sd <- geojson_read( "C:\\Users\\swest\\Desktop\\Grad School\\Spring 2024\\PAF 516\\Final Project\\sd_dorling_solutions.geojson", what="sp" )

sd2 <- spTransform(sd, CRS("+init=epsg:3395"))

sd.sf<- st_as_sf(sd2)

d<- as.data.frame(sd.sf)

#current_bbox <- st_bbox( geo_sf )   ** Suspect bbox error was cause for "object object" error on dashboard - GPT suggestion

#new_bbox <- c( xmin = -117.346069, xmax = -116.815979, ymin = 32.516977, ymax = 32.912021 )   ** Suspect bbox error was cause for "object object" error on dashboard - GPT suggestion

#sd.sf <- new_bbox   ** Suspect bbox error was cause for "object object" error on dashboard - GPT suggestion

#bb <- st_bbox( c( xmin = -117.346069, xmax = -116.815979, 
#                  ymax = 32.912021, ymin = 32.516977 ), 
#               crs = st_crs("+init=epsg:4326"))   ** neighborhoods tab looks fine without bbox.

Community Demographics
=====================================

Inputs {.sidebar}

these.variables <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

dd.name <- c("Percent white, non-Hispanic", 
                 "Percent black, non-Hispanic", "Percent Hispanic", 
                 "Percent Native American race", "Percent foreign born", 
                 "Percent speaking other language at home, age 5 plus", 
                 "Percent with high school degree or less", 
                 "Percent with 4-year college degree or more", 
                 "Percent unemployed", "Percent female labor force participation", 
                 "Percent professional employees", 
                 "Percent manufacturing employees", 
                 "Percent veteran", "Percent self-employed", 
                 "Median HH income, total", "Per capita income", 
                 "Percent in poverty, total", "Percent owner-occupied units", 
                 "Percent vacant units", "Percent multi-family units", 
                 "Median rent", "Median home value", 
                 "Percent structures more than 30 years old",
                 "Percent HH in neighborhood 10 years or less", 
                 "Percent 17 and under, total", "Percent 60 and older, total",
                 "Percent 75 and older, total", 
                 "Percent currently married, not separated", 
                 "Percent widowed, divorced and separated", 
                 "Percent female-headed families with children")

name_mapping <- setNames(dd.name, these.variables)

choice_names = as.vector(name_mapping[these.variables])

# replace these with descriptive labels 
# from the data dictionary 
#temp.names <- paste0( "Variable", these.variables )

radioButtons( inputId="demographics", 
              label = h3("Census Variables"),
              # choices = these.variables, 
              choiceNames=dd.name,
              choiceValues=these.variables,
              selected="pnhwht12")

# Adding interpretable variable names
# from the data dictionary:
# add a name attribute for each variable
# 
# value <- c(1,2,3)
# dd.name <- c("one","two","three")
# 
# x <- dd.name
# names(x) <- value
#
# dd names and values linked
# names( x[2] )
#
# can now get the label using the value
# using the name attributes 
# x[ "two" ]
#
# to add labels to the maps
# use the radio button value 
# to get the data dictionary label: 
#
# x[ input$demographics ]      

Row {.tabset}

Choropleth Map


renderPlot({

# split the selected variable into deciles 

get_data <- 
  reactive({
             sd.sf <- 
             sd.sf %>% 
             mutate( q = ntile( get(input$demographics), 10 ) )  
          })

ggplot( get_data() ) +
    geom_sf( aes( fill = q ), color=NA ) +
    coord_sf( datum=NA ) +
    labs( title = paste0( "Choropleth of Select Demographics: ", toupper(input$demographics) ),
          caption = "Source: Harmonized Census Files",
          fill = "Population Deciles" ) +
    scale_fill_gradientn( colours=rev(ocean.balance(10)), guide = "colourbar" ) # + 
   # xlim( xmin = -13149614.8500, xmax=-13004231.6222 ) + 
   # ylim( ymin = 3831232.1981, ymax = 3886555.9166 )
  #Hid the above x and y lim lines because they kept generating an off-center map. Removing them centered the map automatically without noticeable issues. 

})

Variable Distribution

renderPlot({

# extract vector x from the data frame 
# x <-  d[ "pnhwht12" ] %>% unlist()

get_variable_x <- reactive({ d[ input$demographics ] })

x <- get_variable_x() %>% unlist()

cut.points <- quantile( x, seq( 0, 1, 0.1 ), na.rm=TRUE )  # na.rm=TRUE avoids the "missing values and NaN's not allowed" error when the variable contains NAs

hist( x, breaks=50, 
      col="gray", border="white", yaxt="n",
      main=paste0( "Histogram of variable ", toupper( input$demographics ) ),
      xlab="red lines represent decile cut points" )

abline( v=cut.points, col="darkred", lty=3, lwd=2 )

})
AntJam-Howell commented 4 months ago

Hi, I will try to answer all of these comments. Ideally, and going forward, please stick to posting one discrete question per post.

Point 1: Is the RMD you are working with now the same one you sent me? I was unable to get the dashboard working with the one you sent me. Looking at the dorling RMD, there was no code to build any of the change variables.

AntJam-Howell commented 4 months ago

Point 2: "Either way, I still have no clue how to tie the temp.names to the dd.names and names(x) - I fail to see the connection."

The piece of code that I explained line by line is simply an abstract example to help students understand variable assignment, labeling, etc., and has nothing to do with the final dashboard. Changing 1, 2, 3 to one, two, and three is quite clearly unrelated to our project. It is only meant as added learning that most people find useful, but it is clearly a stumbling block for you. As long as you understand variable assignment, indexing, labeling, etc., please just ignore that block of code, which is all commented out to indicate that it is an abstract example, not to be run in the dashboard directly.

As I explained before, when your dashboard is built, look at your variable labels. Then substitute temp.names for choice_names (in the context below), rerun the dashboard, and you will quickly see that "Variable" is added as a prefix to all of the variable names. Again, this is not important for the dashboard, and temp.names will not be used. It is intended to show how easy it is to systematically add a constant to a large number of variable labels with the paste function.

these.variables <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

var.names <- c( "Percent white, non-Hispanic", 
             "Percent black, non-Hispanic",
             "Percent Hispanic",                    
             "Percent Native American race", 
             "Percent foreign born", 
             "Percent speaking other language at home, age 5 plus", 
             "Percent with high school degree or less", 
             "Percent with 4-year college degree or more", 
             "Percent unemployed", "Percent female labor force participation", 
             "Percent professional employees",
             "Percent manufacturing employees", 
             "Percent veteran", "Percent self-employed", 
             "Median HH income, total",
             "Per capita income", 
             "Percent in poverty, total", 
             "Percent owner-occupied units",
             "Percent vacant units", 
             "Percent multi-family units", 
             "Median rent",
             "Median home value", 
             "Percent structures more than 30 years old",
             "Percent HH in neighborhood 10 years or less", 
             "Percent 17 and under, total", 
             "Percent 60 and older, total",
             "Percent 75 and older, total", 
             "Percent currently married, not separated", 
             "Percent widowed, divorced and separated", 
             "Percent female-headed families with children" )

# replace these with descriptive labels 
# from the data dictionary 
temp.names <- paste0( "Variable-", var.names ) # This is absoltuley not necessary to do, it is only a suggestion to show you how to add a constant change to all variables.

radioButtons( inputId="demographics", 
              label = h3("Census Variables"),
              choiceNames=var.names,
              choiceValues=these.variables,
              selected="pnhwht12")
AntJam-Howell commented 4 months ago

Point 3: see my point 1 reply above. In the RMD you sent me used to create the dorling, I did not find any code required to calculate any of the change variables.

AntJam-Howell commented 4 months ago

Point 4: see my point 1 reply above. In the RMD you sent me used to create the dorling, I did not find any code required to calculate any of the change variables.

AntJam-Howell commented 4 months ago

Point 5: R typically does a good job producing the map without having to specify the coordinates. With that said, you can use the locator() function (PC) or another mapping source like bbox finder (if using a Mac) to get the bounding box. On the bbox finder website, choose pseudo-mercator, a type of map projection typically relied on for map visualization. Next, draw the bounding box and input the xmin/xmax and ymin/ymax. I use the following coordinates:

ggplot( get_data() ) +
    geom_sf( aes( fill = q ), color=NA ) +
    coord_sf( datum=NA ) +
    labs( title = paste0( "Spatial Distribution of Home Values: ", toupper(input$demographics) ),
          caption = "Source: Harmonized Census Files",
          fill = "Home Value Deciles" ) +
    scale_fill_gradientn( 
      colours=rev( ocean.balance( 10 ) ), guide = "colourbar" ) + 
    xlim( xmin = -13071745, xmax = -12980948 ) + 
    ylim( ymin = 3811797, ymax = 3931462 )

}) 

Those same coordinates can also be used to define the bbox as:

bb <- st_bbox( c( xmin = -13071745, xmax = -12980948, 
                  ymin = 3811797, ymax = 3931462 ), 
               crs = st_crs("+init=epsg:3395"))

Coordinate Reference System (CRS): the bounding box is also associated with a coordinate reference system, defined by the crs parameter. The st_crs() function assigns a specific CRS to the bounding box. "+init=epsg:3395" refers to the EPSG code for a specific map projection; EPSG 3395 is the "World Mercator" projection, a global variant of the Mercator projection.
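If it helps, here is a minimal sketch (standard sf functions, using the San Diego coordinates above) showing how the same box can be re-expressed in lat-lon by transforming its corner geometry:

library( sf )

# bounding box in World Mercator (EPSG 3395, meters)
bb.merc <- st_bbox( c( xmin = -13071745, xmax = -12980948, 
                       ymin = 3811797, ymax = 3931462 ), 
                    crs = st_crs( 3395 ) )

# convert the box to a polygon, transform it, and take its bbox in lat-lon
bb.ll <- st_bbox( st_transform( st_as_sfc( bb.merc ), 4326 ) )
bb.ll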

Now, you would need a separate sequence of courses to study in depth all of the GIS/mapping/coordinate-reference-system material, so you are not necessarily going to obtain expert domain knowledge or understand every single angle here (the same goes for various other snippets of code/functions/tools/methods related to this class. This is a catch-all class drawing from various domains, and it requires excellent troubleshooting skills to solve things that may be completely unfamiliar to you).

With that said, if you want additional information about the difference between the different map projections you see on bbox finder, please read below for reference:

In ggplot2 with geom_sf, the choice between Mercator and WGS (World Geodetic System) projections depends on your data's geographic context and intended visualization purpose. Here's an explanation to help you choose the right projection:

geom_sf is part of the ggplot2 package for visualizing spatial data in R. It allows you to plot geographical data (like maps) with ease, thanks to its support for simple features (hence "sf").

WGS, typically WGS84, uses a geographic coordinate system based on latitude and longitude. It represents points on the Earth's surface in a spherical or ellipsoidal way. Ideal for storing and exchanging geographic data but less suited for direct visualization due to distortions when projected onto a flat surface.

Mercator Projection: A common map projection that transforms spherical coordinates into a rectangular map. It preserves angles and shapes at small scales but distorts areas, especially as you move further from the equator. Popular for navigation and certain web maps (like Google Maps), but it can exaggerate the size of regions near the poles.

Use WGS for Data Preparation: If you're manipulating raw geographic data, WGS is the best format for its simplicity and global standardization. It's a common choice for input data before any projection is applied.

Use Mercator for Visualization: If you're plotting a map or other visual representation, Mercator is a common choice because it translates well to a flat, rectangular surface, making it easier to interpret. It's suitable for map-based data visualizations, especially when you're not focusing on the relative areas of different regions.
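Putting those two rules together, a minimal sketch (assuming an sf object sd.sf stored in WGS84 with a decile column q, as in the dashboard code earlier in this thread):

library( sf )
library( ggplot2 )

# keep the data in WGS84; project to World Mercator only for display
sd.merc <- st_transform( sd.sf, 3395 )

ggplot( sd.merc ) +
  geom_sf( aes( fill = q ), color = NA ) +
  coord_sf( datum = NA )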

swest235 commented 4 months ago

@AntJam-Howell thank you for the thorough explanations.

To reiterate, I shared in my comment that I am using the dorling solutions file and I am still experiencing some of the issues I mentioned. I'll repost them below now that we are on a shared understanding I am using the correct dorling.

Community Demographics: a) I am running into the MHV variable producing an "object object" error in my choropleth and variable distribution tabs. All other variables seem to work fine, but this one specifically generates the errors. (Again, I am using the solutions dorling you shared.) *EDIT: In case anyone else has this issue, it had to do with the steps of the dorling file creation - there were identical cases of mhmval12, so when the merge happened it kept both, but one was given a .x suffix and the other a .y. If you just use mhmval12.x (or .y) for that variable in your these.variables vector, it should do the trick.*
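For anyone else hitting this, a minimal sketch of the merge-suffix behavior (toy data frames, not the course files):

# merge() keeps both copies of a duplicated column and adds .x/.y suffixes
a <- data.frame( tractid = 1:2, mhmval12 = c( 100, 200 ) )
b <- data.frame( tractid = 1:2, mhmval12 = c( 110, 210 ) )
m <- merge( a, b, by = "tractid" )
names( m )
# [1] "tractid"    "mhmval12.x" "mhmval12.y"

So the dashboard has to reference the suffixed name (e.g. mhmval12.x) after the merge.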

NH Change 2000-2010: a) the variables mhv 2000 & mhv 2010 produce an "object object" error for the choropleth, and the distribution tab says undefined columns; b) the variable value change 2000-2010 seems to work on both map and distributions; c) growth in home value produces a map, but gives me Error: missing values and NaN's not allowed if 'na.rm' is FALSE.

Could you please respond to this question from my email that you hadn't addressed too, please? "How exactly are we to go about coming up with names for the cluster tab? In the lab where we had to come up with names we had a whole boatload of those charts (can't recall their names) that gave us tons of points on where the groups landed in relation to the census variables we used. All I see in my cluster page is the coded clusters but no reference to what is what. I'm likely just being dumb and completely missing something that should be right in my face, but I can't gather what it is or how to go about doing that part accurately. Any help on this?"

Dorling Used:


# libraries used to build the file

library( geojsonio )   # read shapefiles
library( sp )          # work with shapefiles
library( sf )          # work with shapefiles - simple features format
library( mclust )      # cluster analysis 
library( tmap )        # theme maps
library( ggplot2 )     # graphing 
library( ggthemes )    # nice formats for ggplots
library( dplyr )       # data wrangling 
library( pander )      # formatting RMD tables
library( tidycensus )
library( cartogram )  # spatial maps w/ tract size bias reduction
#library( maptools )   # spatial object manipulation 

# clear the workspace
rm( list = ls() )

# set the api key
census_api_key( "8f1ce150e65b8cba01951fbcbbe65ebbb9409638" )

Step 1: Create the Dorling Cartogram


# get the crosswalk data
crosswalk <- read.csv( "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/cbsatocountycrosswalk.csv",  stringsAsFactors=FALSE, colClasses="character" )

# selector variable for San Diego
these.msp <- 
  crosswalk$msaname == grep( "^SAN DIEGO", crosswalk$msaname, value=TRUE ) 

# take just San Diego
these.fips <- crosswalk$fipscounty[ these.msp ]
these.fips <- na.omit( these.fips )

# state and county fips
state.fips <- substr( these.fips, 1, 2 )
county.fips <- substr( these.fips, 3, 5 )

# get the population data
sd.pop <- get_acs( 
  geography = "tract", 
  variables = "B01003_001", 
  state = state.fips, 
  county = county.fips,
  geometry = TRUE ) %>% 
  dplyr::select( GEOID, estimate ) %>%
  dplyr::rename( POP=estimate )

# recode the GEOID variable to conform with the census data
# remove the leading zero
sd.pop$GEOID<-sub( ".","", sd.pop$GEOID )

# add the census data
URL <- "https://github.com/DS4PS/cpp-529-master/raw/master/data/ltdb_std_2010_sample.rds"
census.dat <- readRDS( gzcon( url( URL ) ) )

# merge the pop data for San Diego with the census data
sdd <- merge( sd.pop, census.dat, by.x="GEOID", by.y="tractid" )

# make sure there are no empty polygons
sdd <- sdd[ ! st_is_empty( sdd ) , ]

# convert sf map object to an sp version
sdd.sp <- as_Spatial( sdd )

# project map and remove empty tracts
sdd.sp <- spTransform( sdd.sp, CRS( "+init=epsg:3395" ) )
sdd.sp <- sdd.sp[ sdd.sp$POP != 0 & (! is.na( sdd.sp$POP ) ) , ]

# standardizes it to max of 1.12
sdd.sp$pop.w <- sdd.sp$POP / 4000 

# convert census tract polygons to dorling cartogram
sd_dorling <- cartogram_dorling( x=sdd.sp, weight="pop.w", k=0.05 )

Step 2: Add Clusters


# define the variables we want for the cluster analysis
keep.these <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

# reduce the data structure for the cluster analysis
d1 <- sd_dorling@data
d2 <- dplyr::select( d1, keep.these )

# standardize the variables
d3 <- apply( d2, 2, scale )

# estimate the clusters
set.seed( 1234 )
fit <- Mclust( d3 )
sd_dorling$cluster <- as.factor( fit$classification )

Step 3: Add Census Data


URL1 <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-2000.rds"
d1 <- readRDS( gzcon( url( URL1 ) ) )

URL2 <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-2010.rds"
d2 <- readRDS( gzcon( url( URL2 ) ) )

URLmd <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-META-DATA.rds"
md <- readRDS( gzcon( url( URLmd ) ) )

d1 <- dplyr::select( d1, - year )
d2 <- dplyr::select( d2, - year )

d <- merge( d1, d2, by="tractid" )
d <- merge( d, md, by="tractid" )

# filter rural census tracts
d <- filter( d, urban == "urban" )

# keep variables you want for the merge
keep.us <- c( "tractid", "mhmval00", "mhmval12" )
d <- dplyr::select( d, keep.us )

# adjust 2000 home values for inflation 
mhv.00 <- d$mhmval00 * 1.28855  
mhv.10 <- d$mhmval12

# change in MHV in dollars
mhv.change <- mhv.10 - mhv.00

# remove cases that are less than $1000
mhv.00[ mhv.00 < 1000 ] <- NA

# change in MHV in percent
mhv.growth <- 100 * ( mhv.change / mhv.00 )

# omit cases with growth rates above 200%
mhv.growth[ mhv.growth > 200 ] <- NA

# add variables to the dataframe
d$mhv.00 <- mhv.00
d$mhv.10 <- mhv.10
d$mhv.change <- mhv.change
d$mhv.growth <- mhv.growth 

# recode the tract ids to numbers that match the LTDB
x <- d$tractid 
x <- gsub( "fips", "", x )
x <- gsub( "-", "", x )
x <- sub( ".","", x )

# add the recoded tract id
d$tractid2 <- x 

# Merge the plot with the data needed for the plot
sdd.dat <- merge( 
  sd_dorling, d, by.x="GEOID", by.y="tractid2", all.x=TRUE )

Step 4: Saving the Dorling Cartogram to File


# project to standard lat-lon coordinate system
sdd.dat <- spTransform( sdd.dat, CRS("+proj=longlat +datum=WGS84") )

# write to a file
path <- "C:\\Users\\swest\\Desktop\\Grad School\\Spring 2024\\PAF 516\\Module 4 -"

geojson_write( 
  sdd.dat, 
  file=paste( path, "sd_dorling_solutions.geojson", sep="" ), 
  geometry="polygon" )
AntJam-Howell commented 4 months ago

Hi, OK, good to know you are working on the solutions RMD and not the one you emailed me.

In that case, please send me the most up-to-date final dashboard RMD along with the version of the correct dorling.
