DS4PS / cpp-529-fall-2020

http://ds4ps.org/cpp-529-fall-2020/

Input$WORD - Errors #31

Open MeghanPaquette opened 3 years ago

MeghanPaquette commented 3 years ago

Hi @lecy Dr. Lecy,

I think I am almost there! I am just a little stuck on the input variables not running in other chunks. There may be a larger error, but I have tried a couple of things and they didn't work. I attached two photos from the output of the dashboard, and then the code chunks related to the error. Meghan

[Two screenshots of the dashboard output]

`

these.variables <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

value <- c("pnhwht12", "pnhblk12", "phisp12", 
"pntv12", "pfb12", "polang12", "phs12", "pcol12", "punemp12", 
"pflabf12", "pprof12", "pmanuf12", "pvet12", "psemp12", "hinc12", 
"incpc12", "ppov12", "pown12", "pvac12", "pmulti12", "mrent12", 
"mhmval12", "p30old12", "p10yrs12", "p18und12", "p60up12", "p75up12", 
"pmar12", "pwds12", "pfhh12")

dd.name <- c("Percent white, non-Hispanic", 
"Percent black, non-Hispanic", "Percent Hispanic", "Percent Native American race", 
"Percent foreign born", "Percent speaking other language at home, age 5 plus", 
"Percent with high school degree or less", "Percent with 4-year college degree or more", 
"Percent unemployed", "Percent female labor force participation", 
"Percent professional employees", "Percent manufacturing employees", 
"Percent veteran", "Percent self-employed", "Median HH income, total", 
"Per capita income", "Percent in poverty, total", "Percent owner-occupied units", 
"Percent vacant units", "Percent multi-family units", "Median rent", 
"Median home value", "Percent structures more than 30 years old", 
"Percent HH in neighborhood 10 years or less", "Percent 17 and under, total", 
"Percent 60 and older, total", "Percent 75 and older, total", 
"Percent currently married, not separated", "Percent widowed, divorced and separated", 
"Percent female-headed families with children")

x <- dd.name
names(x) <- value

temp.names <- paste0( x )

radioButtons( inputId="demographics", 
              label = h3("Census Variables"),
              # choices = these.variables, 
              choiceNames=temp.names,
              choiceValues=these.variables,
              selected="Percent white, non-Hispanic")

renderPlot({

# split the selected variable into deciles 

get_data <- 
  reactive({
             vegas.sf <- 
             vegas.sf %>% 
             mutate( q = ntile( get(input$demographics), 10 ) )  
          })

ggplot( get_data() ) +
    geom_sf( aes( fill = q ), color=NA ) +
    coord_sf( datum=NA ) +
    labs( title = paste0( "Choropleth of Select Demographics: ", toupper(input$demographics) ),
          caption = "Source: Harmonized Census Files",
          fill = "Population Deciles" ) +
    scale_fill_gradientn( colours=rev(ocean.balance(10)), guide = "colourbar" ) + 
    xlim( xmin = -12965489, xmax = -12666171 ) + 
    ylim( ymin = 4227911, ymax = 4352610 )

})

`

lecy commented 3 years ago

When you are building the widget you need to match the label users will see on the GUI with the actual variable name.

cbind( dd.name, value ) %>% knitr::kable()
dd.name value
Percent white, non-Hispanic pnhwht12
Percent black, non-Hispanic pnhblk12
Percent Hispanic phisp12
Percent Native American race pntv12
Percent foreign born pfb12
Percent speaking other language at home, age 5 plus polang12
Percent with high school degree or less phs12
Percent with 4-year college degree or more pcol12
Percent unemployed punemp12
Percent female labor force participation pflabf12
Percent professional employees pprof12
Percent manufacturing employees pmanuf12
Percent veteran pvet12
Percent self-employed psemp12
Median HH income, total hinc12
Per capita income incpc12
Percent in poverty, total ppov12
Percent owner-occupied units pown12
Percent vacant units pvac12
Percent multi-family units pmulti12
Median rent mrent12
Median home value mhmval12
Percent structures more than 30 years old p30old12
Percent HH in neighborhood 10 years or less p10yrs12
Percent 17 and under, total p18und12
Percent 60 and older, total p60up12
Percent 75 and older, total p75up12
Percent currently married, not separated pmar12
Percent widowed, divorced and separated pwds12
Percent female-headed families with children pfhh12

You need to align labels and values with the proper arguments:

          choiceNames=temp.names,
          choiceValues=these.variables,
radioButtons( inputId="demographics", 
              label = h3("Census Variables"),             
              choiceNames=temp.names,
              choiceValues=these.variables,
              selected="Percent white, non-Hispanic")

choiceNames, choiceValues:

List of names and values, respectively, that are displayed to the user in the app and correspond to each choice (for this reason, choiceNames and choiceValues must have the same length).
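A minimal base R sketch of the same pairing, assuming shortened versions of the vectors above. The name `choices` and the three-item vectors are illustrative, not part of the course template. Building one named vector keeps each label tied to its variable name.

```r
# Shortened, illustrative versions of the vectors from the template
value   <- c("pnhwht12", "pnhblk12", "phisp12")
dd.name <- c("Percent white, non-Hispanic",
             "Percent black, non-Hispanic",
             "Percent Hispanic")

# Labels and values must pair one-to-one
stopifnot(length(dd.name) == length(value))

# Named vector: names are what the user sees, elements are the column names
choices <- setNames(value, dd.name)

# The default selection should be a value (a column name), not a label
default.choice <- unname(choices[1])

print(choices)
print(default.choice)
```

A named vector like this can also be passed to the `choices` argument of `radioButtons()` instead of separate `choiceNames` and `choiceValues`. In either form, `selected` is matched against the values, so it should be something like "pnhwht12" rather than the label text.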

lecy commented 3 years ago

Note that neighborhood change variables would need to be created as:

x.change <- x.2010 - x.2000

# all together
d <- 
d %>% 
  mutate( x1.change = x1.2010 - x1.2000, 
               x2.change = x2.2010 - x2.2000 )

Recall that the 2012 variables are really 2010 variables (2012 ACS five-year samples, which center at 2010).

MeghanPaquette commented 3 years ago

@lecy

I seem to still be getting the same error message on the dashboard for both. I am focusing on the first part. It is saying something about the mutate, so I can't tell if it is considering the input$demographics part, the vegas.sf part, or the whole thing is just wrong.

ERROR in First Tab:

Problem with mutate() input q.
x object 'pnhwht12' not found
ℹ Input q is ntile(get(input$demographics), 10).

ERROR in Second Tab: undefined columns selected

`


# DATA STEPS 

# from local file path
vegas <- geojson_read( "vegas_dorling.geojson", what="sp" )

plot( vegas )

# reproject the map 

vegas2 <- spTransform( vegas, CRS("+init=epsg:3395") )

# convert the sp map format to 
# an sf (simple features) format:
# ggmap requires the sf format
vegas.sf <- st_as_sf( vegas2 )

# separate out the data frame from the map
d <- as.data.frame( vegas.sf )

Community Demographics
=====================================

Inputs {.sidebar}

these.variables <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

value <- c("pnhwht12", "pnhblk12", "phisp12", 
"pntv12", "pfb12", "polang12", "phs12", "pcol12", "punemp12", 
"pflabf12", "pprof12", "pmanuf12", "pvet12", "psemp12", "hinc12", 
"incpc12", "ppov12", "pown12", "pvac12", "pmulti12", "mrent12", 
"mhmval12", "p30old12", "p10yrs12", "p18und12", "p60up12", "p75up12", 
"pmar12", "pwds12", "pfhh12")

dd.name <- c("Percent white, non-Hispanic", 
"Percent black, non-Hispanic", "Percent Hispanic", "Percent Native American race", 
"Percent foreign born", "Percent speaking other language at home, age 5 plus", 
"Percent with high school degree or less", "Percent with 4-year college degree or more", 
"Percent unemployed", "Percent female labor force participation", 
"Percent professional employees", "Percent manufacturing employees", 
"Percent veteran", "Percent self-employed", "Median HH income, total", 
"Per capita income", "Percent in poverty, total", "Percent owner-occupied units", 
"Percent vacant units", "Percent multi-family units", "Median rent", 
"Median home value", "Percent structures more than 30 years old", 
"Percent HH in neighborhood 10 years or less", "Percent 17 and under, total", 
"Percent 60 and older, total", "Percent 75 and older, total", 
"Percent currently married, not separated", "Percent widowed, divorced and separated", 
"Percent female-headed families with children")
x <- dd.name
names(x) <- value
cbind( dd.name, value ) %>% knitr::kable()
temp.names <- paste0( dd.name )

radioButtons( inputId="demographics", 
              label = h3("Census Variables"),
              choiceNames=temp.names,
              choiceValues=these.variables,
              selected="pnhwht12")

Row {.tabset}

Choropleth Map


renderPlot({

# split the selected variable into deciles 

get_data <- 
  reactive({
             vegas.sf <- 
             vegas.sf %>% 
             mutate( q = ntile( get(input$demographics), 10 ) )  
          })

ggplot( get_data() ) +
    geom_sf( aes( fill = q ), color=NA ) +
    coord_sf( datum=NA ) +
    labs( title = paste0( "Choropleth of Select Demographics: ", toupper(input$demographics) ),
          caption = "Source: Harmonized Census Files",
          fill = "Population Deciles" ) +
    scale_fill_gradientn( colours=rev(ocean.balance(10)), guide = "colourbar" ) + 
    xlim( xmin = -12965489, xmax = -12666171 ) + 
    ylim( ymin = 4227911, ymax = 4352610 )

})

Variable Distribution

renderPlot({

# extract vector x from the data frame 
x <-  d[ "pnhwht12" ] %>% unlist()

get_variable_x <- reactive({ d[ input$demographics ] })

x <- get_variable_x() %>% unlist()

cut.points <- quantile( x, seq( 0, 1, 0.1 ) )

hist( x, breaks=50, 
      col="gray", border="white", yaxt="n",
      main=paste0( "Histogram of variable ", toupper( input$demographics ) ),
      xlab="red lines represent decile cut points" )

abline( v=cut.points, col="darkred", lty=3, lwd=2 )

})

`

lecy commented 3 years ago

Which variables are in vegas.sf?

names( vegas.sf ) %>% sort() 
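
To go one step further, a quick sketch that lists exactly which widget variables are absent from the data. The toy data frame `dat` and its columns are hypothetical stand-ins for the attribute table of the map.

```r
# Toy stand-in for the map's attribute table (hypothetical columns)
dat <- data.frame(tractid  = c("a", "b"),
                  pnhwht12 = c(50, 60),
                  hinc12   = c(40000, 52000))

# Variables offered in the widget (shortened for illustration)
these.variables <- c("pnhwht12", "pnhblk12", "hinc12")

# Any name returned here is offered in the widget but is not in the data
missing.vars <- setdiff(these.variables, names(dat))
print(missing.vars)
```

An empty result means every widget option has a matching column.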

JasonSills commented 3 years ago

Hi, I was having a similar issue and ran

names( sea.sf ) %>% sort()

Below are the variables in my sea.sf dataset. It looks like I'm missing the ones in these.variables <- c(...) in the template. Do I need to recreate all of these? Or did something happen in the merge for the data file? [image]

lecy commented 3 years ago

Yes, you should add all the variables you need in the data steps.

For the demographic / choropleth tabs the actual statistics are better than z-scores.

You don't have to stick to variables used in the demo template if you think others would be better.

Pro tip - if you want to create a variable list using existing variables try:

dput( names( dat ) )

This prints the code for a character vector, which you can paste into your script, instead of just printing the names out.

e.g.

c("x1","x2","x3")

JasonSills commented 3 years ago

Hi @lecy,

That worked and my dashboard is close, but I'm receiving an error in the variable distribution tabs for community demographics and neighborhood change, and I'm not seeing the map in values. I suspect it's all from the same error below. I'm assuming this is an issue in my data, not in my code. It looks like there are NaN values somewhere in the data. From a statistical integrity perspective, what is the best way to handle these? I don't know if I should replace them with 0, etc. From a code perspective, what is the fix? [three images]

JasonSills commented 3 years ago

@lecy Thinking about it more, I think I see what is happening. When I remove the outliers in the census dataset and join to the dorling dataset there will be missing values. I tried sea[complete.cases(sea), ], but it returned an error. What code should I use to remove rows with NA values?

UPDATE: I've also tried na.omit(sea), but the rows with NA are still present.

In my merge I tried shifting from a left join (all.x=T) to a right join (all.y=T) and it had no effect. This one is particularly perplexing. You can see very clearly that there is an NA in the tractid.

image

lecy commented 3 years ago

You can approach it in three ways:

(1) Create a data subset for a specific purpose, like the data for the clustering models where you select specific variables. In this case a subset of data for a specific tab that is limited to the user options specified in the widget.

You can then remove ALL rows that have ANY missing values:

d2 <- na.omit( d1 )

This is pretty heavy-handed, though, so it just depends on how much missing data you have.

If you have one variable with lots of missing values, for example, it might mean that you drop half of your dataset because of that one variable.

(2) Select a variable, then omit missing cases for that variable.

Much more conservative approach that will minimize the amount of data dropped:

v1 <- d$v1
v2 <- na.omit( v1 )

(3) Data imputation

Probably the most complicated approach, and I would recommend using this only if you were doing modeling where the sample size was important.

It is not a good idea to impute missing values using zero, though. Much better to use the mean. Something like:

v2 <- v1
v2[ is.na( v2 ) ] <- mean( v2, na.rm=T )
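
A small sketch comparing the three approaches side by side, assuming a toy data frame with hypothetical columns `v1` and `v2`. It shows how much data each option keeps.

```r
# Toy data with missing values (hypothetical columns v1 and v2)
d1 <- data.frame(v1 = c(1, NA, 3, 4),
                 v2 = c(10, 20, NA, 40))

# (1) Drop every row that has any missing value
d2 <- na.omit(d1)

# (2) Drop missing cases for a single variable only
v1.complete <- d1$v1[!is.na(d1$v1)]

# (3) Mean imputation for a single variable
v1.imputed <- d1$v1
v1.imputed[is.na(v1.imputed)] <- mean(v1.imputed, na.rm = TRUE)

print(nrow(d2))             # rows left after dropping any row with an NA
print(length(v1.complete))  # values left after dropping NAs in v1 only
print(v1.imputed)           # NA in v1 replaced by the mean of observed values
```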

lecy commented 3 years ago

@JasonSills That is odd. Especially after changing the left join to a right join.

Can you write the data file to a RDS and attach it here?

d <- sea@data
saveRDS( d, "seattle.rds" )

It might be the case that they are stored as "NA" strings in a character vector, in which case I'm not sure that na.omit() or complete.cases() would work, since they only detect true missing values.
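
A quick sketch of that scenario, using a made-up character vector. If the text "NA" is what is stored, it has to be recoded before the missing-value functions will see it.

```r
# Toy character column where a missing value was stored as the text "NA"
x <- c("A", "NA", "B")

# is.na() does not flag the literal string "NA"
n.flagged.before <- sum(is.na(x))

# Recode the string to a true missing value
x[x == "NA"] <- NA
n.flagged.after <- sum(is.na(x))

print(n.flagged.before)
print(n.flagged.after)
```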

lecy commented 3 years ago

You should check to see if the data is empty in the census side before merging.

d[ d$tractid2 == "5302997011" ,  ] 

If the NAs are still present after a left join that means the tract ID exists but the associated data is all NAs. I seem to recall some rows like that in the LTDB database.

image

lecy commented 3 years ago

Or, to get the extent of missing data in the LTDB census data, just try:

summary( d )

It should report the number of missing values per variable.
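
If you only want the counts, a compact alternative is sketched here on a toy data frame (the column values are made up for illustration):

```r
# Toy census-style table with some missing values
d <- data.frame(tractid = c("a", "b", "c"),
                hinc12  = c(40000, NA, 52000),
                ppov12  = c(NA, NA, 12))

# Number of missing values in each column
na.counts <- colSums(is.na(d))

# Show only the columns that have at least one missing value
print(na.counts[na.counts > 0])
```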

JasonSills commented 3 years ago

Here is the rds file, I had to zip it to attach it here. seattle.zip

lecy commented 3 years ago

No NAs in the census dataset. They are introduced during the merge step.

sum( is.na( d ) )
[1] 0
x <- unlist( d )
sum( grepl( "^NA$", x ) ) 
[1] 0

If you remove the all.x=TRUE and all.y=TRUE arguments during merge() it will return only the tract IDs shared by both datasets (the intersection, not the union). That would drop all of the missing tracts from your data.
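
A toy illustration of the difference, assuming two small made-up tables in which the map has a tract that the census table lacks:

```r
# Toy tables: the map has tract "c", the census table does not
map.dat    <- data.frame(tractid = c("a", "b", "c"))
census.dat <- data.frame(tractid = c("a", "b"),
                         hinc12  = c(40000, 52000))

# Keeping all map rows introduces an NA for the unmatched tract
left <- merge(map.dat, census.dat, by = "tractid", all.x = TRUE)

# The default merge keeps only tract IDs found in both tables
inner <- merge(map.dat, census.dat, by = "tractid")

print(left)
print(inner)
```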

Otherwise I'm not sure why the complete cases function would not work, except there are some cases with values of infinity in the census file. Maybe any of the non-finite numbers cause errors as well?
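
To check for non-finite values, here is a sketch on a made-up numeric vector. Note that is.na() catches NA and NaN but not infinite values, while is.finite() is FALSE for all of them.

```r
# Toy vector mixing ordinary, missing, and non-finite values
x <- c(10, Inf, NaN, NA, -Inf, 25)

# Counts NA and NaN only
n.na <- sum(is.na(x))

# Recode everything that is not a finite number to NA
x.clean <- x
x.clean[!is.finite(x.clean)] <- NA
n.na.clean <- sum(is.na(x.clean))

print(n.na)
print(n.na.clean)
```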

JasonSills commented 3 years ago

What I was thinking is that it has to do with filtering out urban and the functions for removing outliers:

mhv.00[ mhv.00 < 1000 ] <- NA

# Omit cases with growth rates above 200%
mhv.growth[ mhv.growth > 200 ] <- NA

If I'm removing some from the census data set, not from the dorling data set, and join on dorling I'll have the NA values. But with that it would seem it's a problem with every student's dataset, and I'm thinking I'm the only one with this issue.

JasonSills commented 3 years ago

Hi @lecy

Some minor success. I went back and removed the urban filter and outliers and I now have variable distributions. I also found the bug in the median home values, so the variable distributions are working in both. Seems to support my above hypothesis.

Unfortunately this did not update my choropleth maps. It's a blank white sheet. My map for my clusters is working, so I don't think it's my xmins or ymins coordinates.

Wondering if it could be my code:

these.variables <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

value <- c("pnhwht12", "pnhblk12", "phisp12", 
"pntv12", "pfb12", "polang12", "phs12", "pcol12", "punemp12", 
"pflabf12", "pprof12", "pmanuf12", "pvet12", "psemp12", "hinc12", 
"incpc12", "ppov12", "pown12", "pvac12", "pmulti12", "mrent12", 
"mhmval12", "p30old12", "p10yrs12", "p18und12", "p60up12", "p75up12", 
"pmar12", "pwds12", "pfhh12")

dd.name <- c("White, non-Hispanic", 
"Black, non-Hispanic", "Hispanic", "Native American", 
"Foreign born", "Speak other language at home, age 5 plus", 
"High school degree or less", "4-year college degree or more", 
"Unemployed", " Female labor force participation", 
"Professional employees", "Manufacturing employees", 
"Veteran", "Self-employed", "Median HH income, total", 
"Per capita income", "Poverty", "Owner-occupied units", 
"Vacant units", "Multi-family units", "Median rent", 
"Median home value", "Structures more than 30 years old", 
"HH in neighborhood 10 years or less", "17 and under", 
"60 and older", "75 and older, total", 
"Currently married, not separated", "Widowed, divorced and separated", 
"Female-headed families with children")

x <- dd.name

names(x) <- value

temp.names <- paste0( dd.name )

radioButtons( inputId="demographics", 
              label = h3("Census Variables"),
              choiceNames=temp.names,
              choiceValues=these.variables,
              selected="pnhwht12")

renderPlot({

# split the selected variable into deciles 

get_data <- 
  reactive({
             sea.sf <- 
             sea.sf %>% 
             mutate( q = ntile( get(input$demographics), 10 ) )  
          })

ggplot( get_data() ) +
    geom_sf( aes( fill = q ), color=NA ) +
    coord_sf( datum=NA ) +
    labs( title = paste0( "Choropleth of Select Demographics: ", toupper(input$demographics) ),
          caption = "Source: Harmonized Census Files",
          fill = "Population Deciles" ) +
    scale_fill_gradientn( colours=rev(ocean.balance(10)), guide = "colourbar" ) + 
    xlim( xmin = -13647722, xmax = -13567392 ) + 
    ylim( ymin = 6084955, ymax = 5941032 )

})

lecy commented 3 years ago

Are the xlim and ylim values adjusted for Seattle?

 xlim( xmin = -13647722, xmax = -13567392 ) + 
    ylim( ymin = 6084955, ymax = 5941032 )

These should be the same as your bounding boxes in tmap.

JasonSills commented 3 years ago

Yes, that is what has me confused. The box looks like what I would expect from the values being off, but they are what I've used in other labs and the tmap.

Tmap: bb <- st_bbox( c( xmin = -13647722, xmax = -13567392, ymax = 6084955, ymin = 5941032 ),

JasonSills commented 3 years ago

Here is the link to my dashboard on shiny: https://sills-asu.shinyapps.io/CPP-529-Seattle-Final-Project-Sills/

lecy commented 3 years ago

If I comment out the x and y lims then your data appears on the map, so that's definitely the problem.

Take a look again at your ymin and ymax. See the issue?

MeghanPaquette commented 3 years ago

Yes!! It worked now!

The last two dashboard tabs are still having some issues, I think with the variable names. I am looking at that as well, but if you see something specific, any feedback would be appreciated.

Meghan


JasonSills commented 3 years ago

@lecy

Well, I figured it out. This was one of those things where I knew there was something in the xlim and ylim code, but just couldn't see it. Hours and hours of trying all kinds of things. I even downloaded another student's source code and worked through that, to no avail. The issue? The ymax and ymin in my bbox above were switched, so I was inverting them.
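
For anyone hitting the same thing, a small sanity check along these lines might catch it faster than staring at the numbers. `check_limits` is a hypothetical helper, not part of the template.

```r
# Hypothetical helper: report whether each min is below its max
check_limits <- function(xmin, xmax, ymin, ymax) {
  c(x.ok = xmin < xmax, y.ok = ymin < ymax)
}

# Values as posted in the thread: the y limits are reversed
posted <- check_limits(xmin = -13647722, xmax = -13567392,
                       ymin = 6084955,   ymax = 5941032)

# Same numbers with ymin and ymax swapped back
fixed <- check_limits(xmin = -13647722, xmax = -13567392,
                      ymin = 5941032,   ymax = 6084955)

print(posted)
print(fixed)
```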

lecy commented 3 years ago

@JasonSills I had to stare at it for 10 minutes before seeing it. It reminds me how you can't edit your writing in real-time for grammar because your brain won't let you see it until you walk away and come back to it.