DS4PS / cpp-526-spr-2020

Course shell for CPP 526 Foundations of Data Science I for Spring 2020.
http://ds4ps.org/cpp-526-spr-2020/

Code Through Help #33

Open karen-liz opened 4 years ago

karen-liz commented 4 years ago

Hi! I'm working on my Code Through assignment and I'm having a difficult time finding a way to plot certain variables. My dataset includes a variable, "Segment_description," that holds a variety of descriptions such as "mobile respondents," "female respondents," and "Your parents make? $90K-$240k."

Here's the dataset: ( https://data.world/ahalps/which-social-media-millennials-care-about-most/workspace/query?queryid=sample-0 )

For this Code Through, I want to focus on gender, so I'd like to plot the answers for "female respondents" and "male respondents." The problem is that I can't figure out how to plot the variable "Answer" on the x-axis and "Count" on the y-axis, because I can't figure out how to extract these specific rows from these columns. I've tried using the operator "&" but for some reason, I can't get it to plot. Any guidance would be greatly appreciated!

jamisoncrawford commented 4 years ago

The link requires a sign-in to access the data. Could you attach the CSV here with a truncated list of rows?

Alternatively, and as a best practice, could you provide a sample of the data you wish to plot in tabular format, or by recreating a sample of the data? (Unless the problem is in extracting it, of course!)

For example:

x <- data.frame(some_variable = c(5, 10, 20, 50, 100),
                another_variable = c("A", "C", "B", "A", "C"))

karen-liz commented 4 years ago

whatsgoodlydata10-which-social-media-millennials-care-about-most-QueryResult.xlsx

Okay, here's an xlsx file hoping this works!

Here's a screenshot if the xlsx doesn't work. [image: gender]

jamisoncrawford commented 4 years ago

You could use function filter() from package dplyr, like so:


library(readxl)
library(dplyr)

file <- "2020-02-29_soc-media_data.xlsx"

dat <- read_excel(file); rm(file)

dat %>%
  filter(segment_type == "Gender") %>%
  select(segment_type:percentage)

# A tibble: 8 x 5
  segment_type segment_description answer    count percentage
  <chr>        <chr>               <chr>     <dbl>      <dbl>
1 Gender       Female respondents  Instagram  1576      0.3  
2 Gender       Female respondents  Facebook    644      0.122
3 Gender       Female respondents  Snapchat   2967      0.564
4 Gender       Female respondents  Linkedin     73      0.014
5 Gender       Male respondents    Instagram  1008      0.24 
6 Gender       Male respondents    Facebook    565      0.135
7 Gender       Male respondents    Snapchat   2483      0.591
8 Gender       Male respondents    Linkedin    142      0.034
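
Since the original attempt used `&`, here is a minimal sketch of why that comes back empty and what to use instead. The data frame `dat_demo` is a hand-built stand-in (an assumption, not the real file): the gender rows copy the counts shown above, and the extra "Mobile respondents" row carries a placeholder `NA` count just to have a non-gender segment to filter out.

```r
library(dplyr)

# Hypothetical stand-in for the survey data; gender counts copied from the
# output above, the "Mobile respondents" count is a placeholder NA
dat_demo <- data.frame(
  segment_description = c(rep("Female respondents", 4),
                          rep("Male respondents", 4),
                          "Mobile respondents"),
  answer = c(rep(c("Instagram", "Facebook", "Snapchat", "Linkedin"), 2),
             "Instagram"),
  count = c(1576, 644, 2967, 73, 1008, 565, 2483, 142, NA),
  stringsAsFactors = FALSE
)

# `&` asks for rows that are female AND male at the same time,
# which no single row can satisfy, so the result is empty
both <- dat_demo %>%
  filter(segment_description == "Female respondents" &
           segment_description == "Male respondents")
nrow(both)

# `%in%` (or `|`) keeps rows that match either description
gender_rows <- dat_demo %>%
  filter(segment_description %in% c("Female respondents", "Male respondents"))
nrow(gender_rows)
```

The same logic applies in base R: `dat_demo[dat_demo$segment_description %in% c("Female respondents", "Male respondents"), ]`.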

You could spruce it up a bit more with mutate() and a find and replace using gsub():

dat %>%
  filter(segment_type == "Gender") %>%
  mutate(gender = gsub(x = segment_description, 
                       pattern = " respondents", 
                       replacement = "")) %>%
  select(gender, answer:percentage)

# A tibble: 8 x 4
  gender answer    count percentage
  <chr>  <chr>     <dbl>      <dbl>
1 Female Instagram  1576      0.3  
2 Female Facebook    644      0.122
3 Female Snapchat   2967      0.564
4 Female Linkedin     73      0.014
5 Male   Instagram  1008      0.24 
6 Male   Facebook    565      0.135
7 Male   Snapchat   2483      0.591
8 Male   Linkedin    142      0.034

Now, this is in "tidy" format (Wickham, 2014) - which is great for plotting in packages like ggplot2. However, you may need to untidy it for plotting with base R graphics (or other reasons). You can untidy your data with function spread() from package tidyr.

library(tidyr)

dat %>%
  filter(segment_type == "Gender") %>%
  mutate(gender = gsub(x = segment_description, 
                       pattern = " respondents", 
                       replacement = "")) %>%
  select(gender, answer:count) %>%
  spread(key = gender, value = count)

# A tibble: 4 x 3
  answer    Female  Male
  <chr>      <dbl> <dbl>
1 Facebook     644   565
2 Instagram   1576  1008
3 Linkedin      73   142
4 Snapchat    2967  2483
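
As a side note, newer versions of tidyr (1.0.0 and later) offer `pivot_wider()` as the successor to `spread()`. A minimal sketch of the same reshape, assuming a small hand-built data frame `gender_counts` (a hypothetical name) that recreates the tidy table above:

```r
library(tidyr)

# Hand-built copy of the tidy gender table shown earlier
gender_counts <- data.frame(
  gender = rep(c("Female", "Male"), each = 4),
  answer = rep(c("Instagram", "Facebook", "Snapchat", "Linkedin"), 2),
  count  = c(1576, 644, 2967, 73, 1008, 565, 2483, 142)
)

# One row per answer, one column per gender, cells filled from count
wide <- pivot_wider(gender_counts,
                    names_from  = gender,
                    values_from = count)
wide
```

Unlike `spread()`, `pivot_wider()` keeps the rows in order of first appearance rather than sorting them, so the row order may differ from the output above.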

Because you have two categorical variables (answer and gender), as well as a continuous variable (count), you'll probably want a stacked or grouped bar plot. Here's a great Stack Overflow post for very similar data and a few different solutions.

jamisoncrawford commented 4 years ago

P.S. How I'd approach it in ggplot2 would be like so:

library(dplyr)
library(tidyr)
library(readxl)
library(ggplot2)

file <- "2020-02-29_soc-media_data.xlsx"

dat <- read_excel(file); rm(file)

dat %>%
  filter(segment_type == "Gender") %>%
  mutate(gender = gsub(x = segment_description, 
                       pattern = " respondents", 
                       replacement = "")) %>%
  select(gender, answer:count) %>%
  ggplot(aes(x = reorder(answer, -count), 
             y = count, 
             fill = gender)) +
  geom_bar(stat = "identity") +
  theme_minimal()

This gives you:

image

You can spruce that up a bit with some extra functions/arguments in ggplot2 and comma from package scales:

library(dplyr)
library(tidyr)
library(readxl)
library(scales)
library(ggplot2)

file <- "2020-02-29_soc-media_data.xlsx"

dat <- read_excel(file); rm(file)

dat %>%
  filter(segment_type == "Gender") %>%
  mutate(gender = gsub(x = segment_description, 
                       pattern = " respondents", 
                       replacement = "")) %>%
  select(gender, answer:count) %>%
  ggplot(aes(x = reorder(answer, -count), 
             y = count, 
             fill = gender)) +
  geom_bar(stat = "identity",
           alpha = 0.75) +
  scale_y_continuous(labels = comma) +
  labs(fill = "Gender",
       x = "Preference",
       y = "Respondents",
       title = "Preferences by Platform & Gender",
       subtitle = "9,458 Respondents",
       caption = "Source: Data World") +
  theme_minimal()

image
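
The plots above are stacked. If you'd rather have the grouped (side-by-side) version mentioned earlier, one option is `position = "dodge"`. This is only a sketch: it assumes a hand-built data frame `gender_counts` (a hypothetical name) in place of the piped Excel data, and it swaps in `geom_col()`, which is shorthand for `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Hand-built copy of the tidy gender table shown earlier
gender_counts <- data.frame(
  gender = rep(c("Female", "Male"), each = 4),
  answer = rep(c("Instagram", "Facebook", "Snapchat", "Linkedin"), 2),
  count  = c(1576, 644, 2967, 73, 1008, 565, 2483, 142)
)

# position = "dodge" places the gender bars next to each other
# instead of stacking them
p <- ggplot(gender_counts,
            aes(x = reorder(answer, -count),
                y = count,
                fill = gender)) +
  geom_col(position = "dodge") +
  labs(x = "Preference", y = "Respondents", fill = "Gender") +
  theme_minimal()

p
```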

And that's my code-through for your code-through :). Hope this helps!

karen-liz commented 4 years ago

Jamison, I kept reading online that ggplot2 would be a great package for this but I couldn't figure it out. Thank you, this was so much help and I can now do what I have been trying to do for the past day! Thanks again and have a wonderful weekend!


lecy commented 4 years ago

A core R version as well :-)

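# recreate this table
#   answer    Female  Male
#   <chr>      <dbl> <dbl>
# 1 Facebook     644   565
# 2 Instagram   1576  1008
# 3 Linkedin      73   142
# 4 Snapchat    2967  2483

# note: table() only counts rows per combination; it does not sum the
# count column, and its result was overwritten by the demo data below
# t <- table( dat$segment_description, dat$answer )

# quick build of data for demo
segment <- c("Facebook","Instagram","Linkedin","Snapchat")
female <- c(644,1576,73,2967)
male <- c(565,1008,142,2483)
counts <- rbind(female,male)  # named counts to avoid masking base::t()

# grouped bars: one cluster per platform, one bar per gender
barplot( counts, beside=TRUE, 
        col=c("aquamarine3","coral"), 
        names.arg=segment )

legend( "topleft", c("female","male"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")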

image

ggplot is a lot nicer, but the data steps can be more complicated at times.
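
If you'd rather not type the matrix by hand, one possible base R route is `xtabs()`, which sums a numeric column over the levels of one or more grouping variables. This is a sketch under assumptions: `gender_counts` is a hypothetical hand-built copy of the tidy gender table, not the original `dat`.

```r
# Hand-built copy of the tidy gender table shown earlier
gender_counts <- data.frame(
  gender = rep(c("Female", "Male"), each = 4),
  answer = rep(c("Instagram", "Facebook", "Snapchat", "Linkedin"), 2),
  count  = c(1576, 644, 2967, 73, 1008, 565, 2483, 142)
)

# Sum count within each gender x answer cell; rows are genders,
# columns are platforms (sorted alphabetically)
counts <- xtabs(count ~ gender + answer, data = gender_counts)

# Grouped bars, with the legend built from the row names
barplot(counts, beside = TRUE,
        col = c("aquamarine3", "coral"),
        legend.text = TRUE,
        args.legend = list(x = "topleft", bty = "n"))
```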

jamisoncrawford commented 4 years ago

@karen-liz you're very welcome! Glad this was helpful. I think it's a great example of how these packages interface quite nicely in the "Tidyverse" ecosystem (tidyr, dplyr, readxl, and ggplot2). You can pipe your data directly from the web and visualize it in a single expression!

@lecy Thank you for this! I was actually struggling with grouped bar plots in base R graphics and I found an example that was very similar to yours - I just couldn't get the data layer right!

ggplot2 is nice but I'm sure you can make something really polished in graphics. The NYT visualization has given me a new appreciation!

qplot() in ggplot2 is easy to use and good for "quick and dirty" graphics but lacks customization options, so that's when you'd have to really learn the systems under the hood!

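library(ggplot2)

# `tidy` is assumed to be the filtered gender/answer/count data frame
# built with the dplyr pipeline earlier in this thread
# note: qplot() is deprecated as of ggplot2 3.4.0
qplot(data = tidy, 
      x = reorder(answer, -count), 
      y = count, 
      fill = gender, 
      xlab = "Preferences",
      ylab = "Respondents",
      main = "Preferences by Platform & Gender",
      geom = "col") +
  theme_minimal()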

image

karen-liz commented 4 years ago

This is exactly what I was struggling with! Wow, I'm impressed that there are multiple ways to do this. Thank you both! This was extremely helpful!

jamisoncrawford commented 4 years ago

We're just nerding out in public :).

jamisoncrawford commented 4 years ago

P.S. Glad this helped! P.P.S. You can also plot it with 'googleVis', 'lattice', 'plotly', and a few other data viz packages!