edquant / edh7916

Course materials and website for EDH7916: Contemporary Research in Higher Education
https://edquant.github.io/edh7916/

Initial Analyses Question #33

Closed nirajwagh314 closed 2 years ago

nirajwagh314 commented 4 years ago

Hi Dr. Skinner,

I am struggling a lot with my initial analyses for my data set for graduation year of 2018.

## ---------------------------
## libraries
## ---------------------------

library(tidyverse)
library(dplyr)

## ---------------------------
## directory paths
## ---------------------------

## assume we're running this script from the ./scripts subdirectory
dat_dir <- file.path("..", "data")

## -----------------------------------------------------------------------------
## Wrangle data
## -----------------------------------------------------------------------------

## ---------------------------
## input
## ---------------------------

## data are CSV, so we use read_csv() from the readr library

## read in df for each year

df <- read_csv(file.path(dat_dir, "gr2018.csv"))
df <- df %>%
  select(GRTYPE, CHRTSTAT, GRTOTLT, GRAIANT, GRASIAT, XGRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT) %>%
  mutate(Year = 2018) %>%
  mutate(GRASIAT = ifelse(XGRASIAT == A, 0, GRASIAT)) %>%
  filter(GRTYPE == 30)

a = sum(df, GRASIAT)

a

My question is: when I call the df, R does not find variables such as GRASIAT, and I am not sure why. I want to store the sum of each column in its own variable so that I can build a simpler data frame made up of race and sum.

Thanks,

Niraj

GR2018.zip
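
A side note on the first script, offered as a sketch rather than a confirmed diagnosis: `sum(df, GRASIAT)` looks for an object named `GRASIAT` in the workspace, not a column of `df`, and `XGRASIAT == A` likewise looks for an object named `A`. The example below uses a made-up tibble and assumes the flag column holds the letter "A" as a character value; neither is taken from the real gr2018.csv file.

```r
library(dplyr)

## toy data: the counts and flag values are illustrative assumptions
df <- tibble(GRASIAT  = c(5, 7, 9),
             XGRASIAT = c("R", "A", "R"))

## quote "A" so R compares against the string, not an object named A
df <- df %>%
  mutate(GRASIAT = ifelse(XGRASIAT == "A", 0, GRASIAT))

## base R: refer to the column through the data frame
a <- sum(df$GRASIAT)

## dplyr: column names can be used bare inside summarize()
b <- df %>%
  summarize(total = sum(GRASIAT)) %>%
  pull(total)

a
b
```

Both approaches give the same total; the difference is only whether the column is reached with `$` or inside a dplyr verb.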

nirajwagh314 commented 4 years ago

Hi,

I solved the above issue by using colSums() on my df.

Now my new question: colSums() gives me a nice, neat wide result, but I run into issues when I try to add the year. My goal is to bind together the colSums() output from multiple years. I have tried mutate(), and I have also tried assigning the column directly with, for instance, df_1a$year = 2018. However, this turns my data into an itemized list rather than a nice table. I also tried cbind(), but it made my data long and left the column holding race without a name, which makes it tough to create a line graph by race over time.

Here is my code I was using:

library(tidyverse)
library(dplyr)

## ---------------------------
## directory paths
## ---------------------------

## assume we're running this script from the ./scripts subdirectory
dat_dir <- file.path("..", "data")

## -----------------------------------------------------------------------------
## Wrangle data
## -----------------------------------------------------------------------------

## ---------------------------
## input
## ---------------------------

## data are CSV, so we use read_csv() from the readr library

## read in df for each year

df_1 <- read_csv(file.path(dat_dir, "gr2018.csv"))
df_1 <- df_1 %>%
  select(GRTYPE, CHRTSTAT, GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT) %>%
  filter(GRTYPE == 30) %>%
  select(GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT )

df_1a <- colSums (df_1)

df_1a

df_1a$year = 2018

df_1a

#2017 data

df_2 <- read_csv(file.path(dat_dir, "gr2017_rv.csv"))

df_2 <- df_2 %>%
  select(GRTYPE, CHRTSTAT, GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT) %>%
  filter(GRTYPE == 30) %>%
  select(GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT )

df_2a <- colSums (df_2) 

df_2a <- df_2a %>%

df_1a$year = 2017

## append files
df_bind <- bind_rows(df_1a, df_2a)

## show
df_bind

GR2017.zip GR2018.zip

btskinner commented 4 years ago

@nirajwagh314, first thing: go back and check how I edited your code to include code blocks around your code. This makes it easier to read, so use those in the future.

Right now you are using a mixture of base R and tidyverse R. This is not inherently wrong, but it may be the root of your formatting issues. Instead of using colSums(), could you use the summarize() function (if you want to summarize multiple columns, check out summarize_at() or summarize_all())? Then you can use mutate() to add your year column, all in the same dplyr chain. Once you've done that for each year, you should be able to use bind_rows().
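
A minimal sketch of that workflow, using made-up two-column tibbles in place of the filtered yearly files (the values are illustrative, not from the real data). It also shows why the base R route is awkward here: colSums() returns a named numeric vector rather than a data frame, so mutate() and bind_rows() have nothing tabular to work with.

```r
library(dplyr)

## toy stand-ins for the filtered yearly data; values are made up
df_1 <- tibble(GRASIAT = c(1, 2, 3), GRWHITT = c(10, 20, 30))
df_2 <- tibble(GRASIAT = c(4, 5, 6), GRWHITT = c(40, 50, 60))

## colSums() gives a named numeric vector, not a data frame
is.data.frame(colSums(df_1))

## summarize_all() keeps a one-row data frame, so mutate() can add the year
df_1a <- df_1 %>%
  summarize_all(sum) %>%
  mutate(year = 2018)

df_2a <- df_2 %>%
  summarize_all(sum) %>%
  mutate(year = 2017)

## stack the yearly summaries into one table
df_bind <- bind_rows(df_1a, df_2a)
df_bind
```

Each yearly summary is a single row with the same column names, which is what lets bind_rows() line them up.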

Give that a try and let me know.

nirajwagh314 commented 4 years ago

df_1 <- read_csv(file.path(dat_dir, "gr2018.csv"))
df_1 <- df_1 %>%
  select(GRTYPE, CHRTSTAT, GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT) %>%
  filter(GRTYPE == 30) %>%
  select(GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT )

#df_1a <- colSums (df_1)

df_1a <- df_1 %>%
 summarize(df_1) %>%
  mutate(year = 2018)

df_1a

This is how I tried to do it, but it gives me an error:

Error: Column df_1 must be length 1 (a summary value), not 9

I am not sure how to fix this... thanks for your help.

nirajwagh314 commented 4 years ago

Side note: I tried my best to make a code block above by putting three tick marks at the beginning and end of the code, but it is not working for me. :(

nirajwagh314 commented 4 years ago

Actually, I think I figured it out. :) Thanks.

btskinner commented 4 years ago

Almost had it! When you do three code ticks, they need to be on their own line. You can click the ... on your comment above to see what I did.

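When you do use summarize(), remember that it's much like mutate() in that you need to give it a new column name that equals the result of a summary function. Passing the whole data frame, as in summarize(df_1), is what triggers the "must be length 1" error. For example: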

df_1a <- df_1 %>%
   summarize(col_1_sum = sum(col_1),
             col_2_sum = sum(col_2)) %>%
   mutate(year = 2018)

Since you want to sum a number of columns, you might look into summarise_all() or across(). We didn't cover these because they aren't strictly necessary for your analysis, but you might find them helpful.
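
As a rough sketch of the across() version, assuming dplyr 1.0.0 or later and using a made-up tibble in place of the filtered data:

```r
library(dplyr)

## toy stand-in for the filtered data; values are made up
df_1 <- tibble(GRASIAT = c(1, 2, 3), GRWHITT = c(10, 20, 30))

## across(everything(), sum) applies sum() to every column,
## keeping the original column names in a one-row data frame
df_1a <- df_1 %>%
  summarize(across(everything(), sum)) %>%
  mutate(year = 2018)

df_1a
```

Swapping everything() for a selection such as c(GRASIAT, GRWHITT) would limit the sums to those columns.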

btskinner commented 2 years ago

Closing since it's older