Closed nirajwagh314 closed 2 years ago
Hi,
I solved the above issue by using colSums() on my df.
My new question: calling colSums() gives me a nice, neat wide df, but when I try to add the year I run into issues. My goal is to bind together the colSums from multiple years. I have tried the mutate function, and I have also tried calling the df with $year = 2018, for instance. However, this turns my data into an itemized list rather than a nice table. I also tried cbind, but it made my data long, with no title for the column that holds race, which makes it tough to create a line graph based on race over time...
Here is my code I was using:
library(tidyverse)
library(dplyr)
## ---------------------------
## directory paths
## ---------------------------
## assume we're running this script from the ./scripts subdirectory
dat_dir <- file.path("..", "data")
## -----------------------------------------------------------------------------
## Wrangle data
## -----------------------------------------------------------------------------
## ---------------------------
## input
## ---------------------------
## data are CSV, so we use read_csv() from the readr library
## read in df for each year
df_1 <- read_csv(file.path(dat_dir, "gr2018.csv"))
df_1 <- df_1 %>%
  select(GRTYPE, CHRTSTAT, GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT) %>%
  filter(GRTYPE == 30) %>%
  select(GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT)
df_1a <- colSums (df_1)
df_1a
df_1a$year = 2018
df_1a
#2017 data
df_2 <- read_csv(file.path(dat_dir, "gr2017_rv.csv"))
df_2 <- df_2 %>%
  select(GRTYPE, CHRTSTAT, GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT) %>%
  filter(GRTYPE == 30) %>%
  select(GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT)
df_2a <- colSums (df_2)
df_2a <- df_2a %>%
df_1a$year = 2017
## append files
df_bind <- bind_rows(df_1a, df_2a)
## show
df_bind
@nirajwagh314, first thing: go back and check how I edited your code to include code blocks around your code. This makes it easier to read, so use those in the future.
Right now you are using a mixture of base R and tidyverse R. This is not inherently wrong, but it may be the root of your formatting issues. Instead of using colSums(), could you use the summarize() function (if you want to summarize multiple columns, check out summarize_at() or summarize_all())? Then you can use mutate() to add your year column, all in the same dplyr chain. Once you've done that for each year, you should be able to use bind_rows().
Give that a try and let me know.
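The suggestion above could be sketched roughly as follows. This is only a minimal illustration, not the thread author's code: the two small tibbles are made-up stand-ins for the filtered 2018 and 2017 data, with only two of the race columns, and the names df_2018, df_2017, sum_2018, and sum_2017 are hypothetical.

```r
library(dplyr)

# Toy stand-ins for the filtered yearly data (values are invented)
df_2018 <- tibble(GRASIAT = c(10, 20), GRWHITT = c(30, 40))
df_2017 <- tibble(GRASIAT = c(5, 15), GRWHITT = c(25, 35))

# Sum every column, then add the year, all in one dplyr chain
sum_2018 <- df_2018 %>%
  summarize_all(sum) %>%
  mutate(year = 2018)

sum_2017 <- df_2017 %>%
  summarize_all(sum) %>%
  mutate(year = 2017)

# Stack the one-row summaries into a single wide table, one row per year
df_bind <- bind_rows(sum_2018, sum_2017)
df_bind
```

Because each summary stays a data frame instead of becoming a named vector (which is what colSums() returns), mutate() and bind_rows() work on it without turning it into a list.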
df_1 <- read_csv(file.path(dat_dir, "gr2018.csv"))
df_1 <- df_1 %>%
  select(GRTYPE, CHRTSTAT, GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT) %>%
  filter(GRTYPE == 30) %>%
  select(GRAIANT, GRASIAT, GRBKAAT, GRHISPT, GRNHPIT, GRWHITT, GR2MORT, GRUNKNT, GRNRALT)
#df_1a <- colSums (df_1)
df_1a <- df_1 %>%
  summarize(df_1) %>%
  mutate(year = 2018)
df_1a
This is how I tried to do it but it is giving me an error:
Error: Column df_1 must be length 1 (a summary value), not 9
I am not sure how to fix this... thanks for your help.
Side note: I tried my best to make a code block above by putting three tick marks at the beginning and end of the code, but it is not working for me either. :(
Actually, I think I figured it out. :) Thanks.
Almost had it! When you do three code ticks, they need to be on their own line. You can click the ... on your comment above to see what I did.
When you do use summarize(), remember that it's much like mutate() in that you need to give it a new column name that equals the result of a summary function. So
df_1a <- df_1 %>%
  summarize(col_1_sum = sum(col_1),
            col_2_sum = sum(col_2)) %>%
  mutate(year = 2018)
Since you want to sum a number of columns, you might look into summarise_all() or across(). We didn't cover these because they aren't strictly necessary for your analysis, but you might find them helpful.
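For reference, here is a minimal sketch of the across() approach. It assumes dplyr 1.0.0 or later, and the tibble is a made-up stand-in for the filtered 2018 data with only two of the race columns.

```r
library(dplyr)

# Toy stand-in for the filtered 2018 data (values are invented)
df_1 <- tibble(GRASIAT = c(10, 20), GRWHITT = c(30, 40))

# across(everything(), sum) applies sum() to each column and keeps the column names
df_1a <- df_1 %>%
  summarize(across(everything(), sum)) %>%
  mutate(year = 2018)

df_1a
```

The result is a one-row data frame with the original column names plus year, so it can be passed straight to bind_rows() alongside the summaries for other years.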
Closing since it's older
Hi Dr. Skinner,
I am struggling a lot with my initial analyses for my data set for graduation year of 2018.
My question is: when I call the df, it does not find variables such as GRASIAT, among others. I am not sure why... I want to create an individual variable for the sum of each column so that I can make a simpler data frame composed of race and sum.
Thanks,
Niraj
GR2018.zip