bc-anaisabel / juniperus_paper

Pipeline for analyzing Illumina MiSeq paired-end data of fungal communities
BSD 2-Clause "Simplified" License

Create function to summarize mean and sd for all soil variables #12

Closed bc-anaisabel closed 3 years ago

bc-anaisabel commented 3 years ago

I want to create a function to summarize the mean and SD of all the variables I have in a data frame and combine them into one table.

At the moment I am obtaining the values for each of the variables separately using the function summarySE from the Rmisc package in R version 3.6.2 and then putting them together by creating a data frame object with all of them.

Input data looks like this:

# Load Rmisc (for summarySE) and import data
library(Rmisc)
soilvariables <- read.csv("soilvariables.csv", row.names = 1)

# Summarize each of the variables by Site

pH <- summarySE(soilvariables, measurevar = "pH", groupvars = c("Site"), na.rm = TRUE)
Pdis <- summarySE(soilvariables, measurevar = "Pdis", groupvars = c("Site"), na.rm = TRUE)
Ca <- summarySE(soilvariables, measurevar = "Ca", groupvars = c("Site"), na.rm = TRUE)
Mg <- summarySE(soilvariables, measurevar = "Mg", groupvars = c("Site"), na.rm = TRUE)
K <- summarySE(soilvariables, measurevar = "K", groupvars = c("Site"), na.rm = TRUE)
Na <- summarySE(soilvariables, measurevar = "Na", groupvars = c("Site"), na.rm = TRUE)
H <- summarySE(soilvariables, measurevar = "H", groupvars = c("Site"), na.rm = TRUE)
Al <- summarySE(soilvariables, measurevar = "Al", groupvars = c("Site"), na.rm = TRUE)
SoilM <- summarySE(soilvariables, measurevar = "SoilM", groupvars = c("Site"), na.rm = TRUE)

# Create a data frame that gathers the results for each variable
# (data.frame() places the per-variable results side by side)
soilvar <- data.frame(pH, Pdis, Ca, Mg, K, Na, H, Al, SoilM)

What I am obtaining for each variable looks like this:

  Site         N       Al       sd       se       ci
1 mixed        6 4.833333 3.430258 1.400397 3.599834
2 native       3 7.333333 2.081666 1.201850 5.171145
3 perturbated  3 8.000000 3.605551 2.081666 8.956686

VeronicaGlez commented 3 years ago

Can we use the functions in dplyr and tidyverse to group and summarize mean and sd only instead of using function summarySE?

https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
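
For reference, here is a minimal sketch of what that dplyr/tidyr approach could look like. The toy data frame and its values are made up to stand in for soilvariables.csv, and the names soil_summary, Variable, and value are only illustrative. It assumes dplyr 1.0 or later for the .groups argument.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for soilvariables.csv; the values are made up for illustration.
soilvariables <- data.frame(
  Site = rep(c("mixed", "native"), each = 3),
  pH   = c(5.1, 5.4, NA, 6.0, 6.2, 6.1),
  Al   = c(4, 5, 6, 7, 8, 9)
)

soil_summary <- soilvariables %>%
  # Stack every soil variable into one "value" column, keeping Site
  pivot_longer(cols = -Site, names_to = "Variable", values_to = "value") %>%
  # One group per variable and site
  group_by(Variable, Site) %>%
  # N counts the non-missing samples; mean and sd ignore missing values
  summarise(
    N    = sum(!is.na(value)),
    mean = mean(value, na.rm = TRUE),
    sd   = sd(value, na.rm = TRUE),
    .groups = "drop"
  )

print(soil_summary)
```

This gives one long table with a row per variable and site, so no per-variable objects need to be created or bound together afterwards.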

bc-anaisabel commented 3 years ago

We tried to create a for loop using:

###summary(data)

pH <- summarySE(soilvariables, measurevar= "pH", groupvars=c("Site"), na.rm = TRUE)
Pdis <-summarySE(soilvariables, measurevar= "Pdis", groupvars=c("Site"), na.rm = TRUE)
Ca <-summarySE(soilvariables, measurevar= "Ca", groupvars=c("Site"), na.rm = TRUE)
Mg <-summarySE(soilvariables, measurevar= "Mg", groupvars=c("Site"), na.rm = TRUE)
K <-summarySE(soilvariables, measurevar= "K", groupvars=c("Site"), na.rm = TRUE)
Na <-summarySE(soilvariables, measurevar= "Na", groupvars=c("Site"), na.rm = TRUE)
H <-summarySE(soilvariables, measurevar= "H", groupvars=c("Site"), na.rm = TRUE)
Al <-summarySE(soilvariables, measurevar= "Al", groupvars=c("Site"), na.rm = TRUE)
SoilM <-summarySE(soilvariables, measurevar= "SoilM", groupvars=c("Site"), na.rm = TRUE)

for (i in c("pH","Pdis","Ca","Mg","K","Na","H","Al","SoilM")){
  i = vector()
  i <- summarySE(soilvariables, measurevar= i, groupvars=c("Site"), na.rm = T)
}

bc-anaisabel commented 3 years ago

We need to create a new column for the for loop that's named "Variable" and repeat the name of the variable in each row of the data frame, so it can use that column to use rowbind later.

We also realized that the vector for the for loop needs to have the dimension of what we need to obtain and we don't know how to do that in R
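
A rough sketch of one way to do both things, in case it helps: pre-size a list with one slot per variable, fill it inside the loop, and tag each result with a Variable column before row-binding. The data values are made up, and summarise_one is a hypothetical base-R stand-in for summarySE (it only returns N, mean, and sd), used so the example runs without extra packages.

```r
# Toy data; the values are made up for illustration.
soil <- data.frame(
  Site = rep(c("mixed", "native"), each = 3),
  pH   = c(5.1, 5.4, NA, 6.0, 6.2, 6.1),
  Al   = c(4, 5, 6, 7, 8, 9)
)
vars <- c("pH", "Al")

# Hypothetical helper: N, mean, and sd of one column for each Site
summarise_one <- function(df, var) {
  groups <- split(df[[var]], df$Site)
  out <- data.frame(
    Site     = names(groups),
    N        = vapply(groups, function(v) sum(!is.na(v)), integer(1)),
    mean     = vapply(groups, mean, numeric(1), na.rm = TRUE),
    sd       = vapply(groups, sd, numeric(1), na.rm = TRUE),
    Variable = var
  )
  rownames(out) <- NULL
  out
}

# Pre-size the container: one list slot per variable
results <- vector("list", length(vars))
names(results) <- vars

# The loop variable stays a name; each result goes into its own slot
for (v in vars) {
  results[[v]] <- summarise_one(soil, v)
}

# Row-bind all slots into a single long table
long_table <- do.call(rbind, results)
rownames(long_table) <- NULL
print(long_table)
```

The point of the list is that the loop variable is never overwritten, so it can still be passed as the variable name on every iteration.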

redgcko7 commented 3 years ago

So I think I figured out the FOR loop, after referencing an issue from a previous semester that Alicia mentioned...

library(Rmisc)  # provides summarySE
data <- read.csv("soilvariables.csv", row.names = 1)

# extract variable names from data table column names ("soilvariables.csv")

x <- colnames(data[, 1:9])

# loop that calculates summary statistics for each variable
# changes the third column name to "mean" instead of variable name
# and adds additional column titled 'Variable' with variable name
# assign() then stores each result in an object named after the variable

for (i in x) {
  a <- summarySE(data, measurevar = i, groupvars = c("Site"), na.rm = TRUE)
  names(a)[names(a) == i] <- "mean"
  a$Variable <- i
  assign(i, a)
}

This results in 9 data frames (3 rows, 7 columns each) that can theoretically be combined and/or summarized using functions in dplyr and tidyverse. I tried to do this, but wasn't sure exactly how you want the finalized data table to look... if you post an example final table, maybe I can help more.
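
As one possible way to combine them, here is a small sketch that collects the data frames by name with mget() and stacks them with rbind(). The two data frames below are made-up stand-ins for the objects the loop creates with assign(), with fewer columns than the real ones, and combined is just an illustrative name.

```r
# Stand-ins for the per-variable data frames created by assign() in the loop;
# the values are made up for illustration.
pH <- data.frame(Site = c("mixed", "native"), N = c(3L, 3L),
                 mean = c(5.2, 6.1), Variable = "pH")
Al <- data.frame(Site = c("mixed", "native"), N = c(3L, 3L),
                 mean = c(5.0, 8.0), Variable = "Al")

# Names of the objects to collect (same vector the loop iterated over)
x <- c("pH", "Al")

# mget() fetches the objects by name into a list; rbind() stacks them by row
combined <- do.call(rbind, mget(x))
rownames(combined) <- NULL
print(combined)
```

This avoids typing every object name into rbind() by hand, which matters more as the number of variables grows.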

bc-anaisabel commented 3 years ago

Thanks! This worked. So now what I did was use the command rbind to get what I wanted, which was not a wide but a long format table:

Bind_soilvar<-rbind(pH, Pdis, Ca, Mg, K, Na, H, Al, SoilM, C, Nit)

The only change I had to make for this to work was renaming my nitrogen variable. Its abbreviation was "N", which confused both summarySE and dplyr because "N" is also the name of one of the output columns (the column for the number of samples). So instead of N I used Nit.
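
For anyone who prefers to do the rename in R instead of editing the CSV, a small sketch follows. The data frame and its values are made up, and whether the rename happens in R or in the file is only an assumption about workflow.

```r
# Toy data; the values are made up. "N" stands for nitrogen here.
soil <- data.frame(
  Site = c("mixed", "mixed", "native", "native"),
  pH   = c(5.1, 5.4, 6.0, 6.2),
  N    = c(0.21, 0.25, 0.30, 0.28)
)

# Rename the nitrogen column before summarizing, so it cannot be confused
# with the sample-count column that the summary adds
names(soil)[names(soil) == "N"] <- "Nit"
print(names(soil))
```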

bc-anaisabel commented 3 years ago

So the simplified script looks like this:

# Load Rmisc (for summarySE) and import data
library(Rmisc)
data <- read.csv("soilvariables.csv", row.names = 1)

# extract variable names from data table column names ("soilvariables.csv")

x <- colnames(data[, 1:11])

# loop that calculates summary statistics for each variable
# changes the third column name to "mean" instead of variable name
# and adds additional column titled 'Variable' with variable name

for (i in x) {
  a <- summarySE(data, measurevar = i, groupvars = c("Site"), na.rm = TRUE)
  names(a)[names(a) == i] <- "mean"
  a$Variable <- i
  assign(i, a)
}

# Stack the per-variable data frames by row into one long-format table
Bind_soilvar <- rbind(pH, Pdis, Ca, Mg, K, Na, H, Al, SoilM, C, Nit)

The output looks like this:

bind_soilvar.xlsx