I am learning with the following tutorial

https://chiliubio.github.io/microeco/

microeco

An R package for data mining in microbial community ecology

Background

In microbial community ecology, with the development of high-throughput sequencing techniques,

the increasing data amount and complexity make the community data analysis and management a challenge.

There has been a lot of R packages created for the microbiome profiling analysis.

$However, it is still difficult to perform data mining fast and efficiently.

Therefore, we created R microeco package.

install.packages("microeco")

If devtools package is not installed, first install it

install.packages("devtools")

then install microeco

devtools::install_github("ChiLiubio/microeco")

library(microeco)

Dependence

Important packages

If a package is not installed, it will be installed from CRAN.

First select the packages of interest

packages <- c("reshape2", "MASS", "GUniFrac", "ggpubr", "randomForest", "ggdendro", "ggrepel", "agricolae", "gridExtra", "picante", "pheatmap", "igraph", "rgexf", "ggalluvial")

Now check or install

lapply(packages, function(x) { if(!require(x, character.only = TRUE)) { install.packages(x, dependencies = TRUE) }})

ggtree

Plotting the cladogram from the LEfSe result requires the ggtree package in bioconductor

(https://bioconductor.org/packages/release/bioc/html/ggtree.html).

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("ggtree")

library("writexl") library(phyloseq) library(raster) library(ggplot2) library(vegan)

-------------------------------------------------------------------

-------------------BASIC CLASS

library(microeco)

load the example data; 16S rRNA gene amplicon sequencing dataset

data(sample_info_16S) data(otu_table_16S) data(taxonomy_table_16S) data(phylo_tree_16S)

load the environmental data which is detached from sample table

data(env_data_16S)

use pipe operator in magrittr package

library(magrittr)

set.seed is used to fix the random number generation to make the results repeatable

set.seed(123)

make the plotting background same with the tutorial

library(ggplot2) theme_set(theme_bw())

class(otu_table_16S)

taxonomy_table_16S[1:5, 1:3]

make the taxonomic information unified, important

taxonomy_table_16S %<>% tidy_taxonomy

class(sample_info_16S)

sample_info_16S[1:5, ]

VKS-----Just to see the table

VKS <- sample_info_16S[] fix(VKS)

VKS <- taxonomy_table_16S[] fix(VKS)

class(env_data_16S) env_data_16S

class(phylo_tree_16S) phylo_tree_16S

we create an object of microtable class.

This operation is very similar with the package phyloseq(Mcmurdie and Holmes 2013),

but microeco is more brief and simpler.

The otu_table in the microtable class must be the species-sample format: rownames must be OTU names,

colnames must be sample names.

The required sample names must be same in rownames of sample_table and colnames of otu_table.

In R6 class, '$new' is the original method used to create a new object of class

dataset <- microtable$new(otu_table = otu_table_16S)

No sample_table provided, automatically use colnames in otu_table to create one ...

If you only provide abundance table, the class can help you create a sample info table

class(dataset)

dataset

microtable class:

sample_table have 90 rows and 2 columns

otu_table have 13628 rows and 90 columns

Let's create a dataset microtable with more information

dataset <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S, tax_table = taxonomy_table_16S, phylo_tree = phylo_tree_16S) dataset

microtable class:

sample_table have 90 rows and 4 columns

otu_table have 13628 rows and 90 columns

tax_table have 13628 rows and 7 columns

phylo_tree have 14096 tips

To make the species and sample information consistent across different files in the dataset object,

we can use function tidy_dataset() to trim the dataset.

dataset$tidy_dataset() print(dataset)

microtable class:

sample_table have 90 rows and 4 columns

otu_table have 13628 rows and 90 columns

tax_table have 13628 rows and 7 columns

phylo_tree have 13628 tips

Then, we remove OTUs which are not assigned in the Kingdom "kArchaea" or "kBacteria".

dataset$tax_table %<>% base::subset(Kingdom == "kArchaea" | Kingdom == "kBacteria") print(dataset)

microtable class:

sample_table have 90 rows and 4 columns

otu_table have 13628 rows and 90 columns

tax_table have 13330 rows and 7 columns

phylo_tree have 13628 tips

We also remove OTUs with the taxonomic assignments “mitochondria” or “chloroplast.”

This will remove the lines containing the taxa word regardless of taxonomic ranks and ignoring word case in the tax_table.

So if you want to filter some taxa not considerd pollutions, please use subset like the previous operation.

dataset$filter_pollution(taxa = c("mitochondria", "chloroplast"))

print(dataset)

Then, to make the OTUs same in otu_table, tax_table and phylo_tree, we use tidy_dataset() again.

dataset$tidy_dataset() print(dataset)

then we use sample_sums() to check the sequence numbers in each sample.

dataset$sample_sums() %>% range

[1] 10316 37087

Sometimes, in order to reduce the effects of sequencing depth on the diversity measurements,

we need to perform the resampling to make the sequence number equal for each sample.

The function rarefy_samples can invoke the function tidy_dataset automatically before and after the rarefying.

As an example, we use 10000 sequences in each sample

dataset$rarefy_samples(sample.size = 10000)

dataset$sample_sums() %>% range

Then, we calculate the taxa abundance at each taxonomic rank using cal_abund().

This function return a list called taxa_abund containing several data frame of

the abundance information at each taxonomic rank. The list is stored in the microtable object automatically.

It’s worth noting that the cal_abund() function can be used to solve some complex cases,

such as supporting both the relative and absolute abundance calculation and selecting the partial taxonomic columns.

Those have been shown in README of file2meco package (https://github.com/ChiLiubio/file2meco).

dataset$cal_abund()

return dataset$taxa_abund

class(dataset$taxa_abund)

The function save_abund() can be used to save the taxa abundance file to a local place.

dir.create("taxa_abund") dataset$save_abund(dirpath = "taxa_abund")

Then, let’s calculate the alpha diversity. The result is also stored in the object microtable automatically.

For the definition of each alpha diversity index,

please see http://scikit-bio.org/docs/latest/generated/skbio.diversity.alpha.html

As an example, we do not calculate phylogenetic diversity.

If you want to add Faith's phylogenetic diversity, use PD = TRUE, this will be a little slow

dataset$cal_alphadiv(PD = FALSE)

return dataset$alpha_diversity

class(dataset$alpha_diversity)

save dataset$alpha_diversity to a directory

dir.create("alpha_diversity") dataset$save_alphadiv(dirpath = "alpha_diversity")

We also calculate the distance matrix of beta diversity using function cal_betadiv().

We provide four most frequently used indexes: Bray-curtis, Jaccard, weighted Unifrac and unweighted unifrac.

If you do not want to calculate unifrac metrics, use unifrac = FALSE

require GUniFrac package

dataset$cal_betadiv(unifrac = TRUE)

return dataset$beta_diversity

class(dataset$beta_diversity)

save dataset$beta_diversity to a directory

dir.create("beta_diversity") dataset$save_betadiv(dirpath = "beta_diversity")

------------------------------------------------------------

--------------------3.1 microtable

--------visit-------------------------class https://chiliubio.github.io/microeco_tutorial/basic-class.html#microtable-class

The microtable class is the basic class and designed to store the basic data

for all the downstream analysis in the microeco package.

At least, the OTU table (i.e. species-sample abundance table) should be provided to create microtable object.

Thus, the microtable class can recognize the sample information table is missing

and create a default sample table according to the sample names of otu_table.

in this tutorial, we use the data inside the package microeco to show some operations.

The 16S rRNA sequencing results in the example data of the package is used to show the main part of the tutorial.

This dataset is the 16S rRNA gene Miseq sequencing results of wetland soils in China published by An et al.(An et al. 2019),

who surveyed soil bacterial communities in Chinese inland wetlands (IW),

coastal wetland (CW) and Tibet plateau wetlands (TW) using 16S rRNA gene amplicon sequencing method.

These wetlands include both saline and non-saline samples.

The sample information table have 4 columns: “SampleID,” “Group,” “Type” and “Saline.”

The column “SampleID” is same with the rownames. The column “Group” represents the IW, CW and TW.

The column “Type” represents the sampling region: northeastern region (NE), northwest region (NW), North China area (NC),

middle-lower reaches of the Yangtze River (YML), southern coastal area (SC), upper reaches of the Yangtze River (YU),

Qinghai-Tibet Plateau (QTP). The column “Saline” represents the saline soils and non-saline soils.

In this dataset, the environmental factor table is separated from the sample information table.

Another ITS sequencing dataset is also stored in the example data of the package(Gao et al. 2019).

library(microeco)

load the example data; 16S rRNA gene amplicon sequencing dataset

data(sample_info_16S) data(otu_table_16S) data(taxonomy_table_16S) data(phylo_tree_16S)

load the environmental data which is detached from sample table

data(env_data_16S)

use pipe operator in magrittr package

library(magrittr)

set.seed is used to fix the random number generation to make the results repeatable

set.seed(123)

make the plotting background same with the tutorial

library(ggplot2) theme_set(theme_bw())

Make sure that the data types of sample_table, otu_table and tax_table are all data.frame as the following part shows.

class(otu_table_16S) otu_table_16S[1:5, 1:5]

class(taxonomy_table_16S) taxonomy_table_16S[1:5, 1:3]

make the taxonomic information unified, important

taxonomy_table_16S %<>% tidy_taxonomy

class(sample_info_16S) sample_info_16S[1:5, ]

class(env_data_16S)

class(phylo_tree_16S)

In R6 class, '$new' is the original method used to create a new object of class

dataset <- microtable$new(otu_table = otu_table_16S)

If you only provide abundance table, the class can help you create a sample info table

class(dataset) dataset

Let's create a dataset microtable with more information

dataset <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S, tax_table = taxonomy_table_16S, phylo_tree = phylo_tree_16S) dataset

dataset$tidy_dataset() print(dataset)

dataset$tax_table %<>% base::subset(Kingdom == "kArchaea" | Kingdom == "kBacteria") print(dataset)

This will remove the lines containing the taxa word regardless of taxonomic ranks and ignoring word case in the tax_table.

So if you want to filter some taxa not considerd pollutions, please use subset like the previous operation.

dataset$filter_pollution(taxa = c("mitochondria", "chloroplast")) print(dataset)

dataset$tidy_dataset() print(dataset)

dataset$sample_sums() %>% range

As an example, we use 10000 sequences in each sample

dataset$rarefy_samples(sample.size = 10000)

dataset$sample_sums() %>% range

dataset$cal_abund()

return dataset$taxa_abund

class(dataset$taxa_abund)

dir.create("taxa_abund") dataset$save_abund(dirpath = "taxa_abund")

If you want to add Faith's phylogenetic diversity, use PD = TRUE, this will be a little slow

dataset$cal_alphadiv(PD = FALSE)

return dataset$alpha_diversity

class(dataset$alpha_diversity)

save dataset$alpha_diversity to a directory

dir.create("alpha_diversity") dataset$save_alphadiv(dirpath = "alpha_diversity")

If you do not want to calculate unifrac metrics, use unifrac = FALSE

require GUniFrac package

dataset$cal_betadiv(unifrac = TRUE)

return dataset$beta_diversity

class(dataset$beta_diversity)

save dataset$beta_diversity to a directory

dir.create("beta_diversity") dataset$save_betadiv(dirpath = "beta_diversity")

------------------------------------------------------------------------------------------

---------------Chapter 4 Extended class

All the extended classes depend on the microtable class.

Generally, one class can work alone.

In some cases, one function of a class may interact with the object from another class to complete a method.

---------------------------4.1 trans_abund class

create trans_abund object

use 10 Phyla with the highest abundance in the dataset.

t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10)

t1 object now include the transformed abundance data t1$abund_data and other elements for the following plotting

t1$plot_bar(others_color = "grey70", facet = "Group", xtext_keep = FALSE, legend_text_italic = FALSE)

return a ggplot2 object

The groupmean parameter can be used to obtain the group-mean barplot.

t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10, groupmean = "Group") t1$plot_bar(others_color = "grey70", legend_text_italic = FALSE)

t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 8)

use_alluvium = TRUE make the alluvial plot, clustering = TRUE can be used to reorder the samples by clustering

t1$plot_bar(use_alluvium = TRUE, clustering = TRUE, xtext_type_hor = FALSE, xtext_size = 6)

show 15 taxa at Class level

t1 <- trans_abund$new(dataset = dataset, taxrank = "Class", ntaxa = 15) t1$plot_box(group = "Group")

show 40 taxa at Genus level

t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 40) t1$plot_heatmap(facet = "Group", xtext_keep = FALSE, withmargin = FALSE)

t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 6, groupmean = "Group")

all pie chart in one row

t1$plot_pie(facet_nrow = 1)

merge samples as one community for each group

dataset1 <- dataset$merge_samples(use_group = "Group")

dataset1 is a new microtable object

create trans_venn object

t1 <- trans_venn$new(dataset1, ratio = "seqratio") t1$plot_venn()

The integer data is OTU number

The percentage data is the sequence number/total sequence number

use "Type" column in sample_table

dataset1 <- dataset$merge_samples(use_group = "Type") t1 <- trans_venn$new(dataset1) t1$plot_venn(petal_plot = TRUE)

dataset1 <- dataset$merge_samples(use_group = "Group") t1 <- trans_venn$new(dataset1)

transform venn results to the sample-species table, here do not consider abundance, only use presence/absence information.

t2 <- t1$trans_venn_com(use_OTUs_frequency = TRUE)

t2 is a new microtable class, each part is considered as a sample

class(t2)

calculate taxa abundance, that is, the frequency

t2$cal_abund()

transform and plot

t3 <- trans_abund$new(dataset = t2, taxrank = "Genus", ntaxa = 10) t3$plot_bar(bar_type = "part", legend_text_italic = T, ylab_title = "Frequency (%)", xtext_type_hor = FALSE)

t3 <- trans_abund$new(dataset = t2, taxrank = "Phylum", ntaxa = 8) t3$plot_pie(facet_nrow = 3, use_colors = rev(c(RColorBrewer::brewer.pal(8, "Dark2"), "grey50")))

---------------------------------------------4.3 trans_alpha class--------------------------------------------

t1 <- trans_alpha$new(dataset = dataset, group = "Group")

return t1$alpha_stat

t1$alpha_stat[1:5, ]

t1$cal_diff(method = "KW")

return t1$res_alpha_diff

t1$res_alpha_diff[1:5, ]

t1$cal_diff(method = "anova")

return t1$res_alpha_diff

t1$res_alpha_diff

t1$plot_alpha(add_letter = T, measure = "Chao1", use_boxplot = FALSE)

t1$plot_alpha(pair_compare = TRUE, measure = "Chao1", shape = "Group")

t1 <- trans_alpha$new(dataset = dataset, group = "Group") t1$cal_diff(method = "anova", anova_set = "Group+Type")

now the result t1$res_alpha_diff is a list

see the help document for the usage of anova_set

---------------------------------------------4.4 trans_beta classs--------------------------------------------

we first create an trans_beta object

this operation invoke the distance matrix of bray in dataset$beta_diversity

t1 <- trans_beta$new(dataset = dataset, group = "Group", measure = "bray")

Please also cite the original paper: An et al. (2019). Soil bacterial community structure in Chinese wetlands. Geoderma, 337, 290-299.

t1$cal_ordination(ordination = "PCoA")

t1$res_ordination is the ordination result list

class(t1$res_ordination)

plot the PCoA result

t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_group_ellipse = TRUE)

calculate and plot sample distances within groups

t1$cal_group_distance()

return t1$res_group_distance

t1$plot_group_distance(distance_pair_stat = TRUE)

calculate and plot sample distances between groups

t1$cal_group_distance(within_group = FALSE) t1$plot_group_distance(distance_pair_stat = TRUE)

use replace_name to set the label name, group parameter used to set the color

t1$plot_clustering(group = "Group", replace_name = c("Saline", "Type"))

manova for all groups

t1$cal_manova(cal_manova_all = TRUE) t1$res_manova$aov.tab

manova for each paired groups

t1$cal_manova(cal_manova_paired = TRUE) t1$res_manova

manova for specified group set: here "Group + Type"

t1$cal_manova(cal_manova_set = "Group + Type") t1$res_manova$aov.tab

PERMDISP for the whole comparison and for each paired groups

t1$cal_betadisper()

t1$res_betadisper

----------------------------------4.5 trans_diff class------------------------------------------------

metastat analysis at Genus level

t1 <- trans_diff$new(dataset = dataset, method = "metastat", group = "Group", metastat_taxa_level = "Genus")

t1$res_metastat is the result

t1$res_metastat_group_matrix is the group comparisons order for plotting

plot the first paired groups, choose_group = 1

t1$plot_metastat(use_number = 1:10, qvalue = 0.05, choose_group = 1)

t1 <- trans_diff$new(dataset = dataset, method = "lefse", group = "Group", alpha = 0.01, lefse_subgroup = NULL)

t1$res_lefse is the LEfSe result

t1$res_abund is the abundance information

t1$plot_lefse_bar(LDA_score = 4)

t1$res_lefse[1:5, ]

Extra------VKS--------------to view

v <-t1$res_lefse[] fix(v)

t1$plot_diff_abund(use_number = 1:30)

clade_label_level 5 represent phylum level in this analysis

require ggtree package

t1$plot_lefse_cladogram(use_taxa_num = 200, use_feature_num = 50, clade_label_level = 5)

choose some taxa according to the positions in the previous picture; those taxa labels have minimum overlap

use_labels <- c("cDeltaproteobacteria", "cActinobacteria", "oRhizobiales", "pProteobacteria", "pBacteroidetes", "oMicrococcales", "pAcidobacteria", "pVerrucomicrobia", "pFirmicutes", "pChloroflexi", "cAcidobacteria", "cGammaproteobacteria", "cBetaproteobacteria", "cKD4-96", "cBacilli", "oGemmatimonadales", "fGemmatimonadaceae", "oBacillales", "o__Rhodobacterales")

then use parameter select_show_labels to show

t1$plot_lefse_cladogram(use_taxa_num = 200, use_feature_num = 50, select_show_labels = use_labels)

Now we can see that more taxa names appear in the tree

use Genus level for parameter rf_taxa_level, if you want to use all taxa, change to "all"

nresam = 1 and boots = 1 represent no bootstrapping and use all samples directly

t1 <- trans_diff$new(dataset = dataset, method = "rf", group = "Group", rf_taxa_level = "Genus")

t1$res_rf is the result stored in the object

plot the result

t2 <- t1$plot_diff_abund(use_number = 1:20, only_abund_plot = FALSE) gridExtra::grid.arrange(t2$p1, t2$p2, ncol=2, nrow = 1, widths = c(2,2))

the middle asterisk represent the significances

------------------------------------------4.6 trans_env class-------------------------------------

add_data is used to add the environmental data

t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])

use bray-curtis distance to do dbrda

t1$cal_rda(use_dbrda = TRUE, use_measure = "bray")

t1$res_rda is the result list stored in the object

t1$trans_rda(adjust_arrow_length = TRUE, max_perc_env = 10)

t1$res_rda_trans is the transformed result for plotting

t1$plot_rda(plot_color = "Group")

use Genus

t1$cal_rda(use_dbrda = FALSE, taxa_level = "Genus")

As the main results of RDA are related with the projection and angles between different arrows,

we adjust the length of the arrow to show them clearly using several parameters.

t1$trans_rda(show_taxa = 10, adjust_arrow_length = TRUE, max_perc_env = 1500, max_perc_tax = 3000, min_perc_env = 200, min_perc_tax = 300)

t1$res_rda_trans is the transformed result for plotting

t1$plot_rda(plot_color = "Group")

use Genus

t1$cal_rda(use_dbrda = FALSE, taxa_level = "Genus")

As the main results of RDA are related with the projection and angles between different arrows,

we adjust the length of the arrow to show them clearly using several parameters.

t1$trans_rda(show_taxa = 10, adjust_arrow_length = TRUE, max_perc_env = 1500, max_perc_tax = 3000, min_perc_env = 200, min_perc_tax = 300)

t1$res_rda_trans is the transformed result for plotting

t1$plot_rda(plot_color = "Group")

t1$cal_mantel(use_measure = "bray")

return t1$res_mantel

t1$res_mantel

t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11]) t1$cal_cor(use_data = "Genus", p_adjust_method = "fdr")

return t1$res_cor

t1$res_cor

default ggplot2 method with clustering

t1$plot_cor()

default ggplot2 method with clustering

t1$plot_cor()

first create trans_diff object as a demonstration

t2 <- trans_diff$new(dataset = dataset, method = "rf", group = "Group", rf_taxa_level = "Genus")

then create trans_env object

t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])

use other_taxa to select taxa you need

t1$cal_cor(use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_rf$Taxa[1:40])

t1$plot_cor()

clustering heatmap; require pheatmap package

t1$plot_cor(pheatmap = TRUE)

calculate correlations for different groups using parameter by_group

t1$cal_cor(by_group = "Group", use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_rf$Taxa[1:40])

return t1$res_cor

t1$plot_cor()

t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])

use add_abund_table parameter to add the extra data table

t1$cal_cor(add_abund_table = dataset$alpha_diversity) t1$plot_cor()

use pH and bray-curtis distance

t1$plot_scatterfit( x = "pH", y = dataset$beta_diversity$bray[rownames(t1$env_data), rownames(t1$env_data)], alpha = .1, x_axis_title = "Euclidean distance of pH", y_axis_title = "Bray-Curtis distance", text_x_pos = 4, text_y_pos = .4)

-----------------------------------------4.7 trans_nullmodel class--------------------------

generate trans_nullmodel object; use 1000 OTUs as example

t1 <- trans_nullmodel$new(dataset, taxa_number = 1000, add_data = env_data_16S)

use pH as the test variable

t1$cal_mantel_corr(use_env = "pH")

return t1$res_mantel_corr

plot the mantel correlogram

t1$plot_mantel_corr()

null model run 500 times

t1$cal_ses_betampd(runs=500, abundance.weighted = TRUE)

return t1$res_ses_betampd

add betaNRI matrix to beta_diversity list

dataset$beta_diversity[["betaNRI"]] <- t1$res_ses_betampd

create trans_beta class, use measure "betaNRI"

t2 <- trans_beta$new(dataset = dataset, group = "Group", measure = "betaNRI")

transform the distance for each group

t2$cal_group_distance()

plot the results

g1 <- t2$plot_group_distance(distance_pair_stat = TRUE) g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)

we create a list to store the trans_nullmodel results.

sesbeta_each <- list() group_col <- "Group" all_groups <- unique(dataset$sample_table[, group_col])

calculate for each group, respectively

for(i in all_groups){

like the above operation, but need provide 'group' and 'select_group'

test <- trans_nullmodel$new(dataset, group = group_col, select_group = i, taxa_number = 1000, add_data = env_data_16S) test$cal_ses_betampd(runs = 500, abundance.weighted = TRUE) sesbeta_each[[i]] <- test$res_ses_betampd }

merge and reshape to generate one symmetrical matrix

test <- lapply(sesbeta_each, melt) %>% do.call(rbind, .) %>% reshape2::dcast(., Var1~Var2, value.var = "value") %>% row.names<-(.[,1]) %>% .[, -1, drop = FALSE]

like the above operation

dataset$beta_diversity[["betaNRI"]] <- test t2 <- trans_beta$new(dataset = dataset, group = "Group", measure = "betaNRI") t2$cal_group_distance() g1 <- t2$plot_group_distance(distance_pair_stat = TRUE) g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)

https://chiliubio.github.io/microeco/

microeco

An R package for data mining in microbial community ecology

Background

install.packages("microeco")

If devtools package is not installed, first install it

install.packages("devtools")

then install microeco

devtools::install_github("ChiLiubio/microeco")

library(microeco)

If a package is not installed, it will be installed from CRAN.

First select the packages of interest

packages <- c("reshape2", "MASS", "GUniFrac", "ggpubr", "randomForest", "ggdendro", "ggrepel", "agricolae", "gridExtra", "picante", "pheatmap", "igraph", "rgexf", "ggalluvial")

Now check or install

lapply(packages, function(x) { if(!require(x, character.only = TRUE)) { install.packages(x, dependencies = TRUE) }})

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("ggtree")

library("writexl") library(phyloseq) library(raster) library(ggplot2) library(vegan)

--------visit-------------------------class https://chiliubio.github.io/microeco_tutorial/basic-class.html#microtable-class

The microtable class is the basic class and designed to store the basic data

for all the downstream analysis in the microeco package.

At least, the OTU table (i.e. species-sample abundance table) should be provided to create microtable object.

Thus, the microtable class can recognize the sample information table is missing

and create a default sample table according to the sample names of otu_table.

in this tutorial, we use the data inside the package microeco to show some operations.

The 16S rRNA sequencing results in the example data of the package is used to show the main part of the tutorial.

This dataset is the 16S rRNA gene Miseq sequencing results of wetland soils in China published by An et al.(An et al. 2019),

who surveyed soil bacterial communities in Chinese inland wetlands (IW),

coastal wetland (CW) and Tibet plateau wetlands (TW) using 16S rRNA gene amplicon sequencing method.

These wetlands include both saline and non-saline samples.

The sample information table have 4 columns: “SampleID,” “Group,” “Type” and “Saline.”

The column “SampleID” is same with the rownames. The column “Group” represents the IW, CW and TW.

The column “Type” represents the sampling region: northeastern region (NE), northwest region (NW), North China area (NC),

middle-lower reaches of the Yangtze River (YML), southern coastal area (SC), upper reaches of the Yangtze River (YU),

Qinghai-Tibet Plateau (QTP). The column “Saline” represents the saline soils and non-saline soils.

In this dataset, the environmental factor table is separated from the sample information table.

Another ITS sequencing dataset is also stored in the example data of the package(Gao et al. 2019).

library(microeco)

load the example data; 16S rRNA gene amplicon sequencing dataset

data(sample_info_16S) data(otu_table_16S) data(taxonomy_table_16S) data(phylo_tree_16S)

load the environmental data which is detached from sample table

data(env_data_16S)

use pipe operator in magrittr package

library(magrittr)

set.seed is used to fix the random number generation to make the results repeatable

set.seed(123)

make the plotting background same with the tutorial

library(ggplot2) theme_set(theme_bw())

Make sure that the data types of sample_table, otu_table and tax_table are all data.frame as the following part shows.

class(otu_table_16S) otu_table_16S[1:5, 1:5]

class(taxonomy_table_16S) taxonomy_table_16S[1:5, 1:3]

make the taxonomic information unified, important

taxonomy_table_16S %<>% tidy_taxonomy

class(sample_info_16S) sample_info_16S[1:5, ]

class(env_data_16S)

class(phylo_tree_16S)

In R6 class, '$new' is the original method used to create a new object of class

dataset <- microtable$new(otu_table = otu_table_16S)

If you only provide abundance table, the class can help you create a sample info table

class(dataset) dataset

Let's create a dataset microtable with more information

dataset <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S, tax_table = taxonomy_table_16S, phylo_tree = phylo_tree_16S) dataset

dataset$tidy_dataset() print(dataset)

dataset$tax_table %<>% base::subset(Kingdom == "kArchaea" | Kingdom == "kBacteria") print(dataset)

This will remove the lines containing the taxa word regardless of taxonomic ranks and ignoring word case in the tax_table.

So if you want to filter some taxa not considerd pollutions, please use subset like the previous operation.

dataset$filter_pollution(taxa = c("mitochondria", "chloroplast")) print(dataset)

dataset$tidy_dataset() print(dataset)

dataset$sample_sums() %>% range

As an example, we use 10000 sequences in each sample

dataset$rarefy_samples(sample.size = 10000)

dataset$sample_sums() %>% range

dataset$cal_abund()

return dataset$taxa_abund

class(dataset$taxa_abund)

dir.create("taxa_abund") dataset$save_abund(dirpath = "taxa_abund")

If you want to add Faith's phylogenetic diversity, use PD = TRUE, this will be a little slow

dataset$cal_alphadiv(PD = FALSE)

return dataset$alpha_diversity

class(dataset$alpha_diversity)

save dataset$alpha_diversity to a directory

dir.create("alpha_diversity") dataset$save_alphadiv(dirpath = "alpha_diversity")

If you do not want to calculate unifrac metrics, use unifrac = FALSE

require GUniFrac package

dataset$cal_betadiv(unifrac = TRUE)

return dataset$beta_diversity

class(dataset$beta_diversity)

save dataset$beta_diversity to a directory

dir.create("beta_diversity") dataset$save_betadiv(dirpath = "beta_diversity")

t1 <- trans_alpha$new(dataset = dataset, group = "Group")

return t1$alpha_stat

t1$alpha_stat[1:5, ]

t1$cal_diff(method = "KW")

return t1$res_alpha_diff

t1$res_alpha_diff[1:5, ]

t1$cal_diff(method = "anova")

return t1$res_alpha_diff

t1$res_alpha_diff

t1$plot_alpha(add_letter = T, measure = "Chao1", use_boxplot = FALSE)

t1$plot_alpha(pair_compare = TRUE, measure = "Chao1", shape = "Group")

t1 <- trans_alpha$new(dataset = dataset, group = "Group") t1$cal_diff(method = "anova", anova_set = "Group+Type")

now the result t1$res_alpha_diff is a list

see the help document for the usage of anova_set

we first create an trans_beta object

this operation invoke the distance matrix of bray in dataset$beta_diversity

t1 <- trans_beta$new(dataset = dataset, group = "Group", measure = "bray")

Please also cite the original paper: An et al. (2019). Soil bacterial community structure in Chinese wetlands. Geoderma, 337, 290-299.

t1$cal_ordination(ordination = "PCoA")

t1$res_ordination is the ordination result list

class(t1$res_ordination)

plot the PCoA result

t1$plot_ordination(plot_color = "Group", plot_shape = "Group", plot_group_ellipse = TRUE)

calculate and plot sample distances within groups

t1$cal_group_distance()

return t1$res_group_distance

t1$plot_group_distance(distance_pair_stat = TRUE)

calculate and plot sample distances between groups

t1$cal_group_distance(within_group = FALSE) t1$plot_group_distance(distance_pair_stat = TRUE)

use replace_name to set the label name, group parameter used to set the color

t1$plot_clustering(group = "Group", replace_name = c("Saline", "Type"))

manova for all groups

t1$cal_manova(cal_manova_all = TRUE) t1$res_manova$aov.tab

manova for each paired groups

t1$cal_manova(cal_manova_paired = TRUE) t1$res_manova

manova for specified group set: here "Group + Type"

t1$cal_manova(cal_manova_set = "Group + Type") t1$res_manova$aov.tab

ChiLiubio / microeco

Error in `colnames<-`(`*tmp*`, value = colnames(lhs)) : attempt to set 'colnames' on an object with less than two dimensions #59

manova for all groups

manova for each paired groups

manova for specified group set: here "Group + Type"

I am learning with the following tutorial

https://chiliubio.github.io/microeco/

microeco

An R package for data mining in microbial community ecology

Background

In microbial community ecology, with the development of high-throughput sequencing techniques,

the increasing data amount and complexity make the community data analysis and management a challenge.

There has been a lot of R packages created for the microbiome profiling analysis.

$However, it is still difficult to perform data mining fast and efficiently.

Therefore, we created R microeco package.

If devtools package is not installed, first install it

then install microeco

Dependence

Important packages

If a package is not installed, it will be installed from CRAN.

First select the packages of interest

Now check or install

ggtree

Plotting the cladogram from the LEfSe result requires the ggtree package in bioconductor

(https://bioconductor.org/packages/release/bioc/html/ggtree.html).

-------------------------------------------------------------------

-------------------BASIC CLASS

load the example data; 16S rRNA gene amplicon sequencing dataset

load the environmental data which is detached from sample table

use pipe operator in magrittr package

set.seed is used to fix the random number generation to make the results repeatable

make the plotting background same with the tutorial

make the taxonomic information unified, important

VKS-----Just to see the table

we create an object of microtable class.

This operation is very similar with the package phyloseq(Mcmurdie and Holmes 2013),

but microeco is more brief and simpler.

The otu_table in the microtable class must be the species-sample format: rownames must be OTU names,

colnames must be sample names.

The required sample names must be same in rownames of sample_table and colnames of otu_table.

In R6 class, '$new' is the original method used to create a new object of class

No sample_table provided, automatically use colnames in otu_table to create one ...

If you only provide abundance table, the class can help you create a sample info table

microtable class:

sample_table have 90 rows and 2 columns

otu_table have 13628 rows and 90 columns

Let's create a dataset microtable with more information

microtable class:

sample_table have 90 rows and 4 columns

otu_table have 13628 rows and 90 columns

tax_table have 13628 rows and 7 columns

phylo_tree have 14096 tips

To make the species and sample information consistent across different files in the dataset object,

we can use function tidy_dataset() to trim the dataset.

microtable class:

sample_table have 90 rows and 4 columns

otu_table have 13628 rows and 90 columns

tax_table have 13628 rows and 7 columns

phylo_tree have 13628 tips

Then, we remove OTUs which are not assigned in the Kingdom "kArchaea" or "kBacteria".

microtable class:

sample_table have 90 rows and 4 columns

otu_table have 13628 rows and 90 columns

tax_table have 13330 rows and 7 columns

phylo_tree have 13628 tips

We also remove OTUs with the taxonomic assignments “mitochondria” or “chloroplast.”

This will remove the lines containing the taxa word regardless of taxonomic ranks and ignoring word case in the tax_table.

So if you want to filter some taxa not considerd pollutions, please use subset like the previous operation.

Then, to make the OTUs same in otu_table, tax_table and phylo_tree, we use tidy_dataset() again.

then we use sample_sums() to check the sequence numbers in each sample.

[1] 10316 37087

Sometimes, in order to reduce the effects of sequencing depth on the diversity measurements,

we need to perform the resampling to make the sequence number equal for each sample.

The function rarefy_samples can invoke the function tidy_dataset automatically before and after the rarefying.

As an example, we use 10000 sequences in each sample

Then, we calculate the taxa abundance at each taxonomic rank using cal_abund().

This function return a list called taxa_abund containing several data frame of

the abundance information at each taxonomic rank. The list is stored in the microtable object automatically.

It’s worth noting that the cal_abund() function can be used to solve some complex cases,

such as supporting both the relative and absolute abundance calculation and selecting the partial taxonomic columns.

Error in `colnames<-`(`tmp`, value = colnames(lhs)) : attempt to set 'colnames' on an object with less than two dimensions #59