ChiLiubio / microeco

An R package for data analysis in microbial community ecology
GNU General Public License v3.0
194 stars 56 forks source link

Problem with representation of dbRDA #398

Open alberjo296 opened 3 weeks ago

alberjo296 commented 3 weeks ago

Hi again Chi!

I'm using your script for doing a dbRDA, but when doing it with vegan I do not obtain the same results. Is there any kind of transformation or scalation applied to the data? because I don't understand. I have also used another software and the results are not the same, but with vegan I was expecting more similar results since this Microeco function is based on vegan. I don't know if there is any parameter established by default that I cannot change, I've looked at both libraries and I don't find it, maybe there is a parameter already established behind microeco language that I cannot control...

I hope you can help :)

Best regards

ChiLiubio commented 3 weeks ago

Hi. Could you please attach your data (or the example data in the package) and script to let me reproduce your issue? There is a step for data transformation to make the visualization better. So it is necesary to point out the difference so that I can verify the process and explain it if it is normal.

Best, Chi

alberjo296 commented 3 weeks ago

**Hi Chi,

Here is my data:**

Environmental_variables.csv Taxa.csv Metadata.csv Reads.csv

This is the script I'm using for Microeco:

Reads <- read.csv("Reads.csv", row.names=1) View(Reads) Taxa <- read.csv("Taxa.csv", row.names=1) View(Taxa) metadata <- read.csv("Metadata.csv", row.names=1) View(metadata) dataset <- microtable$new(otu_table = Reads, sample_table = metadata, tax_table = Taxa) dataset$print() dataset$tidy_dataset() test <- dataset$tax_table for(i in 1:ncol(test)){ prefix <- paste0(tolower(substr(colnames(test)[i], 1, 1)), "__") test[, i] <- gsub(prefix, "", test[, i]) test[,i] <- paste0(prefix, test[, i]) }

dataset$print()

dataset$tidy_dataset()

dataset$save_abund(dirpath = "taxa_abund")

If you want to add Faith's phylogenetic diversity, use PD = TRUE, this will be a little slow

dataset$cal_alphadiv(PD = FALSE) dataset$save_alphadiv(dirpath = "alpha_diversity")

If you do not want to calculate unifrac metrics, use unifrac = FALSE

require GUniFrac package installed

dataset$cal_betadiv(unifrac = FALSE)

save dataset$beta_diversity to a directory

env_data_16S <- read.csv("Environmental_variables.csv", row.names=1)

t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 1:7],) t1$cal_ordination(method = "dbRDA", use_measure = "jaccard") t1$trans_ordination(adjust_arrow_length = TRUE, max_perc_env = 1.5) t1$plot_ordination(plot_color = "Treatment") t1$cal_ordination_anova() t1$cal_ordination_envfit() t1$res_ordination_terms t1$res_ordination_axis

However, the plot I obtain is very different to that from vegan. Here is the script with vegan (the data is the same but transposing the Reads table in Excel):

Load necessary libraries

library(vegan) # For ecological analyses including db-RDA library(ggplot2) # For plotting library(dplyr) # For data manipulation library(ggrepel) # For better text label positioning

Import the CSV files

metadata <- read.csv("Metadata.csv", row.names = 1) # Assume first column is SampleID reads <- read.csv("Reads.csv", row.names = 1) # Assume first column is SampleID env_vars <- read.csv("Environmental_variables.csv", row.names = 1) # Assume first column is SampleID

Check the first few rows of each data frame to ensure correct import

head(metadata) head(reads) head(env_vars)

Ensure row names match

all(rownames(metadata) == rownames(reads)) all(rownames(metadata) == rownames(env_vars))

If the above checks return TRUE, you can proceed. If not, you may need to reorder or filter the data to match.

Combine environmental variables with metadata

combined_data <- cbind(env_vars, metadata)

Perform db-RDA, using the Jaccard distance measure

jaccard_dist <- vegdist(reads, method = "jaccard") dbrda_result <- dbrda(jaccard_dist ~ ., data = combined_data)

Summarize the db-RDA results

summary(dbrda_result)

Extract site scores for plotting

site_scores <- scores(dbrda_result, display = "sites") site_scores_df <- as.data.frame(site_scores)

Extract environmental variable scores

env_scores <- scores(dbrda_result, display = "bp") env_scores_df <- as.data.frame(env_scores)

Add treatment information from metadata

site_scores_df$Treatment <- metadata$Treatment

Plot db-RDA results

ggplot(site_scores_df, aes(x = dbRDA1, y = dbRDA2, color = Treatment)) + geom_point(size = 3) + geom_text_repel(aes(label = rownames(site_scores_df)), size = 3) + geom_segment(data = env_scores_df, aes(x = 0, y = 0, xend = dbRDA1, yend = dbRDA2), arrow = arrow(type = "closed", length = unit(0.1, "inches")), color = "black") + geom_text(data = env_scores_df, aes(x = dbRDA1, y = dbRDA2, label = rownames(env_scores_df)), color = "black", size = 3) + labs(x = "db-RDA1", y = "db-RDA2", color = "Treatment") + theme_minimal() + scale_color_brewer(palette = "Set1")

I don't know why the direction of the arrows is different and there are factors which are (positively or negatively) correlated with Microeco but not with vegan and vice-versa...

ChiLiubio commented 3 weeks ago

Hi. There are two reasons that can explain why they are different. The first is the jaccard distance in microeco is calculated using binary data, which is a little different from that in vegdist (they are same if the data for vegdist is converted to binary). So I use bray for the example. The second is the env data input is abnormal. It should not be combined with the metadata again. I have changed it in the following code.


library(microeco)

Reads <- read.csv("Reads.csv", row.names=1)
View(Reads)
Taxa <- read.csv("Taxa.csv", row.names=1)
View(Taxa)
metadata <- read.csv("Metadata.csv", row.names=1)
metadata <- data.frame(SampleID = rownames(metadata), metadata)

View(metadata)
dataset <- microtable$new(otu_table = Reads, sample_table = metadata, tax_table = Taxa)
dataset$print()
dataset$tidy_dataset()
test <- dataset$tax_table
for(i in 1:ncol(test)){
prefix <- paste0(tolower(substr(colnames(test)[i], 1, 1)), "__")
test[, i] <- gsub(prefix, "", test[, i])
test[,i] <- paste0(prefix, test[, i])
}

dataset$print()

dataset$tidy_dataset()

dataset$cal_alphadiv(PD = FALSE)
dataset$save_alphadiv(dirpath = "alpha_diversity")
# If you do not want to calculate unifrac metrics, use unifrac = FALSE
# require GUniFrac package installed

dataset$cal_betadiv(unifrac = FALSE)
# save dataset$beta_diversity to a directory

env_data_16S <- read.csv("Environmental_variables.csv", row.names=1)

t1 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 1:7])
t1$cal_ordination(method = "dbRDA", use_measure = "bray")
t1$trans_ordination(adjust_arrow_length = TRUE, max_perc_env = 1.5)
t1$plot_ordination(plot_color = "Treatment")
t1$cal_ordination_anova()
t1$cal_ordination_envfit()
t1$res_ordination_terms
t1$res_ordination_axis

test1 <- t1$plot_ordination(plot_color = "Treatment")

# However, the plot I obtain is very different to that from vegan. Here is the script with vegan (the data is the same but transposing the Reads table in Excel):
# Load necessary libraries

library(vegan) # For ecological analyses including db-RDA
library(ggplot2) # For plotting
library(dplyr) # For data manipulation
library(ggrepel) # For better text label positioning
# Import the CSV files

metadata <- read.csv("Metadata.csv", row.names = 1) # Assume first column is SampleID
metadata <- data.frame(SampleID = rownames(metadata), metadata)
reads <- read.csv("Reads.csv", row.names = 1) # Assume first column is SampleID
env_vars <- read.csv("Environmental_variables.csv", row.names = 1) # Assume first column is SampleID
# Check the first few rows of each data frame to ensure correct import

head(metadata)
head(reads)
head(env_vars)
# Ensure row names match

# all(rownames(metadata) == colnames(reads))
# all(rownames(metadata) == rownames(env_vars))
# If the above checks return TRUE, you can proceed. If not, you may need to reorder or filter the data to match.
# Combine environmental variables with metadata

combined_data <- cbind(env_vars, metadata)
# Perform db-RDA, using the Jaccard distance measure

reads %<>% t %>% as.data.frame %>% .[rownames(metadata), ]

jaccard_dist <- vegdist(reads, method = "bray")
dbrda_result <- dbrda(jaccard_dist ~ ., data = env_vars)
# Summarize the db-RDA results

summary(dbrda_result)
# Extract site scores for plotting

site_scores <- scores(dbrda_result, display = "sites")
site_scores_df <- as.data.frame(site_scores)
# Extract environmental variable scores

env_scores <- scores(dbrda_result, display = "bp")
env_scores_df <- as.data.frame(env_scores)
# Add treatment information from metadata

site_scores_df$Treatment <- metadata$Treatment
# Plot db-RDA results

test2 <- ggplot(site_scores_df, aes(x = dbRDA1, y = dbRDA2, color = Treatment)) +
geom_point(size = 3) +
geom_text_repel(aes(label = rownames(site_scores_df)), size = 3) +
geom_segment(data = env_scores_df, aes(x = 0, y = 0, xend = dbRDA1, yend = dbRDA2),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "black") +
geom_text(data = env_scores_df, aes(x = dbRDA1, y = dbRDA2, label = rownames(env_scores_df)),
color = "black", size = 3) +
labs(x = "db-RDA1", y = "db-RDA2", color = "Treatment") +
theme_minimal() +
scale_color_brewer(palette = "Set1")

test2

Now test1 and test2 are similar.