ChiLiubio / microeco

An R package for data analysis in microbial community ecology
GNU General Public License v3.0
199 stars 58 forks source link

When using the Silva database, plot_diff_cladogram function runs for a long time without any result or error. #115

Open wangzhichao1990 opened 2 years ago

wangzhichao1990 commented 2 years ago

Hi, When using the Silva database, plot_diff_cladogram function runs for a long time without any result or error.

My code is shown below.

library(microeco)
library(mecodev)
library(file2meco)
library(magrittr)

dataset <- qiime2meco(
  ASV_data = "feature_table/feature_table.qza",
  sample_data = "sample-metadata.tsv",
  taxonomy_data = "classification/classification_silva.qza", # or, "classification/classification_gg.qza"
  phylo_tree = "phylogeny/rooted-tree.qza",
  rep_fasta = "feature_table/repseqs.qza"
)

dataset$tax_table %<>% base::subset(Kingdom == "k__Archaea" |
                                      Kingdom == "k__Bacteria")
dataset$filter_pollution(taxa = c("mitochondria", "chloroplast"))
dataset$tidy_dataset()

dataset$sample_sums() %>% range
dataset$rarefy_samples(sample.size = 21240)
dataset$tidy_dataset()
dataset$cal_abund()

#######
# save datasets
dataset_silva <- dataset$clone()
dataset_gg <- dataset$clone()
save(dataset_silva, dataset_gg, file = "datasets.RData")

#######
t <- trans_diff$new(
  dataset = dataset,
  method = "lefse",
  group = "group",
  taxa_level = "all",
  alpha = 0.05,
)
t$plot_diff_cladogram(
  use_taxa_num = 200,
  use_feature_num = 30,
  group_order = t$res_diff$Group %>% unique()
)

When using the Greengenes database, it works well.

This compressed file is test data: datasets.zip

I compared the differences between the two databases. For the Silva database, it seems that there is more than one word in the classification name of some levels. That may be the reason, but I'm not sure.

ChiLiubio commented 2 years ago

Hi @wangzhichao1990

This is a bug coming from the chaos of taxonomy lineage with "Incertae_Sedis". Such unclear taxonomy can cause the parent taxonomy disorder as many different taxa may has same name "Incertae_Sedis". The inner filtering step did not find this. I will fix this. As a temporary solution, you can add a step to manually filter those taxa like this:

t <- trans_diff$new(
  dataset = dataset,
  method = "lefse",
  group = "group",
  taxa_level = "all",
  alpha = 0.05,
)

# add this step
t$abund_table %<>% .[!grepl("Incertae_Sedis", rownames(.)), ]

t$plot_diff_cladogram(
  use_taxa_num = 200,
  use_feature_num = 30,
  group_order = t$res_diff$Group %>% unique()
)

This is a long-standing issue in taxonomy names of SILVA and also other some database. I have tried to filter all the chaotic taxonomy in the plot_diff_cladogram function as much as possible because this operation has a high requirement on the consistency of taxonomy. Thanks very much for your finding.

Best Chi

wangzhichao1990 commented 2 years ago

@ChiLiubio Thank you for your answer.