ChiLiubio / microeco

An R package for data analysis in microbial community ecology
GNU General Public License v3.0

LEfSe features marked different group issue #53

Closed Irisescat closed 3 years ago

Irisescat commented 3 years ago

Hi, thanks for this helpful package for microbiome studies. I compared the features of three different groups (no subgroups) with LEfSe, both in microeco and on the Huttenhower Galaxy server (using the all-against-all mode), with an LDA threshold of > 3.0. The results are not similar: the same features are marked as belonging to different groups. I'm not sure whether this is correct, and I hope someone can help me clarify the difference.

LDA_Cladogram

Galaxy10- D)_Plot_Cladogram_on_data_8

sessionInfo()

`R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04

Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C
[3] LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8
[5] LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C.UTF-8
[9] LC_ADDRESS=C.UTF-8 LC_TELEPHONE=C.UTF-8
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] magrittr_2.0.1 r2excel_1.0.0 xlsx_0.6.5 RColorBrewer_1.1-2
[5] tibble_3.1.4 pheatmap_1.0.12 ggtree_3.0.4 ggplot2_3.3.5
[9] file2meco_0.1.1 microeco_0.5.1 tidyr_1.1.3.9000 dplyr_1.0.7
[13] phyloseq_1.36.0

loaded via a namespace (and not attached):
[1] Biobase_2.52.0 jsonlite_1.7.2 splines_4.1.0
[4] foreach_1.5.1 assertthat_0.2.1 stats4_4.1.0
[7] yulab.utils_0.0.2 xlsxjars_0.6.1 GenomeInfoDbData_1.2.6
[10] pillar_1.6.2 lattice_0.20-44 glue_1.4.2
[13] XVector_0.32.0 colorspace_2.0-2 ggfun_0.0.3
[16] Matrix_1.3-3 plyr_1.8.6 pkgconfig_2.0.3
[19] zlibbioc_1.38.0 purrr_0.3.4 patchwork_1.1.1
[22] tidytree_0.3.4 scales_1.1.1 ggplotify_0.1.0
[25] mgcv_1.8-35 generics_0.1.0 IRanges_2.26.0
[28] ellipsis_0.3.2 withr_2.4.2 BiocGenerics_0.38.0
[31] lazyeval_0.2.2 survival_3.2-11 crayon_1.4.1
[34] fansi_0.5.0 nlme_3.1-152 MASS_7.3-54
[37] vegan_2.5-7 tools_4.1.0 data.table_1.14.0
[40] lifecycle_1.0.0 stringr_1.4.0 aplot_0.1.0
[43] Rhdf5lib_1.14.2 S4Vectors_0.30.0 munsell_0.5.0
[46] cluster_2.1.2 Biostrings_2.60.2 ade4_1.7-17
[49] compiler_4.1.0 GenomeInfoDb_1.28.4 gridGraphics_0.5-1
[52] rlang_0.4.11 rhdf5_2.36.0 grid_4.1.0
[55] RCurl_1.98-1.4 iterators_1.0.13 rhdf5filters_1.4.0
[58] biomformat_1.20.0 rstudioapi_0.13 igraph_1.2.6
[61] bitops_1.0-7 gtable_0.3.0 codetools_0.2-18
[64] multtest_2.48.0 DBI_1.1.1 reshape2_1.4.4
[67] R6_2.5.1 utf8_1.2.2 treeio_1.16.2
[70] permute_0.9-5 ape_5.5 rJava_1.0-4
[73] stringi_1.7.4 parallel_4.1.0 Rcpp_1.0.7
[76] vctrs_0.3.8 tidyselect_1.1.1
`

LEfSe_Issue.zip

ChiLiubio commented 3 years ago

Hi, @Irisescat

Thanks for reporting this difference. I find that the input data differ between the R version and the Huttenhower Galaxy Python version. For example, in Galaxy_LEfSe/lefse_format.txt, the first taxa row is `k__Unclassified|p__Other|c__Other|o__Other|f__Other|g__Other|s__Other`. In your scripts, after tidy_taxonomy() and cal_abund() are used, the taxa abundance table may contain no such ‘Other’ or ‘unclassified’ entries, because that information is messy and may affect downstream analyses. For this reason, tidy_taxonomy() strictly checks the taxonomic information and filters out those classifications. The LEfSe input in microeco depends on the taxa_abund generated by the cal_abund() function.

For now, because the Galaxy input data and your microtable object are not consistent, I cannot repeat the steps and determine whether the differences actually come from the inconsistent input or from something else. Could you generate a ‘lefse_format.txt’ by saving taxa_abund after running cal_abund() and try again on the Huttenhower Galaxy server? If the issue still occurs, please feel free to tell me and send me your microtable object, saved with the save() function.
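As a rough illustration of the suggested export step, here is a minimal base-R sketch that writes a taxa-by-sample abundance table in the LEfSe input layout (first row holds the class labels, the remaining rows are features). The helper name `write_lefse_input`, the toy table, and the class label row name `Group` are assumptions made for this example only; with a real microtable object the abundance table would presumably be one element of `taxa_abund` after cal_abund(), which is not shown here.

```r
# Hypothetical helper (not part of microeco): write an abundance table
# in LEfSe input layout.
#   abund  : data frame, rows = taxa (row names are lineage strings),
#            columns = samples
#   groups : named character vector mapping sample name -> class label
#   file   : output path
write_lefse_input <- function(abund, groups, file) {
  stopifnot(all(colnames(abund) %in% names(groups)))
  # First row: class label of every sample, in the column order of `abund`.
  class_row <- c("Group", unname(groups[colnames(abund)]))
  # Following rows: feature name, then its abundance in each sample.
  feature_rows <- cbind(rownames(abund), as.matrix(abund))
  out <- rbind(class_row, feature_rows)
  write.table(out, file = file, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(out)
}

# Toy data standing in for a taxa abundance table (assumed values).
abund <- data.frame(
  S1 = c(0.6, 0.4),
  S2 = c(0.5, 0.5),
  S3 = c(0.2, 0.8),
  row.names = c("k__Bacteria|p__Firmicutes", "k__Bacteria|p__Proteobacteria")
)
groups <- c(S1 = "A", S2 = "B", S3 = "C")

out_file <- file.path(tempdir(), "lefse_format.txt")
write_lefse_input(abund, groups, out_file)

# Read the file back to confirm the layout.
check <- read.delim(out_file, header = FALSE, stringsAsFactors = FALSE)
print(check)
```

Writing the file from the same table that the R analysis uses means both tools see identical features, so any remaining disagreement cannot come from rows such as ‘Other’ that exist in only one input.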

Chi

Irisescat commented 3 years ago

Thanks for your help. I generated a species-level ‘lefse_format.txt’ after running cal_abund() and tried again on the Huttenhower Galaxy server. In microeco, I tried to use the same input data by setting rf_taxa_level = "Species". The issue is still the same.

lefse_format.txt as Huttenhower Galaxy input: lefse_format.txt

microtable object dataset: meco_qiime2_sub.RData.zip

More details: LEfSe_Issue_2.zip
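To pin down which features the two tools assign to different groups, the two result lists could be joined on the feature name. The following base-R sketch assumes each result has been reduced to a data frame with a `feature` column and a `group` column; those column names, the helper `compare_marked_groups`, and the toy rows are hypothetical, and real feature names may need to be harmonised between the two outputs first.

```r
# Hypothetical helper: join two LEfSe result tables on the feature name
# and label whether both tools assign the feature to the same group.
compare_marked_groups <- function(res_a, res_b) {
  merged <- merge(res_a, res_b, by = "feature", all = TRUE,
                  suffixes = c("_a", "_b"))
  merged$status <- ifelse(
    is.na(merged$group_a), "only_in_b",
    ifelse(is.na(merged$group_b), "only_in_a",
           ifelse(merged$group_a == merged$group_b,
                  "same_group", "different_group")))
  merged
}

# Toy results (assumed values, not taken from the attached files).
res_microeco <- data.frame(
  feature = c("f1", "f2", "f3"),
  group   = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
res_galaxy <- data.frame(
  feature = c("f1", "f2", "f4"),
  group   = c("A", "C", "B"),
  stringsAsFactors = FALSE
)

cmp <- compare_marked_groups(res_microeco, res_galaxy)
print(cmp)
```

Rows labelled `different_group` are the ones relevant to this issue; rows found in only one result point instead to differences in the input or in the significance filtering.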