Automated description of community composition

Great package, great time saver, love it!

Describe the solution you'd like

I'm a microbial ecologist, so I regularly need to describe the composition of the community in samples and/or groups of samples. The graphical output is usually a stacked bar plot scaled to 100% abundance, accompanied by a textual description of the most abundant taxa/species in each sample. The underlying data are a data.frame with species (or taxa) as rows and samples as columns. Each number is the relative abundance of a taxon in a sample.

Example data, adapted from https://stackoverflow.com/questions/38452577/making-stack-bar-plot-of-bacterial-abundance

df <- data.frame(
  sample1=c(0.0084246282,0.41627099,0.55475503,0,0.000724518,5.391762e-05,0.01977092),
  sample2=c(0.0168571327,0.132988, 0.80289437, 3.560112e-05, 0.004272135, 0.04238314, 0.000569618),
  sample3=c(0.0020299288,0.53813817,0.42367947, 0.03311006, 0.0007978327, 3.534702e-05, 0.002209189),
  row.names = c("Actinobacteria", "Bacteroidetes", "Firmicutes", "Fusobacteria", "Proteobacteria", "Verrucomicrobia", "Other"))

> df
                     sample1      sample2      sample3
Actinobacteria  8.424628e-03 1.685713e-02 2.029929e-03
Bacteroidetes   4.162710e-01 1.329880e-01 5.381382e-01
Firmicutes      5.547550e-01 8.028944e-01 4.236795e-01
Fusobacteria    0.000000e+00 3.560112e-05 3.311006e-02
Proteobacteria  7.245180e-04 4.272135e-03 7.978327e-04
Verrucomicrobia 5.391762e-05 4.238314e-02 3.534702e-05
Other           1.977092e-02 5.696180e-04 2.209189e-03

I would like to automatically describe the top N species (i.e. the N most abundant rows) in each sample. For example, the data could then be reported as:

"Firmicutes (55.48%) and Bacteroidetes (41.63%) made up almost the entire bacterial community in sample1 (97.10%), while Fusobacteria was absent" or similar.

How could we do it? I don't really know, but it should be related to the report.data.frame() function. I hope this is not too narrow in scope as a feature proposal; generalized, it would be an extension of the report.data.frame() method that highlights the top N features instead of the mean or other statistics. Using report() on the above produces:

> report(df, range = T,distribution = T,dispersion = T,centrality = T)
The data contains 7 observations of the following variables:
  - sample1: Mean = 0.14, SD = 0.24, Median = 0.01, MAD = 0.01, range: [0, 0.55], Skewness = 1.35, Kurtosis = -0.79, 0 missing
  - sample2: Mean = 0.14, SD = 0.29, Median = 0.02, MAD = 0.02, range: [0.00, 0.80], Skewness = 2.51, Kurtosis = 1.92, 0 missing
  - sample3: Mean = 0.14, SD = 0.23, Median = 0.00, MAD = 0.00, range: [0.00, 0.54], Skewness = 1.30, Kurtosis = -0.88, 0 missing

Sorry for the late answer!

That's rather tricky, as report.data.frame tends to summarize variables (compute mean, SD, etc.), whereas what we'd need here is to report the values directly.

One option could be to create a new function, report_values or something like that, which would simply report the values according to some groups, but we'd need to think about it and about how generalizable the usage of such a function would be.
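To make the idea concrete, here is a rough base R sketch of what such a function might look like. The name report_top_values() and its arguments n and digits are hypothetical and are not part of the report package; it assumes taxa are stored as row names and that each column holds relative abundances.

```r
# Example data from the original post (taxa as rows, samples as columns)
df <- data.frame(
  sample1 = c(0.0084246282, 0.41627099, 0.55475503, 0, 0.000724518, 5.391762e-05, 0.01977092),
  sample2 = c(0.0168571327, 0.132988, 0.80289437, 3.560112e-05, 0.004272135, 0.04238314, 0.000569618),
  sample3 = c(0.0020299288, 0.53813817, 0.42367947, 0.03311006, 0.0007978327, 3.534702e-05, 0.002209189),
  row.names = c("Actinobacteria", "Bacteroidetes", "Firmicutes", "Fusobacteria",
                "Proteobacteria", "Verrucomicrobia", "Other")
)

# Hypothetical helper (NOT an existing report function): one sentence per column
report_top_values <- function(x, n = 2, digits = 2) {
  stopifnot(is.data.frame(x), n >= 1)
  fmt <- paste0("%.", digits, "f%%")  # e.g. "%.2f%%"
  vapply(names(x), function(sample) {
    values <- x[[sample]]
    names(values) <- rownames(x)
    # Keep the n largest values, most abundant first
    top <- head(sort(values, decreasing = TRUE), n)
    taxa <- paste0(names(top), " (", sprintf(fmt, 100 * top), ")", collapse = ", ")
    text <- sprintf(
      "In %s, the top %d taxa were %s, together accounting for %s of the community.",
      sample, length(top), taxa, sprintf(fmt, 100 * sum(top))
    )
    # Mention taxa whose abundance is exactly zero in this sample
    absent <- names(values)[values == 0]
    if (length(absent) > 0) {
      text <- paste0(text, " Absent taxa: ", paste(absent, collapse = ", "), ".")
    }
    text
  }, character(1))
}

out <- report_top_values(df, n = 2)
cat(out, sep = "\n")
```

This only formats the raw values; how it should interact with grouping variables or with the existing report() output is left open.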

A start would be to reshape your table into a "long" format:

library(tidyverse)

df <- data.frame(
  sample1=c(0.0084246282,0.41627099,0.55475503,0,0.000724518,5.391762e-05,0.01977092),
  sample2=c(0.0168571327,0.132988, 0.80289437, 3.560112e-05, 0.004272135, 0.04238314, 0.000569618),
  sample3=c(0.0020299288,0.53813817,0.42367947, 0.03311006, 0.0007978327, 3.534702e-05, 0.002209189),
  row.names = c("Actinobacteria", "Bacteroidetes", "Firmicutes", "Fusobacteria", "Proteobacteria", "Verrucomicrobia", "Other"))

df %>% 
  tibble::rownames_to_column("Species") %>% 
  tidyr::pivot_longer(2:4, names_to="Sample", values_to="N") %>% 
  dplyr::arrange(Sample, desc(N))
#> # A tibble: 21 x 3
#>    Species         Sample          N
#>    <chr>           <chr>       <dbl>
#>  1 Firmicutes      sample1 0.555    
#>  2 Bacteroidetes   sample1 0.416    
#>  3 Other           sample1 0.0198   
#>  4 Actinobacteria  sample1 0.00842  
#>  5 Proteobacteria  sample1 0.000725 
#>  6 Verrucomicrobia sample1 0.0000539
#>  7 Fusobacteria    sample1 0        
#>  8 Firmicutes      sample2 0.803    
#>  9 Bacteroidetes   sample2 0.133    
#> 10 Verrucomicrobia sample2 0.0424   
#> # ... with 11 more rows

Created on 2020-03-14 by the reprex package (v0.3.0)
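From the long format, one possible next step is to keep the top N rows per sample and paste them into a sentence. This is only a sketch: the variable top_n_taxa and the sentence template are made up for illustration, and nothing here uses report itself.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- data.frame(
  sample1 = c(0.0084246282, 0.41627099, 0.55475503, 0, 0.000724518, 5.391762e-05, 0.01977092),
  sample2 = c(0.0168571327, 0.132988, 0.80289437, 3.560112e-05, 0.004272135, 0.04238314, 0.000569618),
  sample3 = c(0.0020299288, 0.53813817, 0.42367947, 0.03311006, 0.0007978327, 3.534702e-05, 0.002209189),
  row.names = c("Actinobacteria", "Bacteroidetes", "Firmicutes", "Fusobacteria",
                "Proteobacteria", "Verrucomicrobia", "Other")
)

top_n_taxa <- 2  # hypothetical setting: how many taxa to mention per sample

descriptions <- df %>%
  rownames_to_column("Species") %>%
  pivot_longer(-Species, names_to = "Sample", values_to = "N") %>%
  arrange(Sample, desc(N)) %>%
  group_by(Sample) %>%
  # Rows are already sorted, so the first rows of each group are the most abundant
  slice(seq_len(top_n_taxa)) %>%
  summarise(
    Top = paste0(Species, " (", sprintf("%.2f%%", 100 * N), ")", collapse = " and "),
    Total = sprintf("%.2f%%", 100 * sum(N))
  )

# One sentence per sample, built from the summary table
sentences <- sprintf(
  "%s made up %s of the community in %s.",
  descriptions$Top, descriptions$Total, descriptions$Sample
)
cat(sentences, sep = "\n")
```

With top_n_taxa larger than 2, the " and " separator would read awkwardly, so a real implementation would probably want proper list formatting.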

easystats / report

Automated description of community composition #71