MGXlab / social_niche_breadth_SNB

Calculate the Social Niche Breadth (SNB) score of all taxonomic lineages in a set of microbiomes.
MIT License

Script to calculate Social Niche Breadth

This script calculates the Social Niche Breadth (SNB) score of all taxonomic lineages in a set of microbiomes. For a description of SNB see:

von Meijenfeldt, F.A.B., Hogeweg, P. & Dutilh, B.E. A social niche breadth score reveals niche range strategies of generalists and specialists. Nat Ecol Evol (2023). https://doi.org/10.1038/s41559-023-02027-7.

We plan to keep this code updated and add new features depending on the feedback from the community. The frozen code that was used for the paper can be found at https://doi.org/10.5281/zenodo.7651594.

Usage

Run ./calculate_SNB.py -h to see a list of options.

The script currently expects an input file of taxonomic profiles formatted like "Supplementary Data 2" in the paper. You can use "Supplementary Data 2" (table.MGnify.taxa_in_analyses.txt.gz) as the input file with default parameters to generate the SNB scores of the paper:

./calculate_SNB.py -f table.MGnify.taxa_in_analyses.txt.gz -o output_file.txt

If you format your taxonomic profiles similarly, the script will also work on your data. The input file looks like this:

taxonomic lineage                                        MGYA00142528 (59624)  MGYA00142531 (57123)  MGYA00142543 (106917)  ...
super kingdom.Archaea                                    0                     0                     0                      ...
super kingdom.Bacteria                                   59624                 57123                 106917                 ...
super kingdom.Viruses                                    0                     0                     0                      ...
super kingdom.Archaea;phylum.Candidatus_Aenigmarchaeota  0                     0                     0                      ...
...                                                      ...                   ...                   ...                    ...

The header should contain unique sample names, each followed by the total number of taxonomically annotated reads in parentheses. Taxonomic lineages are written as {rank}.{taxon} pairs joined by semicolons. Counts are absolute read counts. The count of a higher-rank lineage, such as Bacteria, is the sum of all its daughter lineages plus any bacterial reads that could not be annotated at a lower rank. For pairwise comparisons between microbiomes at a specific rank (default: order), only the lineages at that rank are considered. The script does not check whether the profiles are consistent, for example whether the number of reads associated with "super kingdom.Bacteria" is equal to or larger than the sum of reads associated with bacterial daughter taxa.
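Because the script does not validate profiles, it can help to check them yourself before running it. The following is a minimal sketch of how the header labels and lineage strings described here could be parsed and checked. The function names (parse_sample_header, parse_lineage, overfull_parents) are hypothetical helpers written for this example and are not part of calculate_SNB.py.

```python
import re
from collections import defaultdict

# Matches a header label such as "MGYA00142528 (59624)".
HEADER_RE = re.compile(r"^(?P<name>.+?)\s*\((?P<total>\d+)\)$")


def parse_sample_header(label):
    """Split 'MGYA00142528 (59624)' into ('MGYA00142528', 59624).

    Returns (name, None) when no read total is present, as in a
    relative abundance table.
    """
    match = HEADER_RE.match(label.strip())
    if match is None:
        return label.strip(), None
    return match.group("name"), int(match.group("total"))


def parse_lineage(lineage):
    """Split a lineage string into a list of (rank, taxon) pairs."""
    pairs = []
    for part in lineage.split(";"):
        # Split on the first dot only, since taxon names may contain dots.
        rank, _, taxon = part.partition(".")
        pairs.append((rank, taxon))
    return pairs


def overfull_parents(counts):
    """Return lineages whose direct daughters sum to more reads than the parent.

    `counts` maps lineage strings to read counts for a single sample.
    """
    daughter_sums = defaultdict(int)
    for lineage, count in counts.items():
        parent, sep, _ = lineage.rpartition(";")
        if sep:  # top-level lineages have no parent
            daughter_sums[parent] += count
    return sorted(
        parent
        for parent, total in daughter_sums.items()
        if parent in counts and total > counts[parent]
    )
```

An empty list from overfull_parents means that no parent lineage in that sample has fewer reads than its direct daughters combined.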

Alternatively, a relative abundance table can be supplied, in which case the header should only contain unique sample names.

taxonomic lineage                                        MGYA00142528  MGYA00142531  MGYA00142543  ...
super kingdom.Archaea                                    0             0             0             ...
super kingdom.Bacteria                                   1.0           1.0           1.0           ...
super kingdom.Viruses                                    0             0             0             ...
super kingdom.Archaea;phylum.Candidatus_Aenigmarchaeota  0             0             0             ...
...                                                      ...           ...           ...           ...

If a relative abundance table is supplied instead of a read count table, the --c2 / --pairwise_comparisson_cutoff option (see below) cannot be set to absolute read counts.
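The example tables suggest that a relative abundance is a lineage's read count divided by the sample's total number of annotated reads (59624 / 59624 = 1.0 for Bacteria in the first sample). The sketch below shows that conversion for one sample under that assumption. The function name to_relative_abundances is hypothetical and is not taken from calculate_SNB.py.

```python
def to_relative_abundances(counts, total_reads):
    """Divide each lineage's read count by the sample's annotated-read total.

    Assumption: relative abundance = count / total annotated reads,
    where the total is the number given in the sample's header label.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {lineage: count / total_reads for lineage, count in counts.items()}


# Counts for sample MGYA00142528, whose header total is 59624.
sample_counts = {
    "super kingdom.Archaea": 0,
    "super kingdom.Bacteria": 59624,
    "super kingdom.Viruses": 0,
}
relative = to_relative_abundances(sample_counts, total_reads=59624)
```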

Let us know if other input formats would be useful.

The output file looks like this:

taxonomic lineage                                        number of samples  mean relative abundance  SNB score
super kingdom.Archaea                                    11256              0.0959699                0.5215317
super kingdom.Bacteria                                   22295              0.9615460                0.5627098
super kingdom.Viruses                                    0                  nan                      nan
super kingdom.Archaea;phylum.Candidatus_Aenigmarchaeota  23                 0.0013518                0.5174395
...                                                      ...                ...                      ...
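Lineages that occur in no sample get nan for both the mean relative abundance and the SNB score, so downstream code should handle those values. The following is a minimal sketch for loading the output and ranking the scored lineages. It assumes the output is tab-separated with the four columns shown above, which is not stated here and should be checked against your own output file. The function name read_snb_table is hypothetical.

```python
import csv
import io
import math


def read_snb_table(handle):
    """Read SNB output rows as (lineage, n_samples, mean_abundance, snb).

    Assumption: the file is tab-separated with a single header line.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    rows = []
    for lineage, n_samples, mean_abundance, snb in reader:
        # float("nan") parses the 'nan' entries without special handling.
        rows.append((lineage, int(n_samples), float(mean_abundance), float(snb)))
    return rows


# Stand-in for open("output_file.txt"), using the example rows above.
example = io.StringIO(
    "taxonomic lineage\tnumber of samples\tmean relative abundance\tSNB score\n"
    "super kingdom.Archaea\t11256\t0.0959699\t0.5215317\n"
    "super kingdom.Bacteria\t22295\t0.9615460\t0.5627098\n"
    "super kingdom.Viruses\t0\tnan\tnan\n"
)
rows = read_snb_table(example)

# Drop lineages without a score, then sort from highest to lowest SNB.
scored = [row for row in rows if not math.isnan(row[3])]
scored.sort(key=lambda row: row[3], reverse=True)
for lineage, n_samples, _, snb in scored:
    print(f"{lineage}\t{n_samples}\t{snb:.4f}")
```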

Installation

No installation is needed: just download the script and run it. It requires Python >= 3.6 with the Python Standard Library, NumPy, and SciPy.
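To confirm that an environment meets these requirements before running the script, a short check like the one below can be used. It is a sketch that uses only the standard library. The function name missing_requirements is hypothetical, and the module names "numpy" and "scipy" are the usual import names for NumPy and SciPy.

```python
import importlib.util
import sys


def missing_requirements(modules=("numpy", "scipy"), min_python=(3, 6)):
    """Return a list of unmet requirements (an empty list means all are met)."""
    problems = []
    if sys.version_info < min_python:
        problems.append("python>={}.{}".format(*min_python))
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is None:
            problems.append(name)
    return problems


if __name__ == "__main__":
    missing = missing_requirements()
    print("missing:", ", ".join(missing) if missing else "nothing")
```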

Notes