DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

silva database: more superkingdoms than domains in SILVA #739

Closed nggvs closed 1 week ago

nggvs commented 1 year ago

Hi,

I was using the script to build the silva database:

 kraken2-build --db path/2/db  --special silva

and I noticed that build_silva_taxonomy.pl converts the SILVA 'domain' rank to 'superkingdom':

    $rank = "superkingdom" if $rank eq "domain";
    print NAMES "$node_id\t|\t$display_name\t|\t-\t|\tscientific name\t|\n";
    print NODES "$node_id\t|\t$parent_id\t|\t$rank\t|\t-\t|\n";

However, SILVA itself already contains other nodes with the rank 'superkingdom'. After running the script, these end up with the same rank as the three domains. Here is an example:

grep 'superkingdom' taxonomy/nodes.dmp
2       |       1       |       superkingdom    |       -       |
3       |       1       |       superkingdom    |       -       |
4       |       1       |       superkingdom    |       -       |
46959   |       46958   |       superkingdom    |       -       |
47567   |       46958   |       superkingdom    |       -       |

And these taxa are:
2       |       Archaea |       -       |       scientific name |
3       |       Bacteria        |       -       |       scientific name |
4       |       Eukaryota       |       -       |       scientific name |
46959   |       Holozoa |       -       |       scientific name |
47567   |       Nucletmycea     |       -       |       scientific name |
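
The same check can be scripted. Below is a rough Python sketch, assuming nodes.dmp and names.dmp use the NCBI-style `\t|\t` layout written by the Perl lines above and that the root node has ID 1. `parse_dmp_line` and `nested_superkingdoms` are hypothetical helper names, not part of Kraken 2.

```python
def parse_dmp_line(line):
    """Split one .dmp row on '|' and strip the padding around each field."""
    fields = [f.strip() for f in line.rstrip("\n").split("|")]
    # Rows end with a trailing '|', which leaves one empty field at the end.
    if fields and fields[-1] == "":
        fields.pop()
    return fields


def nested_superkingdoms(nodes_lines, names_lines, root_id="1"):
    """Return (node_id, parent_id, name) for superkingdoms not directly under root."""
    names = {}
    for line in names_lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 2:
            names[fields[0]] = fields[1]
    hits = []
    for line in nodes_lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 3 and fields[2] == "superkingdom" and fields[1] != root_id:
            hits.append((fields[0], fields[1], names.get(fields[0], "?")))
    return hits


# Rows copied from the grep output above.
nodes = [
    "2\t|\t1\t|\tsuperkingdom\t|\t-\t|",
    "3\t|\t1\t|\tsuperkingdom\t|\t-\t|",
    "4\t|\t1\t|\tsuperkingdom\t|\t-\t|",
    "46959\t|\t46958\t|\tsuperkingdom\t|\t-\t|",
    "47567\t|\t46958\t|\tsuperkingdom\t|\t-\t|",
]
names = [
    "2\t|\tArchaea\t|\t-\t|\tscientific name\t|",
    "3\t|\tBacteria\t|\t-\t|\tscientific name\t|",
    "4\t|\tEukaryota\t|\t-\t|\tscientific name\t|",
    "46959\t|\tHolozoa\t|\t-\t|\tscientific name\t|",
    "47567\t|\tNucletmycea\t|\t-\t|\tscientific name\t|",
]

for node_id, parent_id, name in nested_superkingdoms(nodes, names):
    print(node_id, parent_id, name, sep="\t")
```

With real files, the two lists could be replaced by `open("taxonomy/nodes.dmp")` and `open("taxonomy/names.dmp")`.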

I'm using the files that 16S_silva_installation.sh downloads automatically:

grep -E  'superkingdom|domain' data/tax_slv_ssu_138.1.txt
Archaea;        2       domain
Bacteria;       3       domain
Eukaryota;      4       domain
Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa; 46959   superkingdom            138
Eukaryota;Amorphea;Obazoa;Opisthokonta;Nucletmycea;     47567   superkingdom            138

Maybe you could give them the rank major_clade, like the neighboring nodes:

Eukaryota;Amorphea;Obazoa;Opisthokonta; 46958   major_clade             138
Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa; 46959   superkingdom            138
Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa;Choanozoa;       46960   major_clade             138
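
To make the suggestion concrete, here is a minimal Python sketch of the rank mapping (the real script is Perl, so this is only an illustration, not a patch). It assumes the SILVA taxonomy file is tab-separated with the path, node ID and rank in the first three columns. `kraken_rank` and `parse_silva_line` are hypothetical names, and `major_clade` as the replacement rank is just the proposal above.

```python
def kraken_rank(silva_rank):
    """Map a SILVA rank to the rank that would be written to nodes.dmp."""
    if silva_rank == "domain":
        # Existing behavior: SILVA domains become superkingdoms.
        return "superkingdom"
    if silva_rank == "superkingdom":
        # Assumption: relabel SILVA's own superkingdoms so they no longer
        # collide with the converted domains.
        return "major_clade"
    return silva_rank


def parse_silva_line(line):
    """Return (path, node_id, rank) from one tab-separated taxonomy row."""
    path, node_id, rank = line.rstrip("\n").split("\t")[:3]
    return path, node_id, rank


# Rows modeled on tax_slv_ssu_138.1.txt (tab-separated layout assumed).
sample = [
    "Eukaryota;\t4\tdomain\t\t",
    "Eukaryota;Amorphea;Obazoa;Opisthokonta;\t46958\tmajor_clade\t\t138",
    "Eukaryota;Amorphea;Obazoa;Opisthokonta;Holozoa;\t46959\tsuperkingdom\t\t138",
]

for line in sample:
    path, node_id, rank = parse_silva_line(line)
    print(node_id, kraken_rank(rank), path, sep="\t")
```

With this mapping only the three domains would end up as superkingdom, and Holozoa and Nucletmycea would keep a rank distinct from them.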

Thomas-Bcp commented 1 year ago

The consequence in the Kraken report is that D, and potentially D1, D2, ..., can appear more than once within a single lineage, which can be problematic for downstream analysis:

    R   1   root
    D   4     Eukaryota
    D1  46848       Amorphea
    D2  46957         Obazoa
    D3  46958           Opisthokonta
    D   47567             Nucletmycea
    K   47572               Fungi
    K1  47666                 Dikarya
    P   47667                   Ascomycota
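
A report can be screened for this with a short Python sketch like the one below. It assumes the report is tab-separated, that rank code, taxid and name are the last three columns, and that the name is indented two spaces per level. `repeated_rank_codes` is a hypothetical helper, not a Kraken 2 tool.

```python
def repeated_rank_codes(report_lines):
    """Return (taxid, name, code, ancestor_name) for nodes whose rank code
    already occurs on one of their ancestors."""
    lineage = []  # stack of (depth, code, name) for the current path
    hits = []
    for line in report_lines:
        code, taxid, raw_name = line.rstrip("\n").split("\t")[-3:]
        code, taxid = code.strip(), taxid.strip()
        name = raw_name.lstrip(" ")
        depth = (len(raw_name) - len(name)) // 2
        # Drop stack entries that are not ancestors of the current node.
        while lineage and lineage[-1][0] >= depth:
            lineage.pop()
        for _, anc_code, anc_name in lineage:
            if anc_code == code:
                hits.append((taxid, name, code, anc_name))
                break
        lineage.append((depth, code, name))
    return hits


# Simplified three-column rows modeled on the excerpt above.
report = [
    "R\t1\troot",
    "D\t4\t  Eukaryota",
    "D1\t46848\t    Amorphea",
    "D2\t46957\t      Obazoa",
    "D3\t46958\t        Opisthokonta",
    "D\t47567\t          Nucletmycea",
    "K\t47572\t            Fungi",
]

for taxid, name, code, ancestor in repeated_rank_codes(report):
    print(f"{name} ({taxid}) repeats rank code {code} already used by {ancestor}")
```

On the sample rows this flags Nucletmycea, whose code D is already used by Eukaryota.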

nggvs commented 1 week ago

Thank you!