fbreitwieser / pavian

🌈 Interactive analysis of metagenomics data
https://doi.org/10.1093/bioinformatics/btz715

Feature request: Kaiju reports support #11

Closed: TimSkvortsov closed this issue 6 years ago

TimSkvortsov commented 7 years ago

Hi,

I was wondering if it would be possible to add support for reports produced by Kaiju. Pavian is an amazing piece of visualisation software, thank you very much for coding it.

fbreitwieser commented 7 years ago

Hi @TimofeySkvortsov, I think it is a good idea. Can you provide some information on Kaiju's report format?

TimSkvortsov commented 7 years ago

The output format is described in Kaiju's manual: https://github.com/bioinformatics-centre/kaiju#output-format

Kaiju can generate two report types, default and verbose. Both are somewhat similar to Kraken's output format.

Kaiju's default report has three columns separated by tabs:

C   D00420:130:H2TWLBCXY:1:1101:8346:2236   1333523
C   D00420:130:H2TWLBCXY:1:1101:11269:2189  756883
C   D00420:130:H2TWLBCXY:1:1101:14118:2184  186196
C   D00420:130:H2TWLBCXY:1:1101:14463:2150  86177
U   D00420:130:H2TWLBCXY:1:1101:14318:2153  0
C   D00420:130:H2TWLBCXY:1:1101:14663:2169  2157
C   D00420:130:H2TWLBCXY:1:1101:16736:2123  131567
C   D00420:130:H2TWLBCXY:1:1101:18392:2654  86177
U   D00420:130:H2TWLBCXY:1:1101:20170:2563  0

Kaiju's verbose report has seven columns separated by tabs; the first three are the same as in the default report:

C   D00420:130:H2TWLBCXY:1:1101:1400:2243   1644061 11  44470,222984,253107,406552,1227497, AFO59089.1,ELY59679.1,WP_007107841.1,WP_007258503.1,WP_083909286.1, HSDDFSRRTYE,
U   D00420:130:H2TWLBCXY:1:1101:8960:2181   0
C   D00420:130:H2TWLBCXY:1:1101:11269:2189  756883  14  756883, WP_014051658.1, VRFGTESGVRADMQ,VRFGTESGVRADMQ,
C   D00420:130:H2TWLBCXY:1:1101:14065:2135  2237    44  2237,   WP_004958731.1,WP_008307830.1,  MIELLYAISTLVFVVAGLTMVGMAMRAYVQTSRQAMLHLSVGFS,
U   D00420:130:H2TWLBCXY:1:1101:14318:2153  0
U   D00420:130:H2TWLBCXY:1:1101:14296:2174  0
C   D00420:130:H2TWLBCXY:1:1101:17483:2230  1744    11  1744,   WP_055345270.1, LRSGRTARRPR,LRSGRTARRPR,
U   D00420:130:H2TWLBCXY:1:1101:18119:2213  0
U   D00420:130:H2TWLBCXY:1:1101:19659:2188  0
C   D00420:130:H2TWLBCXY:1:1101:20071:2103  1194090 31  1194090,    WP_073062635.1, GIPPLAGFFSKDEILAFTFNAGFGEFAGSLY,GIPPLAGFFSKDEILAFTFNAGFGEFAGSLY,

The columns are:

  1. either C or U, indicating whether the read is classified or unclassified.
  2. name of the read
  3. NCBI taxon identifier of the assigned taxon
  4. the length or score of the best match used for classification
  5. the taxon identifiers of all database sequences with the best match
  6. the accession numbers of all database sequences with the best match
  7. matching fragment sequence(s)
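
As a rough illustration of the column layout above, here is a minimal Python sketch that reads Kaiju output lines and counts reads per assigned taxon ID. It assumes the real file is tab-separated, as described above (the samples in this comment are shown with spaces), and that unclassified reads carry only the first three columns, as in the verbose sample. The names KaijuRead, parse_kaiju_line and count_reads_per_taxon are made up for this example and are not part of Kaiju or Pavian.

```python
from collections import Counter
from typing import NamedTuple, Tuple


class KaijuRead(NamedTuple):
    classified: bool          # column 1: "C" (True) or "U" (False)
    read_name: str            # column 2: name of the read
    taxon_id: int             # column 3: NCBI taxon ID (0 in the unclassified samples)
    details: Tuple[str, ...]  # columns 4-7 as raw strings (verbose output only)


def parse_kaiju_line(line: str) -> KaijuRead:
    """Parse one tab-separated line of Kaiju output (default or verbose)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3 or fields[0] not in ("C", "U"):
        raise ValueError(f"not a Kaiju output line: {line!r}")
    return KaijuRead(
        classified=(fields[0] == "C"),
        read_name=fields[1],
        taxon_id=int(fields[2]),
        details=tuple(fields[3:]),
    )


def count_reads_per_taxon(lines) -> Counter:
    """Count how many reads were assigned to each taxon ID."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip empty lines
            counts[parse_kaiju_line(line).taxon_id] += 1
    return counts
```

Called on an open file, e.g. count_reads_per_taxon(open("kaiju.out")), this gives per-taxon read counts only; turning those into a full taxonomy tree would still need the NCBI taxonomy files.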

Hope it helps.

fbreitwieser commented 7 years ago

What does the Kaiju classification summary generated by kaijuReport look like? Currently Pavian needs the taxonomy information to be in the result file.

Especially with the option -p, which prints the full taxon path instead of just the taxon name.

TimSkvortsov commented 7 years ago

It looks something like this:


        %       reads   phylum
-------------------------------------------
49.669548     2600324   cellular organisms; Archaea; Euryarchaeota; 
19.288523     1009802   cellular organisms; Bacteria; Proteobacteria; 
 5.381846      281753   cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; 
 1.475155       77228   cellular organisms; Bacteria; FCB group; Bacteroidetes/Chlorobi group; Bacteroidetes; 
 1.220038       63872   cellular organisms; Archaea; DPANN group; Candidatus Nanohaloarchaeota; 
 1.154883       60461   cellular organisms; Bacteria; Balneolaeota; 
 1.129727       59144   cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya; Ascomycota; 
 1.114159       58329   cellular organisms; Bacteria; Terrabacteria group; Firmicutes; 
 0.705105       36914   cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya; Basidiomycota; 
 0.264190       13831   cellular organisms; Eukaryota; Alveolata; Apicomplexa; 
 0.217678       11396   cellular organisms; Bacteria; Terrabacteria group; Cyanobacteria/Melainabacteria group; Cyanobacteria; 
 0.160814        8419   cellular organisms; Eukaryota; Viridiplantae; Chlorophyta; 
 0.126947        6646   cellular organisms; Bacteria; Terrabacteria group; Chloroflexi; 
 0.124980        6543   cellular organisms; Bacteria; PVC group; Planctomycetes; 
 0.112144        5871   cellular organisms; Bacteria; Acidobacteria; 
######### here I removed several rows #########
 0.000019           1   cellular organisms; Eukaryota; Stramenopiles; PX clade; Xanthophyceae; 
 0.000019           1   cellular organisms; Bacteria; unclassified Bacteria; Bacteria candidate phyla; Patescibacteria group; Parcubacteria group; Candidatus Jacksonbacteria; 
-------------------------------------------
 0.551168       28855   Viruses
16.250864      850773   cannot be assigned to a phylum 
-------------------------------------------
31.413546     2397816   unclassified
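
For anyone who wants to read this summary programmatically, here is a small Python sketch based only on the layout shown above. It assumes each data row is "percent, read count, label" separated by whitespace, and that a taxon path is a list of names separated by semicolons. Header, separator and comment lines are skipped. The function name parse_kaiju_summary is made up for this example.

```python
import re

# percent, read count, then the label (taxon path or a special row such as "unclassified")
ROW = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(\d+)\s+(.*\S)\s*$")


def parse_kaiju_summary(lines):
    """Return a list of (percent, reads, path) tuples from a kaijuReport summary."""
    rows = []
    for line in lines:
        match = ROW.match(line)
        if match is None:
            continue  # header, dashed separator, or any other non-data line
        percent, reads, label = match.groups()
        # Split the semicolon-separated path and drop the empty trailing element
        path = [part.strip() for part in label.split(";") if part.strip()]
        rows.append((float(percent), int(reads), path))
    return rows
```

Rows such as "Viruses" or "unclassified" come back with a single-element path, so they would need separate handling if the goal is to build a taxonomy tree.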

fbreitwieser commented 7 years ago

Thanks for the info. I'll add it in the next version of Pavian.

TimSkvortsov commented 7 years ago

Great, thank you very much!

fbreitwieser commented 6 years ago

Hi @TimofeySkvortsov, sorry for the late response. So far I have not succeeded in importing the report file itself, but I think the Kaiju output file can easily be converted into a Kraken-style report. Can you try running kraken-report on the Kaiju output file, with the --db argument pointing to the parent directory of the NCBI taxonomy dump?

devindrown commented 6 years ago

I tried this and it works on the raw output from Kaiju.

fbreitwieser commented 6 years ago

Thanks for testing, @devindrown! I'll add a section to the README.

mbhall88 commented 5 years ago

Just to clarify: when you say "with the --db argument pointing to the parent directory of the NCBI taxonomy dump", do you mean the Kraken NCBI taxonomy dump?

Because when I point it at the kaiju database directory I get the error

kraken-report: database ("kaijudb/") does not contain necessary file database.kdb

tonallint commented 5 years ago

I have the same problem.

ctlshcxy commented 4 years ago

Excuse me, have you solved this problem? I met the same problem.

valeriafloral commented 3 years ago

I pointed it to a database that I had used with KrakenUniq, and it worked.

I used:

kraken-report --db path/to/db path/to/kaiju.out > path/to/kaiju.tsv

Contents of the KrakenUniq database directory (from ls):

database.kdb
database.idx
taxonomy/nodes.dmp
taxonomy/names.dmp
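
To check a directory before passing it to --db, something like the following Python sketch may help. It only looks for the four files from the listing above. The error message quoted earlier in this thread confirms that database.kdb is required; whether kraken-report strictly needs the other three is an assumption here. The names EXPECTED_FILES and missing_db_files are made up for this example.

```python
from pathlib import Path

# Files from the directory listing above. Only database.kdb is confirmed as
# required (by the kraken-report error message); the rest are assumed.
EXPECTED_FILES = (
    "database.kdb",
    "database.idx",
    "taxonomy/nodes.dmp",
    "taxonomy/names.dmp",
)


def missing_db_files(db_dir):
    """Return the expected files that are not present under db_dir."""
    root = Path(db_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means the directory has the same layout as the one that worked here; a Kaiju database directory without database.kdb would be reported as incomplete, which matches the error others ran into.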