Ivarz / Conifer

Calculate confidence scores from Kraken2 output
BSD 2-Clause "Simplified" License

Dependencies

gcc

zlib

cmake (only needed to build the tests)

Building

git clone https://github.com/Ivarz/Conifer && cd Conifer
git submodule update --init --recursive
make

To build the tests, run:

make tests

Building a Docker image

To build a Docker image, follow the instructions at conifer-docker (thanks to @Midnighter).

Basic usage

To use this tool you need the standard output file from Kraken2 and the taxonomy database file (taxo.k2d). The following command calculates a confidence score for each classified read. Note that this output does not include a header line. For paired-end reads, the confidence score of each read and the average of the two are reported. Only classified reads are reported by default.

./conifer -i test_files/example.out.txt -d test_files/taxo.k2d
Kraken standard output read1 confidence score read2 confidence score average confidence score
C V100006960L1C001R001000420 853 100|100 0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18 |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12 0.1515 0.3939 0.2727
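
For downstream processing, each output line can be split back into the Kraken2 fields and the appended scores. The sketch below assumes that the line is tab-delimited, as Kraken2 standard output is, and that the scores are appended as the trailing columns. The function name parse_conifer_line is hypothetical and is not part of Conifer.

```python
def parse_conifer_line(line):
    """Split one tab-delimited output line into Kraken2 fields and scores."""
    fields = line.rstrip("\n").split("\t")
    # Kraken2 standard output has five columns: classification status,
    # read ID, taxid, read length(s), and the k-mer mapping string.
    status, read_id, taxid, lengths, kmers = fields[:5]
    # Assumption: every remaining column is a numeric score.
    scores = [float(value) for value in fields[5:]]
    return {
        "status": status,
        "read_id": read_id,
        "taxid": taxid,
        "lengths": lengths,
        "kmers": kmers,
        "scores": scores,
    }


# The example read shown above, with its columns joined by tabs.
example = "\t".join([
    "C",
    "V100006960L1C001R001000420",
    "853",
    "100|100",
    "0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18"
    " |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12",
    "0.1515",
    "0.3939",
    "0.2727",
])
record = parse_conifer_line(example)
print(record["taxid"], record["scores"])  # 853 [0.1515, 0.3939, 0.2727]
```

With --both_scores the same function would return six scores instead of three, under the same assumption about column order.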

Use the --rtl option to obtain RTL scores.

./conifer --rtl -i test_files/example.out.txt -d test_files/taxo.k2d
Kraken standard output read1 RTL score read2 RTL score average RTL score
C V100006960L1C001R001000420 853 100|100 0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18 |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12 0.3636 0.6364 0.5000

Use the --both_scores option to obtain confidence and RTL scores simultaneously.

./conifer --both_scores -i test_files/example.out.txt -d test_files/taxo.k2d
Kraken standard output read1 confidence score read2 confidence score average confidence score read1 RTL score read2 RTL score average RTL score
C V100006960L1C001R001000420 853 100|100 0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18 |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12 0.1515 0.3939 0.2727 0.3636 0.6364 0.5000

To calculate the 25th, 50th and 75th percentiles of the confidence score for each assigned taxon, use the -s option. For paired-end reads, the average score of each pair is summarized. For brevity, only the first 5 lines of the summary are shown.

./conifer -s -i test_files/example.out.txt -d test_files/taxo.k2d
taxon_name taxid reads P25 P50 P75
Faecalibacterium prausnitzii 853 3 0.2200 0.2730 0.4320
Anaerobutyricum hallii 39488 1 0.5000 0.5000 0.5000
Lachnospiraceae 186803 1 0.5000 0.5000 0.5000
Clostridiales 186802 3 0.4920 0.7200 1.0000

A similar report can be generated for RTL scores:

./conifer --rtl -s -i test_files/example.out.txt -d test_files/taxo.k2d
taxon_name taxid reads P25 P50 P75
Faecalibacterium prausnitzii 853 3 0.3480 0.3480 0.4320
Anaerobutyricum hallii 39488 1 0.5000 0.5000 0.5000
Lachnospiraceae 186803 1 0.5000 0.5000 0.5000
Clostridiales 186802 3 0.7200 1.0000 1.0000

Both scores can also be summarized simultaneously:

./conifer --both_scores -s -i test_files/example.out.txt -d test_files/taxo.k2d
taxon_name taxid reads P25_conf P50_conf P75_conf P25_rtl P50_rtl P75_rtl
Faecalibacterium prausnitzii 853 3 0.2200 0.2730 0.4320 0.3480 0.3480 0.4320
Anaerobutyricum hallii 39488 1 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000
Lachnospiraceae 186803 1 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000
Clostridiales 186802 3 0.4920 0.7200 1.0000 0.7200 1.0000 1.0000
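
The per-taxon summaries above can be approximated outside the tool as follows. This is a minimal sketch, not Conifer's implementation: it groups per-read (or per-pair average) scores by taxid and computes percentiles by linear interpolation, which is an assumption, since the interpolation method Conifer uses is not stated here. The function names and the input scores are hypothetical.

```python
from collections import defaultdict


def percentile(sorted_values, q):
    """Percentile q (0-100) of a sorted list, using linear interpolation."""
    if not sorted_values:
        raise ValueError("need at least one value")
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    low_value = sorted_values[lower]
    return low_value + (sorted_values[upper] - low_value) * fraction


def summarize(records):
    """Map taxid -> (read count, P25, P50, P75) from (taxid, score) pairs."""
    scores_by_taxid = defaultdict(list)
    for taxid, score in records:
        scores_by_taxid[taxid].append(score)
    summary = {}
    for taxid, scores in scores_by_taxid.items():
        scores.sort()
        quartiles = tuple(percentile(scores, q) for q in (25, 50, 75))
        summary[taxid] = (len(scores),) + quartiles
    return summary


# Hypothetical scores, for illustration only.
records = [(853, 0.1), (853, 0.4), (853, 0.2), (39488, 0.5)]
for taxid, (reads, p25, p50, p75) in summarize(records).items():
    print(f"{taxid}\t{reads}\t{p25:.4f}\t{p50:.4f}\t{p75:.4f}")
```

A taxon with a single read yields the same value for all three percentiles, which matches the pattern of the single-read rows in the reports above.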

Note on score calculation

Schematic representation of confidence and RTL score calculation from the classification tree. White nodes represent the final assigned taxon. Numbers indicate the count of read k-mers assigned to a particular taxon. The confidence score is calculated as the fraction of k-mers assigned to the final taxon and its descendants, as denoted by the blue rectangle (left); the RTL score is calculated from the descendants and ancestors of the final taxon (right).
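
The confidence score described above can be worked through on the example read from the usage section. The sketch below is an illustration under stated assumptions, not Conifer's code: the child-to-parent map is a toy stand-in for the lineage stored in taxo.k2d, and it assumes that taxid 748224 is a descendant of 853 and that unclassified k-mers (taxid 0) count toward the total. Under those assumptions the sketch reproduces the scores reported for that read. All function names are hypothetical.

```python
def parse_kmer_string(kmers):
    """Turn '0:16 853:8 ...' into a list of (taxid, count) pairs."""
    pairs = []
    for token in kmers.split():
        taxid, count = token.split(":")
        pairs.append((int(taxid), int(count)))
    return pairs


def in_clade(taxid, root, parent):
    """True if taxid is root itself or has root among its ancestors."""
    while taxid is not None:
        if taxid == root:
            return True
        next_taxid = parent.get(taxid)
        if next_taxid == taxid:  # guard against a self-parented root node
            break
        taxid = next_taxid
    return False


def confidence_score(kmers, assigned, parent):
    """Fraction of k-mers assigned to the final taxon or its descendants."""
    pairs = parse_kmer_string(kmers)
    total = sum(count for _, count in pairs)
    hits = sum(count for taxid, count in pairs
               if in_clade(taxid, assigned, parent))
    return hits / total if total else 0.0


# Toy child -> parent map (assumption; the real lineage is in taxo.k2d).
parent = {748224: 853}

read1 = "0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18"
read2 = "748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12"

score1 = confidence_score(read1, 853, parent)  # 10 of 66 k-mers
score2 = confidence_score(read2, 853, parent)  # 26 of 66 k-mers
print(round(score1, 4), round(score2, 4))  # 0.1515 0.3939
print(round((score1 + score2) / 2, 4))     # 0.2727
```

The RTL score is not sketched here, because the caption alone does not specify how the ancestor and descendant counts are combined.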