Added reference sequences unclassified when used as queries

mihkelvaher commented 3 years ago

Following this guide: https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#custom-databases As a test, I'm trying to add sequences to an existing database: I use the viral database as a base and then add two plant contigs (~750 nt and ~11k nt). After the addition, the same sequences still cannot be classified when used as queries.

Download taxonomy and create a small viral database:

./kraken2-build --download-taxonomy --db db_test
./kraken2-build --download-library viral --db db_test/
./kraken2-build --build --db db_test/

Check if there are any matches before adding:

-bash-4.2$ ./kraken2/kraken2 --threads 10 --db ./kraken2/db_test/ testadd1.fa --report rep1 --output out1
Loading database information... done.
1 sequences (0.00 Mbp) processed in 0.002s (28.9 Kseq/m, 22.60 Mbp/m).
  0 sequences classified (0.00%)
  1 sequences unclassified (100.00%)
-bash-4.2$ ./kraken2/kraken2 --threads 10 --db ./kraken2/db_test/ testadd2.fa --report rep2 --output out2
Loading database information... done.
1 sequences (0.01 Mbp) processed in 0.005s (11.6 Kseq/m, 126.89 Mbp/m).
  0 sequences classified (0.00%)
  1 sequences unclassified (100.00%)

Add the sequences:

./kraken2-build --add-to-library ../testadd1.fa --db db_test/
./kraken2-build --add-to-library ../testadd2.fa --db db_test/
./kraken2-build --build --db db_test/

Try to classify the sequences again:

-bash-4.2$ ./kraken2/kraken2 --threads 10 --db ./kraken2/db_test/ testadd1.fa --report rep1 --output out1
Loading database information... done.
1 sequences (0.00 Mbp) processed in 0.002s (28.5 Kseq/m, 22.30 Mbp/m).
  0 sequences classified (0.00%)
  1 sequences unclassified (100.00%)
-bash-4.2$ ./kraken2/kraken2 --threads 10 --db ./kraken2/db_test/ testadd1.fa --report rep2 --output out2
Loading database information... done.
1 sequences (0.00 Mbp) processed in 0.005s (12.3 Kseq/m, 9.64 Mbp/m).
  0 sequences classified (0.00%)
  1 sequences unclassified (100.00%)

Still no match. Am I doing something wrong?

Kraken version 2.1.0

mihkelvaher commented 3 years ago

Turns out I wasn't following the guide line by line. The build and the classification succeed if the intermediate build step is skipped: first add all sequences of interest, and only then build.
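A minimal sketch of that order of operations (add everything, then build once). The function name add_all_then_build and the KRAKEN2_BUILD variable are illustrative and not part of Kraken 2; only the --add-to-library, --build and --db options come from the commands used in this thread.

```shell
#!/usr/bin/env bash
# Sketch: add every FASTA file to the library first, then build exactly once.
# KRAKEN2_BUILD and add_all_then_build are hypothetical helper names.
KRAKEN2_BUILD="${KRAKEN2_BUILD:-./kraken2-build}"

add_all_then_build() {
    local db="$1"
    shift
    local fasta
    # Add all sequences of interest before any build step.
    for fasta in "$@"; do
        "$KRAKEN2_BUILD" --add-to-library "$fasta" --db "$db" || return 1
    done
    # Build only after the library is complete.
    "$KRAKEN2_BUILD" --build --db "$db"
}
```

Assuming the taxonomy and the viral library have already been downloaded into db_test/, it could be called as: add_all_then_build db_test/ ../testadd1.fa ../testadd2.fa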

This leaves the question: could the existing database be updated over time AFTER the --build command has been run?

Edit: Got my answer from https://github.com/DerrickWood/kraken2/issues/221#issuecomment-644279600 - the database needs to be built from scratch. It would be nice to have this confirmed by the devs before closing the issue. Also, I think it's a comment worth adding to the manual.

jenniferlu717 commented 3 years ago

The database does have to be rebuilt from scratch. Removing any *.k2d files and then rebuilding will work.

Also, I believe the read sequence file has to be the LAST argument on your command line (i.e. when running kraken2, specify --report myreport.txt before specifying testadd1.fa).

mihkelvaher commented 3 years ago

Thanks!

After some testing, I found out that in addition to *.k2d files, seqid2taxid.map also needs to be removed in order to add new seqs to the db.
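The cleanup described above could be sketched as follows. The function name reset_kraken2_build is made up for illustration; it only deletes the two kinds of files named in this thread (*.k2d and seqid2taxid.map) and leaves the library/ and taxonomy/ folders alone.

```shell
#!/usr/bin/env bash
# Sketch: remove the products of a previous build so that new sequences
# can be added and the database rebuilt from scratch.
# reset_kraken2_build is a hypothetical helper, not a Kraken 2 command.
reset_kraken2_build() {
    local db="$1"
    # -f keeps rm quiet when a file (or the whole glob) does not exist.
    rm -f "$db"/*.k2d "$db"/seqid2taxid.map
}
```

After reset_kraken2_build db_test, the new sequences would be added with --add-to-library and the database rebuilt with ./kraken2-build --build --db db_test/ as before.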

As far as I can see, rsync is used for downloading sequences from NCBI. Does this imply that the unbuilt database could be updated with new NCBI sequences (without downloading everything again) by rerunning the command ./kraken2-build --download-library viral --db db_test/ ?

Also, it seems that the reads file can appear anywhere among the arguments. It's probably assumed that any argument not attached to a flag is the reads file.

jenniferlu717 commented 3 years ago

I believe you have to redownload all of the sequences. The way the library download command works is that all of the sequences are put into the same library.fna file.

If you know what the new sequences are, you can download those separately and add them to the database with kraken2-build --add-to-library $file, as long as the sequence maps are in the taxonomy/ folder.
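For custom contigs whose IDs are not NCBI accessions, the accession maps cannot resolve a taxon. As I understand the custom-databases section of the manual, the taxonomy ID can instead be written into the FASTA header in the form |kraken:taxid|<ID>. A rough sketch of a header rewriter follows; the function name tag_fasta_with_taxid is hypothetical, and the taxid must be chosen by the user for their own sequences.

```shell
#!/usr/bin/env bash
# Sketch: append "|kraken:taxid|<taxid>" to the ID of every FASTA header,
# keeping any description text that follows the ID.
# tag_fasta_with_taxid is a hypothetical helper, not part of Kraken 2.
tag_fasta_with_taxid() {
    local taxid="$1" fasta="$2"
    awk -v taxid="$taxid" '
        /^>/ {
            id = $1                              # ">seqid", up to the first whitespace
            rest = substr($0, length(id) + 1)    # description, if any
            print id "|kraken:taxid|" taxid rest
            next
        }
        { print }                                # sequence lines pass through unchanged
    ' "$fasta"
}
```

For example, tag_fasta_with_taxid 3702 testadd1.fa > testadd1.tagged.fa would tag every record with taxid 3702 (Arabidopsis thaliana, used here only as a placeholder); the tagged file would then be passed to --add-to-library.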

mihkelvaher commented 3 years ago

Thanks!

DerrickWood / kraken2

Added reference sequences unclassified when used as queries #357