DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Can I add new sequences to an already-built database without rebuilding it from scratch? #212

Closed depancao closed 4 years ago

depancao commented 4 years ago

Can I add new sequences to an already-built database without rebuilding it from scratch? https://github.com/DerrickWood/kraken2/issues/45 says the answer is no. But I would like to know whether I can inspect an already-built database, add some new sequences, and then rebuild it. Properly downloading every genome for a giant DB like maxikraken2 is impossible for me.

sconlan commented 4 years ago

I have had success running the --download-library XYZ commands first to build up the library directories. After that, I just run something like: kraken2-build --build --db ./ --threads 24

I see no reason why you couldn't remove the database files from an already built database:

hash.k2d  opts.k2d  seqid2taxid.map  taxo.k2d

Then use the "kraken2-build --add-to-library" command to add your updated genomes and then rerun the build command. You'd still have to wait for the build to rerun but you wouldn't have to wait for the downloads.
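The steps above can be sketched as a small shell function. This is only an illustration of the remove / add / rebuild cycle, not a tested recipe: the database path and FASTA file names are hypothetical, the `run` helper and `DRY_RUN` switch are additions for previewing the commands, and the approach presumably only helps when the library/ and taxonomy/ directories are still present next to the built files.

```shell
#!/usr/bin/env bash
# Sketch of the remove / add / rebuild cycle described above.
# Paths and file names are hypothetical. Set DRY_RUN=1 to print the
# commands instead of executing them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rebuild_with_additions() {
    local db="$1" threads="$2"
    shift 2
    local f
    # Remove the built database files; the library and taxonomy
    # directories are left untouched.
    for f in hash.k2d opts.k2d seqid2taxid.map taxo.k2d; do
        run rm -f "$db/$f"
    done
    # Register each new FASTA file with the existing library.
    for f in "$@"; do
        run kraken2-build --add-to-library "$f" --db "$db"
    done
    # Rebuild from the library that is already on disk.
    run kraken2-build --build --db "$db" --threads "$threads"
}

# Example (hypothetical paths):
#   DRY_RUN=1 rebuild_with_additions ./my_db 24 new_genome1.fna new_genome2.fna
```

Running it with `DRY_RUN=1` first shows exactly which files would be deleted and which `kraken2-build` calls would follow, which seems prudent before touching a database that took a long time to build.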

depancao commented 4 years ago

I'd like to use all RefSeq bacterial genomes, which is 180990 fna.gz files. Building such a big database also takes a lot of computing resources.

What I want to do is add more human polymorphism sequences to the maxikraken DB, to reduce the misclassification of human reads as Mycobacterium tuberculosis. As far as I know, a Kraken 2 DB does not store the original sequences; it stores deduplicated k-mers and the LCA taxon of each. So I want to know: is it possible to simply add more sequences or k-mers to an already-built Kraken 2 DB, without rebuilding it from the sequences?

jenniferlu717 commented 4 years ago

Unfortunately, the answer is no, you cannot simply add sequences/kmers to an already-built database. You will need to rebuild the database.

The reason for rebuilding is how the k-mers are stored in memory. If your new sequences contain k-mers that belong in between existing ones, the build cannot simply shift the existing k-mers to a new memory location.
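Since a rebuild is needed either way, any extra sequences have to enter the library through `kraken2-build --add-to-library`. A minimal sketch of the preparation step follows, assuming the added FASTA records carry no NCBI accession that Kraken 2 can map on its own, so that the taxon is stated explicitly with the `kraken:taxid` header convention from the Kraken 2 manual. The function name and the input file are hypothetical; 9606 is the NCBI taxonomy ID for Homo sapiens.

```shell
#!/usr/bin/env bash
# Append "|kraken:taxid|<taxid>" to the sequence ID of every FASTA
# header read from stdin; sequence lines pass through unchanged.
# The function name is illustrative, not part of Kraken 2.
tag_fasta_with_taxid() {
    local taxid="$1"
    awk -v taxid="$taxid" '
        /^>/ {
            # Split the header into the ID (first word) and the
            # remainder of the description, if any.
            id = $1
            rest = substr($0, length($1) + 1)
            print id "|kraken:taxid|" taxid rest
            next
        }
        { print }
    '
}

# Example (hypothetical file names):
#   tag_fasta_with_taxid 9606 < human_variants.fna > human_variants.tagged.fna
#   kraken2-build --add-to-library human_variants.tagged.fna --db ./my_db
```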