Selective k-mer generation?

fbreitwieser / krakenuniq

🐙 KrakenUniq: Metagenomics classifier with unique k-mer counting for more specific results

GNU General Public License v3.0

Dear KrakenUniq team,

I have a theoretical question, and I was wondering if you could suggest an approach. I am using KrakenUniq to classify bacterial reads from RNA-seq experiments. The database is a bit too large, and from what I understand, --max-db-size shrinks it by keeping only a random fraction of the k-mers (e.g. every 2nd or 3rd k-mer). Would it be possible to keep all of the k-mers from the rRNA portions of the genomes, but scale down the rest of the database? I have a RefSeq annotation, so I can split the FASTA into rRNA and non-rRNA parts, but I am not sure how one can mix the two.

Thank you in advance!

-- Alex

The only way to use all the rRNA k-mers and scale down the rest would be to make your own custom FASTA files for the genomes, sampling the k-mers yourself, and then build a database from those files. You could include the full rRNA sequence and then choose a fraction of the k-mers for the rest.

A simpler strategy is to use the new parameter --preload-size, which lets you use a huge database with any amount of RAM. I've run a 420GB database on a laptop with 32GB of RAM this way, by using --preload-size 20G. KrakenUniq then reads in 20GB of the database, classifies all the reads and saves a temp file, then reads the next 20GB and re-classifies the reads, and so on. It's a bit slower but still quite fast.
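To make the custom-FASTA idea concrete, here is a minimal Python sketch of one way the sampling could be done. It is not part of KrakenUniq and not necessarily what the reply had in mind. The function names, the hash-based sampling rule, the 0-based half-open rRNA intervals, the default k=31 and the one-record-per-k-mer output are all assumptions made for illustration.

```python
import hashlib
import sys

# Hypothetical helper names; nothing here is a KrakenUniq API.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def keep_kmer(kmer, fraction):
    """Deterministically keep roughly `fraction` of all k-mers.

    The decision depends only on a hash of the canonical k-mer, so the same
    k-mer is kept or dropped consistently across genomes and strands.
    """
    digest = hashlib.blake2b(canonical(kmer).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") < fraction * 2**64


def select_kmers(seq, rrna_intervals, k=31, fraction=0.5):
    """Yield (start, kmer) for every k-mer lying fully inside an rRNA interval,
    plus a hashed sample of the remaining k-mers.

    rrna_intervals: list of (start, end) pairs, 0-based, end exclusive (assumed).
    """
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if set(kmer) - set("ACGT"):
            continue  # skip k-mers containing N or other ambiguity codes
        in_rrna = any(s <= start and start + k <= e for s, e in rrna_intervals)
        if in_rrna or keep_kmer(kmer, fraction):
            yield start, kmer


def write_sampled_fasta(seq_id, seq, rrna_intervals, handle, k=31, fraction=0.5):
    """Write each selected k-mer as its own FASTA record; return the record count."""
    count = 0
    for start, kmer in select_kmers(seq, rrna_intervals, k, fraction):
        handle.write(f">{seq_id}_{start}\n{kmer}\n")
        count += 1
    return count


if __name__ == "__main__":
    # Toy sequence with a pretend rRNA region at positions 0-40.
    toy = "ACGTTGCAAGGCTTAACCGG" * 5
    write_sampled_fasta("toy_genome", toy, [(0, 40)], sys.stdout, fraction=0.3)
```

Note that every record written this way would still need to be mapped to the right taxon when the database is built, and one record per k-mer produces very large FASTA files. Writing out contiguous stretches of kept sequence instead might be more practical, at the cost of less even sampling.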

Selective k-mer generation? #186