COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon

Question about kmer size #510

Closed · gtollefson closed this 4 years ago

gtollefson commented 4 years ago

This is a question about a strange message in the log output after running Salmon indexing on a very small genome with my own generated transcriptome and decoy.

I'm running Salmon v1.0.0 index on the transcriptome of Candida parapsilosis, which has a small genome of 26 Mbp. I created the transcriptome by running Cufflinks gffread on my reference genome FASTA and GFF3. I created the decoy by concatenating the whole genome to the transcriptome, as described in the manual.
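For reference, this is roughly what my decoy and gentrome construction amounts to (a minimal Python sketch of the recipe from the manual; the file names are placeholders for my actual files):

```python
genome_fa = "genome.fa"            # placeholder: my reference genome
transcripts_fa = "transcripts.fa"  # placeholder: gffread output

# decoys.txt: one genome sequence name per line.
with open(genome_fa) as g, open("decoys.txt", "w") as out:
    for line in g:
        if line.startswith(">"):
            out.write(line[1:].split()[0] + "\n")

# gentrome.fa: the transcriptome first, then the whole genome as decoy.
with open("gentrome.fa", "w") as out:
    for path in (transcripts_fa, genome_fa):
        with open(path) as f:
            out.write(f.read())
```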

I am running the index with the following options:

salmon index -t gentrome.fa.gz -d decoys.txt -p 12 -i cpar_salmon_index -k 31

After indexing with a k-mer size of 31 (-k 31), I see the following message in the log output: "filter size not provided. ntHll estimated 12754610 distinct k-mers, setting filter size to 2^28."

2^28 seems very high compared to the 31 bp I set with -k 31. I'm also curious why, after I set a k-mer size, it printed the message "filter size not provided."

I've pasted a more complete snippet of the log output below. Does everything look like it ran successfully? I'm concerned since I am running on a small genome with my own generated decoy and transcriptome. Does it look like it is running as it should, or is there a bug that I should provide more details about?

[puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[puff::index::jointLog] [info] ntHll estimated 12754610 distinct k-mers, setting filter size to 2^28
Threads = 12
Vertex length = 31
Hash functions = 5
Filter size = 268435456
Capacity = 2
Files: cpar_salmon_index/ref_k31_fixed.fa

Round 0, 0:268435456
Pass    Filling    Filtering
1       0          8
2       4          0
True junctions count = 18712
False junctions count = 40617
Hash table size = 59329
Candidate marks count = 141326

rob-p commented 4 years ago

Hi @gtollefson,

To understand this output a little better, it is necessary to understand what salmon is doing during indexing. My understanding is that you provided salmon with a transcriptome and a (small) decoy reference genome, and asked it to build an index with k-mer size k=31.

When you do this, salmon will do a few things. First, it will go over your input gentrome file, replacing ambiguous characters (e.g. N) with pseudo-random nucleotides. It will also report any transcripts shorter than the chosen k-mer size, and it will detect and remove (unless --keepDuplicates is passed) any identical / duplicate sequences.
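To make that concrete, here is a toy illustration of that cleanup step (my own sketch for this reply, not salmon's actual code; the function name and details are invented for illustration):

```python
import random

def clean_gentrome(records, k=31, keep_duplicates=False):
    """Toy version of the cleanup pass described above; `records`
    is an iterable of (name, sequence) pairs."""
    seen = set()                  # sequences already emitted, for dedup
    cleaned = []
    for name, seq in records:
        # Replace ambiguous characters (e.g. N) with pseudo-random
        # nucleotides.
        seq = "".join(c if c in "ACGT" else random.choice("ACGT")
                      for c in seq.upper())
        # Report sequences shorter than the chosen k-mer size.
        if len(seq) < k:
            print(f"{name}: shorter than k = {k}")
        # Remove exact duplicate sequences unless asked to keep them.
        if not keep_duplicates and seq in seen:
            print(f"{name}: duplicate sequence, removing")
            continue
        seen.add(seq)
        cleaned.append((name, seq))
    return cleaned
```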

After all of this, it will begin constructing the index in earnest. This is done by running a modified version of TwoPaCo to construct the compacted colored de Bruijn graph on the gentrome, which is then indexed using our pufferfish index.

The TwoPaCo algorithm that generates the compacted colored de Bruijn graph from the input sequence is based on a very elegant algorithm that couples a Bloom filter with an exact hash table, and makes two (or more) passes over the input to identify all of the junctions in the reference (which directly implies all unitigs). To make the algorithm work efficiently, one needs to have an estimate for the number of distinct k-mers that will be encountered in the reference sequence. If the estimate is too big, one wastes memory. If the estimate is too small, the Bloom filter is not big enough, it doesn't filter efficiently, and the algorithm ends up putting way too much data in the exact hash table.
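If it helps to see the shape of that idea, here is a deliberately simplified Python sketch (emphatically not TwoPaCo's actual algorithm: it finds repeated k-mers rather than true junctions, but it shows how a cheap, approximate first pass over a Bloom filter keeps the exact hash table restricted to a small candidate set that a second pass then verifies):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter for illustration; not tuned for performance."""
    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.blake2b(item.encode() + bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def repeated_kmers(seqs, k, filter_bits):
    """Pass 1 uses the Bloom filter to collect *candidate* repeats;
    pass 2 verifies them exactly, discarding false positives."""
    bloom = BloomFilter(filter_bits, num_hashes=5)
    candidates = set()                  # the exact table stays small
    for seq in seqs:                    # pass 1: approximate
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in bloom:           # seen before, or a false positive
                candidates.add(kmer)
            bloom.add(kmer)
    counts = dict.fromkeys(candidates, 0)
    for seq in seqs:                    # pass 2: exact verification
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return {c for c, n in counts.items() if n > 1}

print(repeated_kmers(["ACGTACGTTT"], k=4, filter_bits=1 << 10))
# ACGT appears twice -> {'ACGT'}
```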

In order to determine how to set the Bloom filter size appropriately, we take the following approach. If the filter size isn't provided directly, we need an estimate of the total number of distinct k-mers in the entire input (note: this is different from the k-mer size itself), so we make a call to a function defined in the ntCard library. ntCard is a method designed specifically for cardinality estimation of k-mers in sequencing data. Based on the estimated number of distinct k-mers, we use the standard equations (derived from the theory behind Bloom filters) to set the Bloom filter to the smallest possible size that still achieves a relatively low, pre-specified false positive rate.
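As a back-of-the-envelope check, you can reproduce the 2^28 figure from the textbook Bloom filter sizing formula; the target false positive rate below is an assumption of mine for illustration, not necessarily the exact constant salmon uses:

```python
import math

def bloom_filter_bits(n_distinct, fp_rate):
    """Optimal Bloom filter size in bits for n_distinct elements at a
    given false positive rate: m = -n * ln(p) / (ln 2)^2."""
    m = -n_distinct * math.log(fp_rate) / (math.log(2) ** 2)
    # Round up to the next power of two, as the log message suggests.
    return 2 ** math.ceil(math.log2(m))

n = 12_754_610                           # ntCard's estimate from your log
m = bloom_filter_bits(n, fp_rate=0.001)  # assumed target FP rate
print(f"2^{int(math.log2(m))} bits = {m / 8 / 1e6:.2f} MB")
# -> 2^28 bits = 33.55 MB, matching "Filter size = 268435456"
```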

The message you are seeing says that this estimate suggests the Bloom filter should be of size 2^28 bits, which is ~33.55 MB (pretty small, actually). This is because ntCard estimated 12,754,610 distinct k-mers (31-mers) in your input dataset, which doesn't seem unreasonable.

In short, I think the output you observe here seems completely reasonable and in line with the data you are providing. However, I understand how all of this output might not make sense if you're not familiar with everything going on behind the scenes. Please let me know if this answers your question.

--Rob

gtollefson commented 4 years ago

@rob-p Thank you very much for the clear explanation. That answers my question.