dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

Output HyperLogLog #60

Open lutfia95 opened 3 years ago

lutfia95 commented 3 years ago

Hi,

I have a question about the output from HLL. I run Dashing with HyperLogLog like this: ./dashing hll -k15 -p2 -S24 read.fastq reference.fasta

The output from HLL is then: Estimated number of unique exact matches: 2925637.000000

What kind of matches does HLL count? I thought it counted the k-mer matches between the inputs (read and reference). If HLL counts k-mer matches, the result shouldn't be 2925637, because my read is 1628 bp long and the reference is about 3000000 bp.

My goal is to count the k-mer matches between the read and the reference. Does HLL count matches between k-mers, or some other kind of match?

Best,

Ahmad

dnbaker commented 3 years ago

dashing hll simply computes the cardinality of all sequences provided to it, which I don't think is what you want.

If you want to know how many unique k-mers overlapped, then you'd compute dashing cmp -k15 -p2 --sizes read.fastq reference.fasta or dashing cmp --wj-exact -k15 -p2 --sizes read.fastq reference.fasta.

--sizes means it emits the number of unique k-mers in the intersection, and --wj-exact means it emits the total number of k-mers, not the unique number of k-mers that overlap. Does that help?
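To make the quantities in this answer concrete, here is a minimal Python sketch that computes them exactly from explicit k-mer sets. It is not how Dashing works internally (Dashing estimates these values from sketches). The function names `kmers` and `compare` and the toy sequences are made up for illustration, and the sketch assumes plain forward-strand k-mers with no reverse-complement handling.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers (substrings of length k) in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compare(read, reference, k):
    """Exact set statistics for two sequences (hypothetical helper)."""
    a = kmers(read, k)
    b = kmers(reference, k)
    shared = a & b   # unique k-mers present in both inputs
    union = a | b    # unique k-mers present in either input
    return {
        "read_kmers": len(a),
        "reference_kmers": len(b),
        "union": len(union),    # the kind of cardinality `dashing hll` estimates
        "shared": len(shared),  # the kind of count `cmp --sizes` estimates
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy data: the read is a substring of the reference, so every k-mer
# of the read also occurs in the reference.
reference = "ACGTACGTTAGCCGATAGGCTTAACGGATC"
read = reference[5:20]
print(compare(read, reference, k=5))
```

When the read is fully contained in the reference, the shared count equals the number of unique k-mers in the read, and the union equals the number of unique k-mers in the reference. That is why a union cardinality close to the reference size says nothing about how well the read matches.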

lutfia95 commented 3 years ago

That helps, thanks. I also have a question, to be sure I interpret the output correctly:

./dashing cmp -k31 -p2 --sizes read.fastq reference_.fasta

Path	Size (est.)
reference.fasta	2824048
read.fastq	1623

Names	reference_.fasta	read.fastq
reference.fasta	-	1623.46
oneread.fastq	-	-


2824048: the number of k-mers in my reference.

1623: the number of k-mers in my read.

1623.46: is this the total number of k-mers that overlap? I am not sure about this number.

If I run:

./dashing cmp --wj-exact -k31 -p2 read.fastq reference.fasta

the last output is 0.00054983. Could you please explain what both outputs mean? Thanks,

dnbaker commented 3 years ago

Hi, the first one means that, by its estimate, the smaller sequence is almost entirely contained in the larger sequence. The second command line says that about 0.05% of the k-mers in the union are shared. If you were to add --sizes to it, it would emit something close to 1623 (1623 / 2824048 ~= .0005).

--sizes causes the number emitted to be an approximate number of shared k-mers, while the default is Jaccard similarity (the fraction of k-mers in the union that are shared).
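The two output modes are related by simple set arithmetic: with J the Jaccard similarity, I the intersection size, and A and B the two set sizes, J = I / (A + B - I), so I = J * (A + B) / (1 + J). The following Python sketch converts between them using the numbers from this thread. The helper names are hypothetical, and because the reported values are sketch-based estimates, the conversions will only agree approximately.

```python
def jaccard_from_intersection(inter, size_a, size_b):
    """J = I / |A union B|, where |A union B| = A + B - I."""
    return inter / (size_a + size_b - inter)


def intersection_from_jaccard(jaccard, size_a, size_b):
    """Invert J = I / (A + B - I) to get I = J * (A + B) / (1 + J)."""
    return jaccard * (size_a + size_b) / (1.0 + jaccard)


# Estimated set sizes reported above for the reference and the read.
ref_size, read_size = 2824048, 1623

# Jaccard similarity implied by the reported intersection estimate.
print(jaccard_from_intersection(1623.46, ref_size, read_size))

# Intersection size implied by the reported Jaccard value.
print(intersection_from_jaccard(0.00054983, ref_size, read_size))
```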

If you want to get rid of the randomness from the sketch, you can pass --use-full-khash-sets.
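For readers wondering where the sketch's approximation error comes from (and why an estimate such as 1623.46 is not an integer), here is a textbook-style HyperLogLog cardinality estimator in Python. This is only an illustration of the general technique and an assumption on my part, not Dashing's implementation, which may use different hashing and estimators. The function name `hll_estimate` and its parameter `p` (log2 of the number of registers) are chosen for this example and are not Dashing options.

```python
import hashlib
import math


def hll_estimate(items, p=10):
    """Estimate the number of distinct strings in items with 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        # Derive a 64-bit hash value from the item.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - p)                  # top p bits select a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)         # bias correction, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction (linear counting).
        return m * math.log(m / zeros)
    return raw


items = [f"kmer{i}" for i in range(50000)]
print(hll_estimate(items))          # an estimate near 50000, not exactly 50000
print(hll_estimate(items + items))  # duplicates do not change the estimate
```

The estimate depends on the hash values and on the number of registers, which is the source of the error that exact k-mer sets avoid at the cost of more memory.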