Open lutfia95 opened 4 years ago
dashing hll simply computes the cardinality of all sequences provided to it, which I don't think is what you want.
If you want to know how many unique k-mers overlapped, then you'd run dashing cmp -k15 -p2 --sizes read.fastq reference.fasta or dashing cmp --wj-exact -k15 -p2 --sizes read.fastq reference.fasta.
--sizes means it emits the number of unique k-mers in the intersection, and --wj-exact means it emits the total number of k-mers, not the unique number of k-mers that overlap. Does that help?
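To make the distinction concrete, here is a minimal exact (non-sketch) illustration in Python. It is not Dashing's implementation: the function names and toy sequences are made up for this example, and it ignores reverse complements and ambiguous bases. It only shows how the cardinality of all inputs taken together (what the reply says dashing hll estimates) differs from the number of unique shared k-mers.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compare(read, reference, k):
    """Exact set statistics for two sequences (hypothetical helper)."""
    a = kmers(read, k)
    b = kmers(reference, k)
    shared = a & b   # unique k-mers present in both inputs
    union = a | b    # unique k-mers present in either input
    return {
        "union": len(union),            # cardinality of everything combined
        "intersection": len(shared),    # number of unique shared k-mers
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy data: the read is a substring of the reference.
reference = "ACGTACGTTGCA"
read = "CGTTGC"
result = compare(read, reference, 4)
print(result)
```

With a short read fully contained in a long reference, the union is dominated by the reference, so a combined cardinality says almost nothing about the overlap.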
That helps, thanks. I also have a question, to be sure I interpret the output correctly:
./dashing cmp -k31 -p2 --sizes read.fastq reference_.fasta
reference.fasta 2824048
read.fastq 1623
reference.fasta - 1623.46
oneread.fastq - -
2824048: the number of k-mers in my reference.
1623: the number of k-mers in my read.
1623.46: is this the total number of k-mers that overlap? I am not sure about this number.
If I run:
./dashing cmp --wj-exact -k31 -p2 read.fastq reference.fasta
the last output is 0.00054983. Could you please explain what both outputs mean? Thanks,
Hi,
The first one means that by its estimate, the smaller sequence is almost entirely contained in the larger sequence.
The second command line says that about 0.05% of the k-mers in the union are shared. If you were to add --sizes to it, it would emit something close to 1623 (1623 / 2824048 ~= 0.0005).
--sizes causes the number emitted to be an approximate number of shared k-mers, while the default is Jaccard similarity (the fraction of k-mers in the union that are shared).
If you want to get rid of the randomness from the sketch, you can use --use-full-khash-sets.
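The relationship between the two kinds of output can be sketched with the numbers from this thread. This is plain set arithmetic, not Dashing code, and the function names are hypothetical. It assumes the ordinary (unweighted) Jaccard definition, |A ∩ B| / |A ∪ B|. Since --wj-exact is described above as counting k-mers with multiplicity, its reported value need not match this set-based figure exactly.

```python
def jaccard_from_sizes(size_a, size_b, intersection):
    """Jaccard similarity from two set sizes and their intersection size."""
    union = size_a + size_b - intersection
    return intersection / union


def intersection_from_jaccard(size_a, size_b, jaccard):
    """Invert the formula above: recover the intersection size from J."""
    return jaccard * (size_a + size_b) / (1.0 + jaccard)


# Values reported in the thread (estimates from the sketch).
ref_size = 2824048     # estimated unique k-mers in the reference
read_size = 1623       # estimated unique k-mers in the read
shared = 1623.46       # estimated intersection size from --sizes

jaccard = jaccard_from_sizes(ref_size, read_size, shared)
containment = shared / read_size  # fraction of the read's k-mers found in the reference

print(f"Jaccard     ~ {jaccard:.6f}")
print(f"Containment ~ {containment:.4f}")
print(f"Round trip  ~ {intersection_from_jaccard(ref_size, read_size, jaccard):.2f}")
```

The containment slightly above 1 reflects estimation error in the sketch, which is consistent with the reply that the read is almost entirely contained in the reference.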
Hi,
I have a question about the output of HLL when I run Dashing with HyperLogLog, i.e.: ./dashing hll -k15 -p2 -S24 read.fastq reference.fasta
The output from HLL is then: Estimated number of unique exact matches: 2925637.000000
Which kind of matches does HLL count? I thought it counted the k-mer matches between the inputs (read and reference). If HLL counts k-mer matches, the result shouldn't be 2925637, because my read is 1628 bp long and the reference is about 3000000 bp.
My goal is to count the k-mer matches between the read and the reference. Are the matches counted by HLL between k-mers, or are they some other kind of match?
Best,
Ahmad