dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0
160 stars 11 forks source link

compare with HLL #61

Open lutfia95 opened 3 years ago

lutfia95 commented 3 years ago

Hi, I have a question about the output from ./dashing cmp –k10 -p5 -C -S24 --wj-exact reference_.fasta reads.fastq The output i.e. 0.0005 (says that 0.05% of the k-mers in the union are shared)

I was surprised as I used the same command-line with different sizes of k-mers, the number of shared k-mers are increasing, but actually should decreasing. The results is:

k-mer size = 9 --> 06.66655 % are shared k-mer size = 10 --> 24.9956 % are shared k-mer size = 11 --> 60.9045 % are shared k-mer size = 12 --> 97.622 % are shared k-mer size = 13 --> 76.7513 % are shared k-mer size = 14 --> 39.642 % are shared

I am wondering that the number of shared k-mers between 9 and 12 (k-mer size) are increasing, but actually I expected that the number of shared k-mers will be at 9 the maximum and then after 9 will start decreasing.

Dataset: the fastq file has many reads from Listeria and the fasta file has the reference genome human.

As I was running the program with Listeria reads and reference genome from Listeria I had the expected results: by k-mer size 9 was the maximum number of shared k-mer then the number of shared k-mers started decreasing.

Could you please explain it to me, why the number of shared k-mers are increasing between k-mer size 9 and 12?

Thanks, Ahmad

dnbaker commented 3 years ago

Hi Ahmad,

Thanks for checking in on this, but I suspect you're running into something a little interesting: both the numerator and the denominator are changing as a function of k.

If you add the additional flag of --sizes, you'll get to see how the set sizes change, in the intersection and for the input sequences. My guess is that the denominator is changing substantially with these fairly small k.

Let me know what you find, and thanks for making the issue.

Best,

Daniel

lutfia95 commented 3 years ago

Hi Daniel,

I have another question to understand the output exactly, so let`s back to the command-line:

./dashing cmp –k10 -p5 -C -S24 --wj-exact reference_.fasta reads.fastq The output is:

Path Size (est.)

Listeria.fastq 1048584 Listeria_monocytogenes_ATCC19115.fasta 888344

Names Listeria.fastq Listeria_monocytogenes_ATCC19115.fasta

Listeria.fastq - 0.847185 Listeria_monocytogenes_ATCC19115.fasta - -

And if I run: ./dashing cmp –k10 -p5 -C -S24 --wj-exact --sizes reference_.fasta reads.fastq The output is:

Path Size (est.)

Listeria.fastq 1048584 Listeria_monocytogenes_ATCC19115.fasta 888344

Names Listeria.fastq Listeria_monocytogenes_ATCC19115.fasta

Listeria.fastq - 888345 Listeria_monocytogenes_ATCC19115.fasta - -

Could you please explain to me, what are the numbers 888344, 1048584 and 888345 meaning?

Best, Ahmad

lutfia95 commented 3 years ago

I think that I understood the output, so the output of my test is:

./dashing cmp -k10 -p5 -C -S24 --sizes reads.fastq reference.fasta

Path Size (est.)

reads.fastq 1048218 reference.fasta 767497

Names reads.fastq reference.fasta .fasta

reads.fastq - 767491 reference.fasta.fasta - -

Explanation: 1048218 is the size of reads set ( IAI ) 767497 is the size of reference set ( IBI ) 767491 not sure about it! ./dashing cmp -k10 -p5 -C -S24 --wj-exact reads.fastq reference.fasta

Path Size (est.)

reads.fastq 1048584 reference.fasta 888344

Names reads.fastq reference.fasta

reads.fastq - 0.847185 reference.fasta - -

Explanation: 1048584 is the union size 888344 is the intersection size 0.847185 is the jaccard similarity (888344/1048584)

correct?

Thanks!

best, Ahmad

dnbaker commented 3 years ago

Hi Ahmad,

Sure! First, Dashing emits the cardinalities of the input sequences, and then it emits pairwise similarities.

Above, 1048218 is the estimated number of k-mers in the read set, while 767497 is the number of k-mers in the fasta and 767491 is the number of k-mers shared between the read and the reference. This is because --sizes was added, so the emitted results are intersection sizes (= shared k-mers).

Without --sizes, it calculates the fraction of shared k-mers over total k-mers. (|A AND B|) / (|A OR B|). Intersection is the number of k-mers which are in both sets.

Hope it helps, and feel free to ask any more questions,

Daniel

dnbaker commented 3 years ago

One more thing to add:

The union and intersection are related.

|A OR B| = |A| + |B| - |A & B|

the emitted quantity for the pair is |A & B| for --sizes, and |A & B| / (|A OR B|) for the Jaccard. We estimate |A|, |B|, and |A OR B| to compute |A & B|.

lutfia95 commented 3 years ago

Thanks for the answers, I am trying now to run with --containment-index because I want to have the containment score to the reads set. I would like to ask about the output from it, I am running with the same command line: ./dashing cmp -k10 -p5 -C -S24 --containment-index reads.fastq reference.fasta output: reads.fastq 1048218 reference.fasta 767497 reads.fastq 1048218 reference.fasta 767497 reads.fastq 1.000000 0.999992 reference.fasta 0.732186 1.000000

we know that 1048218 is the number of k-mers in reads set and 767497is the number of k-mers in reference set. The containment score is counted by: (|A & B| / |A|) Can you please explain to me, what are the number0.999992,0.732186 and 1.000000 mean?

Best

Ahmad

lutfia95 commented 3 years ago

I think that the value 0.999992 is the answer from containment score, where |A| is the reads set. correct?

Thanks

best Ahmad

dnbaker commented 3 years ago

Hi, sorry for the delay on responding.

In this case, then 0.999992 is the containment score in one direction, while the containment score in the other direction was .73. This is because 73% of the k-mers in one set were in the other set, but in the opposite direction, nearly all k-mers were present in both, which I wouldn't usually expect unless the k was small, which it is in your case. If you want better discrimination, try k = 16 or 20.

lutfia95 commented 3 years ago

Thanks for the answer, do you mean with one direction, when I take (|A & B| / |A|), where |A| is the reads set? And the other direction will be, where |A| is the reference set ?

dnbaker commented 3 years ago

For the containment problem, it computes containment in both directions: |A & B| / |A|, and |A & B| / |B| is the other, meaning how many of k-mers from the reference were in the read set. In this case, only 70% of the k-mers from the reads were contained in the reference k-mer set, but nearly all the k-mers in the reference were in the sequencing data.

lutfia95 commented 3 years ago

Exactly that is what I thought, the 73% is the containment score from reads( 73% k-mers from the read set are contained in the reference set) that makes sense for me. Thank you very much for your help.

Best Ahmad