cannot find all the seeds

fataltes commented 7 years ago

Hi Hongzhe,

I tried to run your tool on a standard reference human genome data set (GRch37).

This is the command I used to run indexing process:
./deBGA index -k 27 GRch37.fa <index directory>

First of all, size of the index files we have at the end, is almost half of what you reported for k=22, while we expect it to be even bigger for k=27.

Total size of index output files for k=27 on GRch37 : 20G
The size you reported for k=22 on GRch37 : 40.32G In addition to that, based on my understanding of the index structure explained in the paper and partially looking at the code, we expect for example just the size of the hash table that keeps Unitig ID and offset for each kmer to be at least as big as <# of kmers * (hashed key size + log(Unitig ID) + log (max offset) )> bits which is around ~20Gs for ~2.7B distinct kmers in ~34M Unitigs for GRch37 which is about 15Gs right now.

And this is the one I used to validate: ./deBGA aln -k 27 -l 2048 <index directory> trimmed_GRch37_genomes.fq <output sam file> I queried for all the reads containing just valid continuous kmers from the reference and expected to find all of them, but instead I could almost find half of the seeds (I put a print for all found and not found seeds in seed_ali_core.c). Also, the resulting alignment output didn't contain alignments for all reads.

So based on the index size and the result I get, my interpretation is that the deBGA stores half of the kmers in indexing.

I've compiled the code with the default parameters and didn't add any options.

I wonder if I interpreted the indexing data structure wrong and you probably don't keep exact offsets or presence condition for all kmers or if I'm missing any configuration. So first of all, could you refer me to the data set that you bench marked your results on (There was no google drive link in the readMe file) and also let me know if I need to compile or run the code with a specific set of options.

Thank you,

HongzheGuo commented 7 years ago

Thank you so much for using deBGA. I am so sorry for my late reply and I was quite busy recently. I run the deBGA index -k 27 for GRch37 reference .fa file and I got the actual size of index is about 34G and the size (in Bytes) of each index file is in the followings: 2544 N.sta 773924008 ref.seq 409 unipath.chr 2147483656 unipath_g.hash 9873706980 unipath_g.kmer 19747413960 unipath_g.offset 2784571712 unipath.pos 327482472 unipath.posp 883187712 unipath.seqb 327482472 unipath.seqfb 80 unipath.size

The index size for GRch37 with k = 22 should be around 40GB since there always more positions info offset info to store for one uni-path for a shorter k-mer. Also the index size should not be quite different between the k-mer length within the range of 20 - 28. It should be noted that the actual memory usage during alignment is often more than the index size since deBGA needs some auxiliary variables to execute the alignment.

deBGA builds index for the reference only in one direction but not to build it for its reverse complementary strain. During the process of alignment, deBGA does the alignment for each pair-end read in forward direction and in reverse complementary direction, hence there should be four times alignment for every two pair-end reads.

deBGA won't support the single-end read mapping anymore, such as the 454

Hope it can help you.

Best,

Hongzhe

fataltes commented 7 years ago

Thanks for your reply,

I understand your point about the difference in index size for k=22 and k=27, however, still I'm not able to regenerate the same index with the same size that you say. That'll be great if you could please link me to the data set you are using and also give me the exact commands you run so that I can regenerate your test?

Also, as you mentioned there are four alignment sections in the code that I put prints in all four sections and still total number of seeds that are found is half of total number of actual seeds. As I said before, we use reads that we know fully map to the reference and hence expect any selected seed (kmer) out of it to be found. But still half of the seeds are not found. The only thing I did is that I added one additional line of code in each of the four sections to increment number of found seeds if the seed is found.

So do you have any idea what could go wrong or is it possible for you to give us a version of the code that actually counts total number of found and not-found seeds in the read set?

Thank you, Fatemeh

HongzheGuo / deBGA

cannot find all the seeds #3