aindj / k-SLAM

k-SLAM ultra fast alignment and taxonomic classification of metagenomic datasets
GNU General Public License v3.0

0 k-mers overlap [bug] #19

Open g1o opened 6 years ago

g1o commented 6 years ago

My database contains lots of genomes (over 15k bacterial genomes and all fungi from GenBank), so a result with 0 overlaps is unlikely. I had also already run CLARK and Kaiju, and both return good results. Here is the log:

[t = 0.00s] Performing metagenomic analysis
[t = 0.00s] Building taxonomy index
[t = 2.78s] Built a taxonomy tree with 1664587 nodes
[t = 2.78s] Building index from serial file /home/giovannimc/genomas_and_databases/k-SLAM//database
[t = 1060.34s] Getting reads from FASTQ files A_1.fq and A_2.fq
[t = 1075.66s] 2000000 reads
[t = 1089.63s] 4000000 reads
[t = 1089.63s] Aligning reads to database using k = 32
[t = 1089.63s] Getting k-mers from reads
[t = 1092.50s] Obtained 476000000 k-mers from reads
[t = 1092.50s] Getting k-mers from index
[t = 1137.07s] Obtained 8984748392 k-mers from index
[t = 1137.07s] Sorting k-mers
[t = 1225.81s] Finding overlaps
[t = 4852.00s] Found 0 k-mer overlaps
[t = 4854.00s] Performing pairwise Smith-Waterman
[t = 4854.04s] Screening all alignments with score < 0.000000
[t = 4854.04s] Screened 0 overlaps
[t = 4854.04s] Pairing alignments
[t = 4854.09s] Getting per read overlaps
[t = 4854.09s] 0 entries have k-mer overlaps
[t = 4854.09s] Calculating insert size distribution
[t = 4854.09s] Screening all alignment pairs with insert size >= 4294967295
[t = 4854.09s] Screening all 0 alignment pairs by score
[t = 4854.09s] Screened 0 overlaps
[t = 4854.09s] Performing a pseudo-assembly
[t = 4854.09s] Screening all 0 alignment pairs by score
[t = 4854.09s] Screened 0 overlaps
[t = 4854.09s] Converting alignments to metagenomic results
[t = 4854.09s] Processed 2000000 reads
[t = 4855.91s] Getting reads from FASTQ files A_1.fq and A_2.fq
[t = 4869.41s] 2000000 reads
[t = 4883.05s] 4000000 reads
[t = 4883.05s] Aligning reads to database using k = 32
[t = 4883.05s] Getting k-mers from reads
[t = 4885.81s] Obtained 476000000 k-mers from reads
[t = 4885.81s] Getting k-mers from index
[t = 4928.34s] Obtained 8984748392 k-mers from index
[t = 4928.34s] Sorting k-mers
[t = 5003.33s] Finding overlaps
[t = 8718.98s] Found 0 k-mer overlaps
[t = 8720.88s] Performing pairwise Smith-Waterman
[t = 8720.92s] Screening all alignments with score < 0.000000
[t = 8720.92s] Screened 0 overlaps
[t = 8720.92s] Pairing alignments
[t = 8720.96s] Getting per read overlaps
[t = 8720.96s] 0 entries have k-mer overlaps
[t = 8720.96s] Calculating insert size distribution
[t = 8720.96s] Screening all alignment pairs with insert size >= 4294967295
[t = 8720.96s] Screening all 0 alignment pairs by score
[t = 8720.96s] Screened 0 overlaps
[t = 8720.96s] Performing a pseudo-assembly
[t = 8720.96s] Screening all 0 alignment pairs by score
[t = 8720.96s] Screened 0 overlaps
[t = 8720.96s] Converting alignments to metagenomic results
[t = 8720.96s] Processed 4000000 reads
[t = 8722.70s] Getting reads from FASTQ files A_1.fq and A_2.fq
[t = 8735.55s] 2000000 reads
[t = 8749.49s] 4000000 reads
[t = 8749.49s] Aligning reads to database using k = 32
[t = 8749.49s] Getting k-mers from reads
[t = 8752.46s] Obtained 476000000 k-mers from reads
[t = 8752.46s] Getting k-mers from index
[t = 8794.34s] Obtained 8984748392 k-mers from index
[t = 8794.34s] Sorting k-mers
[t = 8875.47s] Finding overlaps
[t = 12633.03s] Found 0 k-mer overlaps
[t = 12635.21s] Performing pairwise Smith-Waterman
[t = 12635.27s] Screening all alignments with score < 0.000000
[t = 12635.27s] Screened 0 overlaps
. . .

There are a few points here:

The number of k-mers extracted from the FASTQ files is exactly the same for every batch. It feels like it is reading the headers of the reads and not the sequences; otherwise, how could it get exactly the same number of k-mers for all the 2M batches? It may be some other bug, but I have no idea what it could be.

Also, I previously tried running it with gzipped FASTQ files, and it accepted them. However, it does not know how to deal with them, and SLAM does not check that the input is in FASTQ format; the result was the same 0 overlaps. It needs to check the FASTQ format.
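Two quick checks may help narrow this down. First, the identical k-mer count per batch is not necessarily suspicious: a read of length L contains L - k + 1 overlapping k-mers, so if every read is 150 bp (an assumption, since the read length is not stated above) and k-SLAM takes one k-mer per position (also an assumption), each read gives 150 - 32 + 1 = 119 k-mers, and 4000000 reads give exactly 476000000, the number in the log. Second, the input check asked for above could look roughly like the Python sketch below. The function names are hypothetical and are not part of k-SLAM, and the check assumes the common four-line FASTQ layout with no wrapped sequence lines.

```python
import gzip
import io
from itertools import islice

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def kmers_per_read(read_length, k):
    """Number of overlapping k-mers in one read (assumes one k-mer per position)."""
    return max(read_length - k + 1, 0)


def open_fastq(stream):
    """Wrap a seekable binary stream as ASCII text, decompressing gzip input."""
    head = stream.read(2)
    stream.seek(0)
    if head == GZIP_MAGIC:
        stream = gzip.GzipFile(fileobj=stream)
    return io.TextIOWrapper(stream, encoding="ascii")


def check_fastq(stream, max_records=1000):
    """Validate the first records of a four-line FASTQ stream.

    Returns the number of records checked and raises ValueError on the
    first malformed record (non-ASCII input also raises a ValueError subclass).
    """
    text = open_fastq(stream)
    checked = 0
    while checked < max_records:
        record = [line.rstrip("\r\n") for line in islice(text, 4)]
        if not record:
            break  # clean end of input
        n = checked + 1
        if len(record) < 4:
            raise ValueError(f"record {n}: truncated, expected 4 lines")
        header, seq, plus, qual = record
        if not header.startswith("@"):
            raise ValueError(f"record {n}: header does not start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"record {n}: separator line does not start with '+'")
        if len(seq) != len(qual):
            raise ValueError(f"record {n}: sequence and quality lengths differ")
        checked += 1
    return checked


if __name__ == "__main__":
    # Arithmetic check against the log, assuming 150 bp reads and k = 32.
    print(4000000 * kmers_per_read(150, 32))
    # A_1.fq is the file name from the log; adjust the path as needed.
    with open("A_1.fq", "rb") as handle:
        print(check_fastq(handle), "records look well formed")
```

Running `check_fastq` on both input files before starting k-SLAM would at least show whether the reads are plain, well-formed FASTQ, or whether a gzipped or otherwise unexpected file is being passed in.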

biofuture commented 6 years ago

I have a similar problem. Quite strange! I hope the author can answer this question soon!