TravisWheelerLab / AvxWindowFmIndex

A fast, AVX2 and ARM Neon accelerated FM index library
BSD 3-Clause "New" or "Revised" License

0 counts for k-mer generated from same sequence as used in FM-index #41

Closed: EricR86 closed this issue 4 weeks ago

EricR86 commented 1 month ago

Hello,

I came across an odd bug. I created an index for mm39. I then generated a k-mer sequence from mm39 chr19: 'GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG'

While the k-mer is very repetitive, the awFmParallelSearchCount function returns a count of 0 for it in the search list.

I can confirm the sequence exists inside chr19:

From lines 766498 onward (refseq: NC_000085.7):

                                                                        *
TTCACAGGAATACCCCACTCTGCTGGTACCAATTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG
TTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTT
AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG
GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG

I marked the beginning of the sub-sequence with a *.

I am on the latest commit, but I have not regenerated the index since commit 71e7dd6. I found 5 of these in total across the entire genome, all very long k-mers of length 250-ish or more. This is just the example I managed to track down.

EricR86 commented 1 month ago

It is worth noting that adding a letter to the problematic k-mer sequence results in a count of 223; adding another letter gives a count of 553055957 (!). Removing a letter from the problematic k-mer sequence results in a count of 226.
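For anyone wanting to cross-check counts like these independently of the index, a brute-force scan is one option. The following is a minimal sketch, not part of AvxWindowFmIndex: `count_occurrences` is a hypothetical helper that counts overlapping occurrences of a pattern with `strstr`, and it assumes the reference sequence has already been loaded into memory as a single NUL-terminated string with the FASTA line breaks removed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of AvxWindowFmIndex): counts every
 * occurrence of `pattern` in `text`, including overlapping ones, by
 * restarting the search one character after each hit. */
size_t count_occurrences(const char *text, const char *pattern) {
    if (pattern[0] == '\0') {
        return 0; /* treat the empty pattern as having no occurrences */
    }
    size_t count = 0;
    const char *hit = strstr(text, pattern);
    while (hit != NULL) {
        count++;
        hit = strstr(hit + 1, pattern); /* step by one to allow overlaps */
    }
    return count;
}
```

This is far too slow to replace an FM index on a whole genome, but on a single chromosome it can serve as a sanity check for one suspicious k-mer.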

Sawwave commented 1 month ago

Thanks for the bug report. I'm currently wrapping up my Ph.D. dissertation, so my time is limited, but I'll address this as soon as I can.

Sawwave commented 4 weeks ago

Now that I have time to look at this, I've confirmed that I can reproduce the error. I'm working to identify the problem, and I hope to have this fixed soon.

Sawwave commented 4 weeks ago

I've solved this issue with merge #42. Some query position variables were declared as uint8_t, which cannot hold values above 255, so they failed on queries longer than 255 characters. These variables have been updated to uint32_t, so they should handle even unreasonably long queries.
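To illustrate the class of bug, here is a minimal sketch of what happens when a query position is stored in a uint8_t. The function names `narrow_position` and `wide_position` are hypothetical and are not taken from the library; the sketch only shows the truncation itself, not how the search code actually used these variables.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration (not library code): a position stored in
 * 8 bits keeps only the value modulo 256. */
uint8_t narrow_position(size_t query_length) {
    return (uint8_t)query_length;
}

/* The same position stored in 32 bits is preserved for any query
 * shorter than 2^32 characters. */
uint32_t wide_position(size_t query_length) {
    return (uint32_t)query_length;
}
```

With these definitions, a length of 255 survives the narrow conversion, but 256 becomes 0 and 257 becomes 1, whereas the 32-bit version returns the lengths unchanged. That kind of wraparound would be consistent with counts changing sharply once a k-mer reaches the 250-ish range.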

Thanks again for reporting this error!