eaasna / valik

Local sequence similarity search tool
BSD 3-Clause "New" or "Revised" License

Debug bucket borders #111

Open eaasna opened 1 month ago

eaasna commented 1 month ago

Use lib/seqan from https://github.com/eaasna/seqan/pull/1

This PR investigates a segmentation fault that occurs when searching the human reference genome for matches to the mouse reference genome. The memory error arises because one of the buckets in the QGramDir is defined such that bucketBegin > bucketEnd. For each k-mer, the QGramDir stores its bucketBegin index, which points into the QGramSA, where the positions of that k-mer are stored. The bucketEnd is inferred from the bucketBegin of the next k-mer.

Because the SWIFT index uses open addressing, two hash functions are applied to each k-mer. The value of the first hash function (e.g. hash1(AA) = 14) is stored in the BucketMap at the position given by the second (e.g. hash2(hash1(AA)) = hash2(14)). K-mer lookups probe the BucketMap until a matching hash value or an empty bucket is found.
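To make the lookup scheme concrete, here is a minimal Python sketch of open addressing with two hash functions. It is not the SeqAn implementation: `hash1` (a 2-bit encoding), `hash2` (a modulo), the `EMPTY` sentinel, the linear probing step and the name `request_bucket` are all assumptions chosen for illustration, so the hash values differ from the example above.

```python
EMPTY = -1  # sentinel for an unused BucketMap slot (assumption)


def hash1(kmer):
    """2-bit encode a k-mer over ACGT (illustrative, not the SeqAn hash)."""
    code = 0
    for base in kmer:
        code = code * 4 + "ACGT".index(base)
    return code


def hash2(value, table_size):
    """Map a hash1 value to its first candidate slot (illustrative)."""
    return value % table_size


def request_bucket(bucket_map, value):
    """Probe until `value` or an empty slot is found; claim an empty slot."""
    size = len(bucket_map)
    slot = hash2(value, size)
    for _ in range(size):
        if bucket_map[slot] == value:
            return slot  # k-mer already has a bucket
        if bucket_map[slot] == EMPTY:
            bucket_map[slot] = value  # first occurrence: claim the slot
            return slot
        slot = (slot + 1) % size  # linear probing (assumption)
    raise RuntimeError("bucket map is full")


bucket_map = [EMPTY] * 5
slot_ac = request_bucket(bucket_map, hash1("AC"))  # hash1 = 1, lands in slot 1
slot_cg = request_bucket(bucket_map, hash1("CG"))  # hash1 = 6, collides, moves on
```

The slot returned for a k-mer is the bucket number that would then index the QGramDir, so a repeated lookup of the same k-mer has to return the same slot.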

The QGramDir is built in two steps:

  1. count k-mers
  2. calculate the cumulative sum

After this, the suffix array is built.
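The construction order described above can be sketched in Python as follows. This is a simplified illustration, not the SeqAn code: it uses direct addressing (one bucket per possible k-mer) instead of the BucketMap, and the names `encode`, `build_qgram_index` and `invalid_buckets` are hypothetical. The directory gets one extra trailing entry so that the lexicographically last k-mer also has a bucketEnd.

```python
def encode(kmer):
    """2-bit encode a k-mer over ACGT (illustrative)."""
    code = 0
    for base in kmer:
        code = code * 4 + "ACGT".index(base)
    return code


def build_qgram_index(text, k):
    codes = [encode(text[i:i + k]) for i in range(len(text) - k + 1)]

    # Step 1: count k-mers.
    counts = [0] * (4 ** k)
    for code in codes:
        counts[code] += 1

    # Step 2: cumulative sum, so directory[c] is bucketBegin of k-mer c
    # and directory[c + 1] is its bucketEnd.
    directory = [0] * (4 ** k + 1)
    for code, count in enumerate(counts):
        directory[code + 1] = directory[code] + count

    # Afterwards: fill the suffix array with the k-mer positions.
    next_free = directory[:-1]  # slicing copies, directory stays intact
    qgram_sa = [0] * len(codes)
    for pos, code in enumerate(codes):
        qgram_sa[next_free[code]] = pos
        next_free[code] += 1
    return directory, qgram_sa


def invalid_buckets(directory):
    """Return bucket numbers whose bucketBegin exceeds their bucketEnd."""
    return [c for c in range(len(directory) - 1)
            if directory[c] > directory[c + 1]]


directory, qgram_sa = build_qgram_index("ACGTAC", 2)
begin, end = directory[encode("AC")], directory[encode("AC") + 1]
```

A check like `invalid_buckets` could be run on a directory right after the cumulative sum to locate a bucket with bucketBegin > bucketEnd before the suffix array is filled.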

It is unclear why the QGramDir is sometimes faulty, but it seems to be triggered by the lexicographically last 32-mer TTTTT....