Improve behavior for progressive mode repeat/subtractive filter

bcgsc / biobloom

Create Bloom filters for a given reference and then use it to categorize sequences

GNU General Public License v3.0


False positives in the subtractive filter can cause false recruitment terminations. This is not very evident with small -r values (0.1-0.3), but it should affect large -r values and cases where -r is an integer.

Why this is bad: we will obliterate any k-mers that fall within 1/(FPR) bp. For Kollector, if we have a large genic space and use a stringent population strategy (large -r values), it will cause an abrupt termination of k-mer recruitment every 1/(FPR) bases. We may compensate by bridging across these gaps using paired-end information, but I suspect these falsely terminating k-mers inhibit filling of the gaps and promote some off-target behavior.
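To make the 1/(FPR) spacing concrete, here is a minimal simulation sketch. It assumes each k-mer position is an independent false positive in the subtractive filter with probability equal to the FPR, which is a simplification. The function names and the example FPR of 0.01 are illustrative and are not taken from biobloom.

```python
import random


def expected_run_length(fpr):
    """Mean number of k-mer positions up to and including the first false hit."""
    return 1.0 / fpr


def simulate_run_lengths(fpr, trials, seed=0):
    """Simulate how many consecutive k-mers are seen before a false positive.

    Assumption: every k-mer is a false positive independently with
    probability `fpr`, so run lengths follow a geometric distribution.
    """
    rng = random.Random(seed)
    lengths = []
    for _ in range(trials):
        n = 1
        # Keep extending the run until a false positive occurs.
        while rng.random() >= fpr:
            n += 1
        lengths.append(n)
    return lengths


fpr = 0.01  # hypothetical subtractive filter FPR
runs = simulate_run_lengths(fpr, trials=20000)
mean_run = sum(runs) / len(runs)
print(f"expected: {expected_run_length(fpr):.1f}, simulated: {mean_run:.1f}")
```

Under these assumptions the simulated mean run length should sit close to 1/(FPR), so a lower FPR only spreads the false terminations further apart and does not remove them.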

A better scheme would be to:

store the subtractive filter as an exact data structure (hash table), or
change the scoring algorithm in progressive mode, or
treat repeat filter hits more leniently (not as non-matches if they are near non-repeat hits).
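As a rough sketch of the third option, the function below treats a repeat filter hit as a match when it is flanked by non-repeat hits. This is only one possible reading of "near": it requires a clean hit on both sides within a small window, and both that rule and the `window` parameter are assumptions. The function name and inputs are hypothetical and do not reflect biobloom's actual code.

```python
def lenient_matches(in_target, in_repeat, window=1):
    """Decide per k-mer whether it counts as a match.

    in_target[i]: k-mer i hits the main (recruitment) filter.
    in_repeat[i]: k-mer i hits the repeat/subtractive filter.
    A repeat hit is forgiven only if a clean hit (target, not repeat)
    lies within `window` positions on BOTH sides (an assumed rule).
    """
    n = len(in_target)
    clean = [t and not r for t, r in zip(in_target, in_repeat)]
    result = []
    for i in range(n):
        if clean[i]:
            result.append(True)
        elif in_target[i] and in_repeat[i]:
            left = any(clean[max(0, i - window):i])
            right = any(clean[i + 1:i + 1 + window])
            result.append(left and right)
        else:
            result.append(False)
    return result


# An isolated repeat hit (likely a false positive) is forgiven.
isolated = lenient_matches(
    [True, True, True, True, False],
    [False, True, False, False, False],
)

# A run of repeat hits (likely a real repeat) is still rejected.
repeat_run = lenient_matches(
    [True, True, True, True, True],
    [False, True, True, True, False],
)
```

With `window=1`, the isolated hit is kept while the three-k-mer repeat run is still excluded. A larger window forgives longer runs, at the cost of letting more genuine repeat sequence through.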


Improve behavior for progressive mode repeat/subtractive filter #23