bcgsc / biobloom

Create Bloom filters for a given reference and then use it to categorize sequences
http://www.bcgsc.ca/platform/bioinfo/software/biobloomtools
GNU General Public License v3.0

Improve behavior for progressive mode repeat/subtractive filter #23

Closed JustinChu closed 7 years ago

JustinChu commented 7 years ago

False positives in the subtractive filter can cause false recruitment terminations. This is not that evident when using small -r values (0.1-0.3), but it should affect any large -r value as well as integer -r values.

Why this is bad: we will obliterate any k-mers that fall within 1/(FPR) bp. For Kollector, if we have a large genic space and use stringent population strategies (large -r values), this will cause an abrupt termination of k-mers about every 1/(FPR) bases. We may compensate by bridging across these gaps using paired-end information, but I suspect these falsely terminating k-mers will inhibit the filling of these gaps, promoting some off-target behavior.
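To put rough numbers on the 1/(FPR) spacing, here is a minimal Python sketch. It assumes each k-mer lookup in the subtractive filter is an independent false positive with probability FPR, which is an idealization. The function names and the FPR value of 0.01 are illustrative only and are not taken from BioBloom Tools.

```python
def expected_gap(fpr: float) -> float:
    """Mean number of k-mer positions up to and including the first false positive.

    Under the independence assumption this is the mean of a geometric
    distribution, i.e. 1 / FPR.
    """
    if not 0.0 < fpr < 1.0:
        raise ValueError("fpr must be strictly between 0 and 1")
    return 1.0 / fpr


def prob_clean_run(fpr: float, length: int) -> float:
    """Probability that `length` consecutive k-mers all avoid a false positive."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return (1.0 - fpr) ** length


if __name__ == "__main__":
    fpr = 0.01  # illustrative value, not taken from the issue
    print(f"expected spacing between false hits: {expected_gap(fpr):.0f} k-mers")
    print(f"chance of 100 clean k-mers in a row: {prob_clean_run(fpr, 100):.3f}")
```

Under these assumptions, a large integer -r that needs a long uninterrupted run becomes increasingly unlikely to be satisfied as the required run length approaches 1/(FPR).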

A better scheme would be to:

JustinChu commented 7 years ago

Addressed in 7338f4b2b908a4527a9016f8cde6af6188141c55

Basically, I changed the scoring algorithm in progressive mode so that repeat filter hits are treated more leniently. They still do not contribute to the score when classifying; however, they no longer cause k-mer skipping (they are no longer treated like no-matches) and they do not reset the counts when using integer values for -r.

For example, if a repeat is found in the middle of a read, we no longer skip k k-mers as we would if a k-mer did not match the filter (when -r is between 0.0 and 1.0). If using an integer, we simply require a stretch of length r + n, where n is the number of repeat k-mers found in the filter.
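The integer -r case can be sketched roughly as follows. This is a simplified Python illustration of the behavior described above, not the actual code from the commit: the labels `MATCH`, `REPEAT`, `MISS` and the function `passes_integer_threshold` are hypothetical, and each k-mer is assumed to have already been looked up in both filters.

```python
from typing import Iterable

# Hypothetical per-k-mer labels, assumed to come from earlier filter lookups.
MATCH, REPEAT, MISS = "match", "repeat", "miss"


def passes_integer_threshold(hits: Iterable[str], r: int,
                             lenient_repeats: bool = True) -> bool:
    """Return True if the read contains a run of r matching k-mers.

    With lenient_repeats=True, a repeat-filter hit neither adds to the run
    nor resets it, so a passing stretch spans r + n k-mers, where n is the
    number of repeat k-mers inside it. With lenient_repeats=False, a repeat
    hit resets the run like a non-match (the behavior described as the
    problem in this issue).
    """
    if r < 1:
        raise ValueError("r must be a positive integer")
    run = 0
    for hit in hits:
        if hit == MATCH:
            run += 1
            if run >= r:
                return True
        elif hit == REPEAT and lenient_repeats:
            continue  # not counted, but does not reset the run
        else:
            run = 0  # a non-match (or a strict repeat) resets the run
    return False


if __name__ == "__main__":
    read = [MATCH, MATCH, REPEAT, MATCH]
    print(passes_integer_threshold(read, r=3))                         # lenient
    print(passes_integer_threshold(read, r=3, lenient_repeats=False))  # strict
```

In this example the read passes with lenient repeats, because the three matches span r + n = 3 + 1 = 4 k-mers, and fails when the repeat hit resets the count.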