JustinChu closed this issue 7 years ago
Addressed in 7338f4b2b908a4527a9016f8cde6af6188141c55
Basically, I changed the scoring algorithm in progressive mode so that repeat filter hits are treated more leniently: they do not contribute to the score when classifying, but they no longer cause k-mer skipping (they are no longer treated like non-matches) and no longer reset the count when using integer values for -r.
For example, if a repeat is found in the middle of a read, we no longer skip k k-mers as we would when a k-mer does not match the filter (when -r is between 0.0 and 1.0). When -r is an integer, we simply require a match length of r + n, where n is the number of repeat k-mers found in the filter.
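A minimal sketch of this scoring behavior (not the actual source, and all names here are hypothetical): under an integer -r, a miss resets the running count, while a repeat hit neither scores nor resets, so a stretch containing n repeats needs r + n k-mers overall to satisfy the threshold.

```python
# Illustrative sketch of the integer -r scoring rule described above.
# MATCH / REPEAT / MISS stand in for the three possible filter outcomes
# per k-mer; names and structure are hypothetical, for illustration only.

MATCH, REPEAT, MISS = "match", "repeat", "miss"

def classify_read(kmer_hits, r):
    """Return True if the read is recruited under integer threshold r.

    A MISS resets the running count; a REPEAT neither scores nor
    resets (and causes no k-mer skipping), so a run containing
    n repeat hits must span r + n k-mers before it passes.
    """
    score = 0
    for hit in kmer_hits:
        if hit == MATCH:
            score += 1
            if score >= r:
                return True
        elif hit == MISS:
            score = 0  # a true non-match still resets the count
        # REPEAT: no score contribution, no reset, no skipping
    return False
```

For example, `classify_read([MATCH, REPEAT, MATCH, MATCH], 3)` recruits the read (three matches over a span of r + n = 4 k-mers), while `classify_read([MATCH, MISS, MATCH, MATCH], 3)` does not, because the miss resets the count.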
False positives in the subtractive filter can still cause false recruitment terminations. This is not very evident with small -r values (0.1-0.3), but it should impact large -r values, especially integer ones.

Why this is bad: we will obliterate any k-mers that fall within 1/(FPR) bp of a false hit. For Kollector, if we have a large genic space and use stringent population strategies (large -r values), this will cause an abrupt termination of k-mers every 1/(FPR) bases. We may compensate by bridging across these gaps using paired-end information, but I suspect these falsely terminating k-mers will inhibit filling of those gaps, promoting some off-target behavior.
A better scheme would be to: