Closed JustinChu closed 5 years ago
I think having both options would be powerful. Perhaps also an option to set the size of the window used to consider base quality, i.e. whether a single poor-quality base is enough to cause BBT to abandon further k-mers, or whether the average quality of n consecutive bases should be used instead.
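To make the window idea concrete, here is a minimal Python sketch, not BBT code. The function names, the `min_q` and `window` parameters, and the Phred+33 encoding are all assumptions for illustration. With `window=1` a single poor base stops k-mer extraction; with a larger window, the mean of that many consecutive bases has to drop below the threshold.

```python
def phred_scores(qual, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding; this is an assumption, not something BBT dictates.
    """
    return [ord(c) - offset for c in qual]


def stop_position(qual, min_q, window=1):
    """Return the start index of the first window whose mean quality is below min_q.

    window=1: a single poor-quality base is enough.
    window=n: the average of n consecutive bases must fall below min_q.
    Returns len(qual) if no window fails (hypothetical helper, for illustration).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    scores = phred_scores(qual)
    for i in range(len(scores) - window + 1):
        if sum(scores[i:i + window]) / window < min_q:
            return i
    return len(scores)


# 'I' decodes to Q40 and '#' to Q2 under Phred+33.
single_bad = "IIII#IIII"
print(stop_position(single_bad, min_q=20, window=1))  # stops at the bad base
print(stop_position(single_bad, min_q=20, window=3))  # one bad base is tolerated
```

The trade-off is that a larger window tolerates isolated errors but reacts later to a genuine drop in quality.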
I also agree with this idea: it looks like erroneous k-mers play an important role in the tagging stage. It may be worth implementing a scoring system over consecutive bases, which looks more robust. I also have another concern: what happens if the reads contain an adapter sequence? Will it be used for recruitment? Could we also treat adapter sequences as low-quality k-mers?
Scoring clearly needs to be different for tagging than for recruitment, which is why we allow an integer value to be used instead. I've been toying with an idea for scoring using k-mer run length (longer consecutive runs are better) rather than partial scores for each k-mer match.
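A rough Python sketch of how the two scoring styles would differ, assuming each read has already been reduced to a list of per-k-mer hit/miss flags. This is only an illustration of the idea; the function names are hypothetical and it does not reflect how BBT scores reads internally.

```python
def hit_count(hits):
    """Per-k-mer scoring: every matching k-mer contributes one point."""
    return sum(1 for h in hits if h)


def longest_run(hits):
    """Run-length scoring: length of the longest stretch of consecutive hits."""
    best = current = 0
    for h in hits:
        current = current + 1 if h else 0
        best = max(best, current)
    return best


# Two reads with the same number of matching k-mers.
scattered = [True, False, True, False, True, False]
contiguous = [True, True, True, False, False, False]

print(hit_count(scattered), longest_run(scattered))
print(hit_count(contiguous), longest_run(contiguous))
```

Both reads have three hits, but only the contiguous one has a run of three, so run-length scoring would separate isolated hits (more likely false positives) from a genuine stretch of matches.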
If we implement this option, I think we have to limit it to trimming rather than removing "bad" k-mers, because if one quality score is bad in the middle of the read, we obliterate k k-mers, which is frankly too much. However, this means that it is effectively the same as feeding BBT a set of trimmed reads.
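To put numbers on the "k k-mers" point, here is a small Python sketch. The helper names, the read length, and the value of k are made up for illustration, and the trimming function assumes Phred+33 and trims at the first base below the threshold, which is only one of several possible trimming rules.

```python
def kmers_lost_by_masking(read_len, k, bad_positions):
    """Count k-mers that overlap at least one bad base position (0-based)."""
    n_kmers = max(read_len - k + 1, 0)
    lost = 0
    for start in range(n_kmers):
        if any(start <= p < start + k for p in bad_positions):
            lost += 1
    return lost


def trim_at_first_bad(seq, qual, min_q, offset=33):
    """Cut the read at the first base whose quality is below min_q.

    Assumes Phred+33 encoding; hypothetical helper, not part of BBT.
    """
    for i, c in enumerate(qual):
        if ord(c) - offset < min_q:
            return seq[:i], qual[:i]
    return seq, qual


# A 100 bp read with k=25 has 76 k-mers; one bad base at position 50
# touches every k-mer starting at positions 26 through 50.
print(kmers_lost_by_masking(read_len=100, k=25, bad_positions=[50]))

print(trim_at_first_bad("ACGTACGT", "IIII#III", min_q=20))
```

A bad base within k-1 positions of either end of the read costs fewer k-mers, since fewer k-mers overlap it there.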
Adaptors are an issue, but I think adaptor trimming will have to be handled by another tool. Mostly, I'm not sure how I'd implement the trimming inside the code.
Add an option to take the quality string into account when tagging reads, to minimize the use of low-quality k-mers.
Possible other options: any k-mers below a specified quality should not be used when considering matches?