Tagging and recruitment based on quality string constraint

JustinChu commented 7 years ago

Add an option to take the quality string into account when tagging reads, to minimize the influence of low-quality k-mers.

Another possible option: k-mers below a specified quality are not used when considering matches?
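A rough sketch of what that second option could look like. It assumes Phred+33 quality strings and treats a k-mer's quality as the minimum quality of its bases; the function name and that definition are made up for illustration and are not BBT code.

```python
def kmers_passing_quality(seq, qual, k, min_q, offset=33):
    """Return (position, k-mer) pairs whose bases all have Phred quality >= min_q.

    Assumption: Phred+33 encoding, and "k-mer quality" means the lowest
    base quality inside the k-mer.
    """
    scores = [ord(c) - offset for c in qual]
    passing = []
    for i in range(len(seq) - k + 1):
        if min(scores[i:i + k]) >= min_q:
            passing.append((i, seq[i:i + k]))
    return passing


# One low-quality base ('#' is Q2) at index 4 removes every k-mer covering it.
print(kmers_passing_quality("ACGTACGT", "IIII#III", k=3, min_q=20))
```

Only the k-mers that survive the filter would then be looked up in the Bloom filter.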

sahammond commented 7 years ago

I think having both options would be powerful. Perhaps also add an option to set the size of the window used to consider base quality, i.e. whether a single poor-quality base is enough to cause BBT to abandon further k-mers, or whether the average quality of n consecutive bases is used instead.
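To make the window idea concrete, here is a minimal sketch, again assuming Phred+33 qualities. The function name and parameters are hypothetical. With `window=1` a single poor base triggers the cutoff; with a larger window only a low average does.

```python
def first_failing_window(qual, window, min_mean_q, offset=33):
    """Return the start index of the first window whose mean Phred quality
    is below min_mean_q, or None if every window passes.

    Assumption: Phred+33 encoding; window=1 reduces to a per-base check.
    """
    scores = [ord(c) - offset for c in qual]
    for i in range(len(scores) - window + 1):
        mean_q = sum(scores[i:i + window]) / window
        if mean_q < min_mean_q:
            return i
    return None


qual = "IIII#III"  # Q40 everywhere except one Q2 base at index 4
print(first_failing_window(qual, window=1, min_mean_q=20))  # single bad base is enough
print(first_failing_window(qual, window=4, min_mean_q=20))  # averaged over 4 bases
```

In this example the single Q2 base fails the per-base check, but the four-base average stays above 20, so the read would not be cut short.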

KristinaGagalova commented 7 years ago

I also agree with this idea: it looks like error k-mers play an important role in the tagging stage. It may be worth implementing a scoring system over consecutive bases, which looks more robust. I also have another concern: what happens if there is an adapter sequence in the reads? Will it be used for recruitment? Can we also treat adapter sequences as low-quality k-mers?

JustinChu commented 7 years ago

Scoring clearly needs to be different for tagging than for recruitment, which is why we allow an integer value to be used instead. I've been toying with an idea for scoring using k-mer run length (longer consecutive runs are better) rather than partial scores for each k-mer match.
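As an illustration of the difference, here is a toy comparison of a per-k-mer score (fraction of hits) with a run-length score (longest consecutive run of hits). Both function names and both scoring definitions are assumptions made for this sketch, not how BBT currently scores reads.

```python
from itertools import groupby


def hit_fraction(hits):
    """Fraction of k-mers that matched the filter (per-k-mer style score)."""
    return sum(hits) / len(hits) if hits else 0.0


def longest_hit_run(hits):
    """Length of the longest run of consecutive matching k-mers."""
    return max(
        (sum(1 for _ in run) for is_hit, run in groupby(hits) if is_hit),
        default=0,
    )


# Hypothetical per-k-mer hit vectors for two reads with the same hit count.
scattered = [True, False, True, False, True, False]
clustered = [True, True, True, False, False, False]

print(hit_fraction(scattered), longest_hit_run(scattered))
print(hit_fraction(clustered), longest_hit_run(clustered))
```

Both reads have the same hit fraction, but only the clustered one has a run longer than a single k-mer, which is the distinction a run-length score would pick up.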

If we implement this option, I think we have to limit it to trimming rather than removing "bad" k-mers, because if one quality score is bad in the middle of the read, we obliterate k k-mers, which is frankly too much. However, this means that it is effectively the same as feeding BBT a set of trimmed reads.
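The "k k-mers" cost, and what a trim would do instead, can be sketched as follows. This assumes Phred+33 qualities and a simple 3' trim at the first base below the threshold; the function names and that trimming rule are chosen only for illustration.

```python
def kmers_lost_to_bad_base(read_len, k, bad_pos):
    """Count the k-mers of a read that overlap a single bad base at bad_pos."""
    first = max(0, bad_pos - k + 1)      # first k-mer start that covers bad_pos
    last = min(bad_pos, read_len - k)    # last k-mer start that covers bad_pos
    return max(0, last - first + 1)


def trim_3prime(seq, qual, min_q, offset=33):
    """Cut the read at the first base with Phred quality below min_q.

    Assumption: Phred+33 encoding and a plain 3' trim, not any specific
    trimming tool's algorithm.
    """
    for i, c in enumerate(qual):
        if ord(c) - offset < min_q:
            return seq[:i], qual[:i]
    return seq, qual


# A bad base in the middle of a 20 bp read with k=5 takes out k k-mers.
print(kmers_lost_to_bad_base(read_len=20, k=5, bad_pos=10))
print(trim_3prime("ACGTACGT", "IIII#III", min_q=20))
```

A bad base near either end of the read costs fewer than k k-mers, since fewer k-mers overlap it there.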

Adaptors are an issue, but I think adaptor trimming will have to be done by another tool. Mostly, I'm not sure how I'd implement the trimming inside the code.

bcgsc / biobloom, issue #22