Closed JustinChu closed 7 years ago
Sounds great! We definitely need something like that for debugging.
The other thing that would be VERY useful is the information that comes from the repeat filter, so we can work on improving repeat recruitment. What about including the number of k-mers in the read that are considered repeats? Then we could trace whether a read ends up among the recruited reads even though it contains a considerable number of repeat k-mers (this applies to the tagged reads).
Edit proposed. I think I'll have to check the repeat filter, meaning we will incur a bit more computational cost, but only when we run -P. That shouldn't be a big deal, since it is for debugging anyway.
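To make the suggestion concrete, here is a minimal sketch of the per-read count. It is not BioBloom code: the function name `count_repeat_kmers` is hypothetical, and a plain Python set stands in for the repeat filter (the real filter is a Bloom filter, so its counts could include false positives).

```python
def count_repeat_kmers(read, repeat_kmers, k):
    """Return (repeat_hits, total_kmers) for one read sequence.

    repeat_kmers is a plain set standing in for the repeat filter
    (an assumption for illustration only).
    """
    # A read shorter than k contains no k-mers at all.
    total = max(len(read) - k + 1, 0)
    # Slide a window of length k over the read and count filter hits.
    hits = sum(1 for i in range(total) if read[i:i + k] in repeat_kmers)
    return hits, total


if __name__ == "__main__":
    repeats = {"AAAA", "AAAT"}
    hits, total = count_repeat_kmers("AAAAATGC", repeats, 4)
    print(f"{hits} of {total} k-mers flagged as repeats")
```

Reporting both the hit count and the total lets the ratio be computed later without knowing the read length or k.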
Code seems to be fine on my end: Branch https://github.com/bcgsc/biobloom/tree/ntHashBF 80248e0d5720eb64b3d6213e8a2ceec227305eca
Because -P is mostly used for debugging and diagnosing progressive read tagging, I should report more information than just the reads, for ease of analysis.
A proposed new format for the -P option in BBM: Read header should include:
Note: Assumes the header has no existing comment lines
e.g. (for a single read pair)
Suggestions on additional information to be provided are welcome. @KristinaGagalova @sahammond
Note that the changes will be made to the https://github.com/bcgsc/biobloom/tree/ntHashBF branch first. This is the version that uses ntHash and other speed optimizations; in some tests I have done, it is an order of magnitude faster. So far the code seems stable, with no major differences in results between the old and new code.