bcgsc / biobloom

Create Bloom filters for a given reference and then use it to categorize sequences
http://www.bcgsc.ca/platform/bioinfo/software/biobloomtools
GNU General Public License v3.0

Detailed progressive bloom filter report #25

Closed JustinChu closed 7 years ago

JustinChu commented 7 years ago

Because -P is mostly used for debugging and diagnosing progressive read tagging, I should report more information than just the reads to make analysis easier.

Proposed new format for the -P option in BBM. The read header should include:

  1. Number of k-mers included from this read
  2. Number of k-mers already in filter
  3. Number of k-mers of read that match repeat filter (*edit thanks @KristinaGagalova)
  4. Number of read pairs already in filter
  5. Index of the read pair in the file

Note: This assumes the read header has no existing comment.

e.g. (for a single read pair)

@read/1 20 1289300 0 1244 12002
TGGTGCCCAGCAGCGTTTGTAGCGCAATGAGAATTTGCTGCGTCAGACATTCCTGCACCTGCGGACGCTTGGCAAAGAAATGCACAATGCGGTTAATTTTTGACAGACCGATCACCGAAT
CTTTCGGGATATAGGCCACCGTCGCTTTGC
+
2611023222153/43222062322.422/12303462216553442222514220112424444034412251261012142142123.4232210/0/22222231342242131021
512223302201430131050232254013
@read/2 126 1289300 5 1244 12002
TGGTGCCCAGCAGCGTTTGTAGCGCAATGAGAATTTGCTGCGTCAGACATTCCTGCACCTGCGGACGCTTGGCAAAGAAATGCACAATGCGGTTAATTTTTGACAGACCGATCACCGAAT
CTTTCGGGATATAGGCCACCGTCGCTTTGC
+
2611023222153/43222062322.422/12303462216553442222514220112424444034412251261012142142123.4232210/0/22222231342242131021
512223302201430131050232254013
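
For anyone scripting against this output, here is a minimal sketch of how the proposed header could be parsed. It assumes the five counts appear in the order listed above, separated by whitespace, after a read ID that contains no spaces. The names `ProgressiveHeader` and `parse_progressive_header` are hypothetical and not part of BioBloom.

```python
from typing import NamedTuple


class ProgressiveHeader(NamedTuple):
    """Fields of the proposed -P read header (hypothetical names)."""
    read_id: str
    kmers_added: int      # 1. k-mers included from this read
    kmers_in_filter: int  # 2. k-mers already in filter
    repeat_kmers: int     # 3. k-mers of the read that match the repeat filter
    pairs_in_filter: int  # 4. read pairs already in filter
    pair_index: int       # 5. index of the read pair in the file


def parse_progressive_header(line: str) -> ProgressiveHeader:
    """Split one '@id n1 n2 n3 n4 n5' header line into typed fields."""
    fields = line.rstrip("\n").split()
    # Assumption: exactly one ID plus five integer fields, no other comment.
    if len(fields) != 6 or not fields[0].startswith("@"):
        raise ValueError(f"unexpected header layout: {line!r}")
    return ProgressiveHeader(fields[0][1:], *map(int, fields[1:]))
```

Calling `parse_progressive_header` on the two header lines of the example gives records whose `pairs_in_filter` and `pair_index` agree, as expected for mates of the same pair.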

Suggestions on additional information to be provided are welcome. @KristinaGagalova @sahammond

Note that the changes will be made to the https://github.com/bcgsc/biobloom/tree/ntHashBF branch first. This is the version that uses ntHash and other speed optimizations, and in some tests I have done it was an order of magnitude faster. So far the code seems stable, with no major differences in results between the old and new code.

KristinaGagalova commented 7 years ago

Sounds great! We definitely need something like that for debugging.

The other thing that will be VERY useful is information from the repeat filter, so we can work on improving how repeats are recruited. What about including the number of k-mers in the read that are considered repeats? Then we can trace back whether a read ends up among the recruited reads even though it contains a considerable number of repeat k-mers (this is for the tagged reads).
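
A rough sketch of the kind of trace-back this would allow, assuming the header layout from the edited proposal above (the repeat k-mer count as the third number after the read ID). The function name `repeat_heavy_reads` and the threshold are hypothetical, and the caller is assumed to pass only header lines.

```python
from typing import Iterable, Iterator, Tuple


def repeat_heavy_reads(
    header_lines: Iterable[str], min_repeat_kmers: int
) -> Iterator[Tuple[str, int]]:
    """Yield (read_id, repeat_kmers) for recruited reads at or above a threshold."""
    for line in header_lines:
        fields = line.split()
        # Skip anything that does not look like '@id n1 n2 n3 n4 n5'.
        if len(fields) != 6 or not fields[0].startswith("@"):
            continue
        repeat_kmers = int(fields[3])  # assumed position of the repeat count
        if repeat_kmers >= min_repeat_kmers:
            yield fields[0][1:], repeat_kmers
```

The threshold is an absolute count here; a fraction of the read's k-mers would need the k-mer size as well, which the header does not carry.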

JustinChu commented 7 years ago

Edit proposed. I think I'll have to check the repeat filter, meaning we will incur a bit more computational cost, but only when we run -P. Shouldn't be a big deal since it is for debugging anyway.

JustinChu commented 7 years ago

Code seems to be fine on my end. Branch: https://github.com/bcgsc/biobloom/tree/ntHashBF, commit 80248e0d5720eb64b3d6213e8a2ceec227305eca