genome / bam-readcount

Count bases in BAM/CRAM files
MIT License

Feature Request: Statistics per strand #13

Open smacarthur opened 10 years ago

smacarthur commented 10 years ago

It would be really useful if the statistics from the read counts were split by strand, for example the count of As on the Fwd and Rev strands, and the mean base quality on each strand. This would be especially helpful for enrichment data, where there may be a strand bias. Let me know if you want some more use cases. Thanks,

Stewart

ernfrid commented 10 years ago

Hi Stewart,

Strand-specific counts are already in the output. They are documented as:

num_plus_strand → number of reads on the plus/forward strand

num_minus_strand → number of reads on the minus/reverse strand
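For readers who want to pull these two fields out of the output, here is a minimal Python sketch. It assumes the tab-separated layout described in the bam-readcount README (chromosome, position, reference base, depth, then one colon-separated block per base) and that num_plus_strand and num_minus_strand are the sixth and seventh entries of each block; check that order against the version you run. The function name `strand_counts` and the example line are made up for illustration.

```python
# Assumed positions within each colon-separated per-base block
# (0-based; verify against your bam-readcount version's documentation).
PLUS_IDX = 5   # num_plus_strand
MINUS_IDX = 6  # num_minus_strand


def strand_counts(line):
    """Return (chrom, pos, ref, depth, {base: (plus, minus)}) for one output line."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, ref, depth = cols[0], int(cols[1]), cols[2], int(cols[3])
    per_base = {}
    for block in cols[4:]:
        parts = block.split(":")
        # parts[0] is the base label (=, A, C, G, T, N, or an indel string)
        per_base[parts[0]] = (int(parts[PLUS_IDX]), int(parts[MINUS_IDX]))
    return chrom, pos, ref, depth, per_base


# Synthetic line, not real data: 8 reads supporting A (5 plus, 3 minus).
example = "\t".join([
    "chr1", "100", "A", "8",
    "=:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00",
    "A:8:60.00:35.00:0.00:5:3:0.50:0.01:10.00:0:0.00:100.00:0.45",
])
chrom, pos, ref, depth, counts = strand_counts(example)
print(counts["A"])
```

A large imbalance between the two numbers for a given base is the kind of strand bias the original request is concerned with.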

Breaking out the individual summary metrics by strand as well is not something I had considered before, but I can see the utility of providing that information. I will need to think about how best to report it within the existing file format or if this would require more extensive changes in the output format.

smacarthur commented 10 years ago

I saw the counts per strand, which are useful. I think having the other things like base quality per strand would also be really useful, though I understand the problems with the output. The easiest and ugliest way would be to have comma-delimited values within the colon-separated values, e.g. 31,39:31.9,32.4: etc., or you could use the same notation you have for the libraries with the curly braces, though I think that is actually far more difficult to parse. On a similar note, do you have tools to parse the output? I have been parsing the output into an R data structure based on GenomicRanges, which works well for me.
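To make the comma-within-colon idea concrete, here is a rough Python sketch of how such a block could be parsed. This format is only the proposal from this comment, not something bam-readcount emits, and the function name `parse_stranded` and the forward-then-reverse ordering are assumptions for illustration.

```python
def parse_stranded(block):
    """Split a hypothetical 'fwd,rev:fwd,rev:...' block into (fwd, rev) pairs."""
    pairs = []
    for field in block.split(":"):
        # Each colon-separated field holds one metric as "forward,reverse".
        fwd, rev = field.split(",")
        pairs.append((float(fwd), float(rev)))
    return pairs


pairs = parse_stranded("31,39:31.9,32.4")
print(pairs)
```

Two nested `split` calls are enough here, which is the main argument for this layout over the curly-brace notation.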

isthisthat commented 9 years ago

+1 any more thoughts on this? Also a summary option would be useful to collapse the =ACGTN stats into an average value. Thank you.