Open jdimatteo opened 9 years ago
Cool idea -- what were you thinking for a different normalization option?
@semenko I was thinking of a method where all the counts for a given .bam file add up to 1.
For bamliquidator with bins, just divide each count by the total of the counts, e.g. if two bins and just one chromosome:
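As a minimal sketch of the bin case described above (the function name `unity_normalize` and the two counts are hypothetical and chosen only for illustration; this is not bamliquidator's actual code):

```python
def unity_normalize(counts):
    """Divide each count by the total so the results sum to 1."""
    total = sum(counts)
    if total == 0:
        # No reads at all: return zeros rather than dividing by zero.
        return [0.0 for _ in counts]
    return [c / total for c in counts]


# Hypothetical counts for two bins on a single chromosome.
bin_counts = [30, 10]
normalized = unity_normalize(bin_counts)
print(normalized)  # [0.75, 0.25]
```

Because every bin has the same width, no width correction is needed here; the normalized values depend only on each bin's share of the total.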
For regions, first divide each count by its region width to normalize on width. Then divide each of the width-normalized counts by the total of the width-normalized counts. For example, if two regions:
I should probably add twice the extension length to the region size for this calculation.
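The region case could be sketched roughly as follows. The function name, the `extension` parameter, and all the numbers are assumptions made for illustration; adding twice the extension to the width is the tentative idea from the comment above, not settled behavior.

```python
def unity_normalize_regions(counts, widths, extension=0):
    """Normalize region counts on width, then scale so they sum to 1.

    `extension` is a hypothetical parameter: if nonzero, twice its value
    is added to each region width before dividing.
    """
    effective_widths = [w + 2 * extension for w in widths]
    # Step 1: normalize each count on its (effective) region width.
    densities = [c / w for c, w in zip(counts, effective_widths)]
    # Step 2: divide by the total of the width-normalized counts.
    total = sum(densities)
    if total == 0:
        return [0.0 for _ in densities]
    return [d / total for d in densities]


# Hypothetical data: two regions with different widths.
region_counts = [100, 50]
region_widths = [100, 200]
result = unity_normalize_regions(region_counts, region_widths)
print(result)  # [0.8, 0.2]
```

Passing a nonzero `extension` widens both regions by the same absolute amount, which shifts weight toward the region that was originally narrower relative to its count.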
Maybe I'll add the command line option -u/--unity_normalization to use this alternative normalization method where things add up to 1, and the default will remain bases per million reads per base?
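One way the proposed flag might be wired up, as a sketch only: the parser below is a standalone hypothetical using Python's standard `argparse`, not bamliquidator_batch's real argument parser, and the help text is invented for illustration.

```python
import argparse

# Hypothetical standalone parser; the real tool's options are not shown here.
parser = argparse.ArgumentParser(description="normalization option sketch")
parser.add_argument(
    "-u", "--unity_normalization",
    action="store_true",  # off by default, so the existing normalization stays the default
    help="normalize so that all counts for a given .bam file add up to 1",
)

default_args = parser.parse_args([])
unity_args = parser.parse_args(["-u"])
print(default_args.unity_normalization)  # False
print(unity_args.unity_normalization)    # True
```

Using a boolean switch keeps the legacy behavior untouched unless the user opts in.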
@bradnerComputation: does this sound like a less arbitrary normalization option? Does adding twice the extension length to the region size make sense?
I don't want to add many normalization options, but I would like to add one more normalization option that isn't skewed by bams with different read lengths.
The current normalization method doesn't work well when comparing normalized counts across bams with different read lengths. We probably shouldn't change it, for legacy compatibility reasons, but we may want to provide a way to override it with a more robust option.
This probably isn't very important, since bamliquidator_batch usually serves as a low-level tool used by higher-level frameworks, and those higher levels could probably trivially calculate a different normalization value.