BradnerLab / pipeline

bradner lab computation pipeline scripts

provide option for a different normalization method #42

Open jdimatteo opened 10 years ago

jdimatteo commented 10 years ago

The current normalization method doesn't work well when comparing normalized counts across bams with different read lengths. We probably shouldn't change it, for legacy compatibility reasons, but we may want to provide a way to override it with a more robust option.

This probably isn't very important, since bamliquidator_batch usually serves as a low-level tool used by higher-level frameworks, and those frameworks could probably trivially calculate a different normalization value themselves.

semenko commented 10 years ago

Cool idea -- what were you thinking for a different normalization option?

jdimatteo commented 9 years ago

@semenko I was thinking a method where all the counts for a given .bam file add up to 1

For bamliquidator with bins, just divide each count by the total of the counts, e.g. if two bins and just one chromosome:

  1. chr1, bin 1 count is 3, normalized count is 3/(3+4)
  2. chr1, bin 2 count is 4, normalized count is 4/(3+4)
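
A minimal sketch of the bin case described above, assuming the raw counts are available as a plain list. The helper name `unity_normalize_bins` is hypothetical and not part of bamliquidator:

```python
def unity_normalize_bins(counts):
    """Scale raw bin counts so that they add up to 1 (hypothetical helper)."""
    total = sum(counts)
    if total == 0:
        # No counts at all: return zeros rather than dividing by zero.
        return [0.0 for _ in counts]
    return [count / total for count in counts]


# The two-bin example from this comment: counts 3 and 4.
bin_fractions = unity_normalize_bins([3, 4])
print(bin_fractions)
```

The zero-total guard is a design choice of this sketch; the thread does not say how an empty bam should be handled.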

For regions, first divide each count by the region width to normalize on width. Then divide each width-normalized count by the total of the width-normalized counts. For example, if two regions:

  1. chr1, start 100, stop 200, count is 1: the width-normalized count is 1/(200-100) = 0.01, so the final normalized count is 0.01/(0.01+0.003) = 0.769230...
  2. chr2, start 2000, stop 3000, count is 3: the width-normalized count is 3/(3000-2000) = 0.003, so the final normalized count is 0.003/(0.01+0.003) = 0.230769...
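
The region case can be sketched the same way, assuming each region is a `(start, stop, count)` tuple. The function name `unity_normalize_regions` and the tuple layout are illustrative assumptions, not bamliquidator's actual data model:

```python
def unity_normalize_regions(regions):
    """Return per-region fractions that add up to 1.

    regions: list of (start, stop, count) tuples (hypothetical layout).
    """
    # Step 1: normalize each count by the width of its region.
    densities = [count / (stop - start) for start, stop, count in regions]
    # Step 2: divide by the total so the results add up to 1.
    total = sum(densities)
    if total == 0:
        return [0.0 for _ in densities]
    return [density / total for density in densities]


# The two-region example from this comment.
region_fractions = unity_normalize_regions([(100, 200, 1), (2000, 3000, 3)])
print(region_fractions)
```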

I should probably add twice the extension length to the region size for this calculation.
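
If the extension is added on both sides, the only change is the width used in the first step. A sketch under that assumption, where `extension` is a hypothetical parameter holding the extension length in bases:

```python
def unity_normalize_extended(regions, extension):
    """Like the region normalization, but each width grows by 2 * extension.

    regions: list of (start, stop, count) tuples (hypothetical layout).
    extension: bases added on each side of every region (assumed meaning).
    """
    densities = [
        count / ((stop - start) + 2 * extension)
        for start, stop, count in regions
    ]
    total = sum(densities)
    if total == 0:
        return [0.0 for _ in densities]
    return [density / total for density in densities]


extended_fractions = unity_normalize_extended(
    [(100, 200, 1), (2000, 3000, 3)], extension=200
)
print(extended_fractions)
```

With a nonzero extension the narrow region is penalized less for its small width, so the split between the two regions shifts compared with the unextended calculation.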

Maybe I'll add the command line option -u/--unity_normalization to use this alternative normalization method where things add up to 1, and the default will remain bases per million reads per base?

@bradnerComputation : does this sound like a less arbitrary normalization option? Does adding twice the extension length to the region size make sense?

I don't want to add many normalization options, but I would like to add one more normalization option that isn't skewed by bams with different read lengths.
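
To illustrate the read-length skew, here is a toy comparison. It assumes the default has the form described in this thread (bases counted in a bin, divided by millions of reads and by bin width) and that every read lies fully inside one bin. Both are simplifying assumptions, and `default_normalize` is not taken from the bamliquidator source:

```python
def default_normalize(base_count, total_reads, width):
    # Assumed form of the default: bases per million reads per base of bin.
    return base_count / (total_reads / 1_000_000) / width


def unity_normalize(base_counts):
    # Proposed alternative: counts for one bam add up to 1.
    total = sum(base_counts)
    return [count / total for count in base_counts]


# Two hypothetical bams with identical read placement; only read length differs.
reads_per_bin = [30, 10]
total_reads = 1_000_000
width = 1000

short_counts = [n * 50 for n in reads_per_bin]   # 50 bp reads
long_counts = [n * 100 for n in reads_per_bin]   # 100 bp reads

default_short = [default_normalize(c, total_reads, width) for c in short_counts]
default_long = [default_normalize(c, total_reads, width) for c in long_counts]
unity_short = unity_normalize(short_counts)
unity_long = unity_normalize(long_counts)

print(default_short, default_long)  # values differ by the read-length ratio
print(unity_short, unity_long)      # values agree between the two bams
```

In this toy setup the default values double when the read length doubles, while the unity-normalized values are unchanged.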