ArnaudDroitLab / metagene2

4 stars 0 forks source link

Normalize BAM coverages by total reads within each sample #39

Open aliciagallego opened 8 months ago

aliciagallego commented 8 months ago

Hello,

I would like to do metaplots based on only one BAM file and several sets of regions. However, for each set region, I want to normalize each bin by the total coverage across all bins in that region set. This is because I am interested in measuring the slope of the metaplot removing the effect of the expression levels. I have tried to do it manually (code explained below) but I have a couple of questions.

Manual approximation: I have modified the data frame (df) obtained with the function add_metadata. So, I first calculated the sum of the coverages across all bins in each set of regions (A and B). Second, I divided the coverage of each bin by the total sum calculated in the previous step:

mg <- metagene2$new(regions = regionsAB, bam_files = bam_file)

mg$group_coverages(normalization="RPM")

df <- mg_ali$add_metadata()

total_A<-sum(df[which(df$region_name=="listA"),1]) total_B<-sum(df[which(df$region_name=="listB"),1])

for(i in 1:nrow(df)){ if(df[i,"region_name"]=="listA"){ df[i,1]<-df[i,1]/total_A} else if (df[i,"region_name"]=="listB"){ df[i,1]<-df[i,1]/total_B}}

metaplot <- plot_metagene(df)

Results: To illustrate what I want to do, I show you the original figure just normalizing by using $group_coverages(normalization="RPM") function (up) and the one that I generated after applying manual normalization of the bins (bottom).

Questions:

  1. Is it there any way for doing this bin normalization by using any regular function or parameter of metagene2?
  2. If my manual approximation is right, how could I re-calculate the confidence intervals of the new bin values in order to include the qinf and qsup ranges in the plot?

prueba_NormalizeByBin prueba_NormalizeByBin2

ericfournier2 commented 8 months ago

Hi,

unfortunately, metagene2 does not support this kind of normalization intrinsically. This is why we added functions to access the dataframes though, so you could do it on your own the way you are doing now.

I'm not familiar with the biological problem you are trying to model here, so I cannot vouch that metagene2's resampling approach is appropriate (especially if you have a single BAM file), but you can find metagene2's code for its resampling strategy in the permutation.R file: https://github.com/ArnaudDroitLab/metagene2/blob/master/R/permutation.R

Hope this helps!

ericfournier2 commented 8 months ago

I -think- that since you're only dividing everything by a constant factor, you could also divide the old confidence intervals by the same factor and you'd get the same results, but that's just my gut instinct.

aliciagallego commented 8 months ago

Thank you Eric, that worked perfectly well!

I -think- that since you're only dividing everything by a constant factor, you could also divide the old confidence intervals by the same factor and you'd get the same results, but that's just my gut instinct.