LTLA / csaw

Clone of the Bioconductor repository for the csaw package.
https://bioconductor.org/packages/devel/bioc/html/csaw.html

how to use the scaled factor to normalize the bigwig files of IP when compared with negative control? #15

Open eggprincess917 opened 3 years ago

eggprincess917 commented 3 years ago

Dear Aaron,

I have a question about the scale factor generated by scaleControlFilter. I used the function "windowCounts" to count reads from the BAM files of both my IP and my negative control (whole chromatin, i.e., without antibody pulldown). I then used the function "scaleControlFilter" to get the normalization factors for each strain. Next, I want to use the normalization factor to normalize the "score" in the fourth column of the GRanges (after converting the bigwig files into GRanges), since I have different types of IP (such as H3K4me1 and H3K4me2) that all use the same set of negative controls. If I want to multiply the "score" in the GRanges by a constant value, how do I convert the normalization factor into this constant? Is it the same as the formula

constant value = 1/(NormFactor * LibSize / 1000000)

used when normalizing the different IP samples with the TMM method? Here LibSize is the library size of the IP, and NormFactor is generated by the function "calcNormFactors" in the edgeR package. If I am wrong, I would appreciate it if you could correct me. I am looking forward to your reply!

Many thanks and best, Lingling.

LTLA commented 3 years ago

Yes, that calculation is correct, assuming your coverage corresponds to raw read coverage, i.e., the score for any given genomic interval is literally the number of reads in that interval. In this case, the scaling is equivalent to how edgeR computes offsets.
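As a minimal sketch of the arithmetic confirmed above, the following Python snippet computes the constant and applies it to raw per-interval read counts. The function names and the sample values (normalization factor, library size, scores) are hypothetical and chosen only for illustration; it assumes the scores are raw read coverage.

```python
def scale_constant(norm_factor, lib_size):
    """Constant from the thread: 1 / (NormFactor * LibSize / 1e6)."""
    effective_lib_size = norm_factor * lib_size
    return 1.0 / (effective_lib_size / 1_000_000)


def normalize_scores(raw_scores, norm_factor, lib_size):
    """Multiply every raw score by the per-sample constant."""
    k = scale_constant(norm_factor, lib_size)
    return [score * k for score in raw_scores]


# Hypothetical sample: values are made up for illustration.
raw_scores = [0, 12, 40, 7]   # raw read counts per interval
norm_factor = 0.8             # e.g. as returned by a TMM-style method
lib_size = 25_000_000         # total reads in this sample

normalized = normalize_scores(raw_scores, norm_factor, lib_size)
print(normalized)
```

Each sample gets its own constant, so the same function would be called once per IP or control sample with that sample's factor and library size.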

If the score contains some value that's already normalized, then the situation is trickier. If you're lucky, any pre-existing normalization is done by library size, in which case your constant value would just be the sample's normalization factor.

eggprincess917 commented 3 years ago

Dear Aaron, thanks for your kind reply. I still have two questions. My first question is: If I plan to get the normalization factors based on the negative control, once I normalized the IP bigwig files with the constant value, in order to get a matrix for plotting the metaplot genome-wide, do I need to use the value of IP to divide the value in the input? For example, if I want to draw the metaplot of the gene body for specific IP, should I use the matrix of normalized IP/input or just the matrix of normalized IP to get such figures?

My second question is: is the formula for getting the constant value (when using the loess strategy ) same as the formula of constant value above? I would appreciate it if you could give me some suggestions on that!

I am looking forward to your reply!

Many thanks and best regards, Lingling.

LTLA commented 3 years ago

If I plan to get the normalization factors based on the negative control, once I normalized the IP bigwig files with the constant value, in order to get a matrix for plotting the metaplot genome-wide, do I need to use the value of IP to divide the value in the input? For example, if I want to draw the metaplot of the gene body for specific IP, should I use the matrix of normalized IP/input or just the matrix of normalized IP to get such figures?

I don't know what you mean here. When you compute normalization factors, you should get one factor for each sample, regardless of whether they are IP or control. So just follow the same procedure to obtain normalized coverage for each sample. It would not make any sense to compare normalized coverage for IP with the unnormalized coverage of the control.

is the formula for getting the constant value (when using the loess strategy ) same as the formula of constant value above?

No. There is no constant value for the loess, because the normalization of each window depends on its abundance. It would not be straightforward to compute normalized coverage of genomic tracks in this case.
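To illustrate why no single constant exists in the trended case, here is a small sketch in Python. The linear `hypothetical_trend` is only a stand-in for a fitted loess curve of log-fold change against abundance; it is not what csaw computes, and the numbers are invented.

```python
def hypothetical_trend(abundance):
    # Stand-in for a fitted loess curve of log2 fold change (M) against
    # average log2 abundance (A). A real fit would be estimated from data.
    return 0.1 * abundance - 0.5


def window_scale(abundance):
    # Multiplicative factor that would remove the fitted bias for one window.
    return 2.0 ** (-hypothetical_trend(abundance))


low_abundance_scale = window_scale(2.0)   # window with low coverage
high_abundance_scale = window_scale(8.0)  # window with high coverage

# The two windows need different factors, so a single per-sample
# constant cannot reproduce the correction.
print(low_abundance_scale, high_abundance_scale)
```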

eggprincess917 commented 3 years ago

Dear Aaron,

Thank you from the bottom of my heart for your patient and kind reply. Now I know how to do it! I am wondering if it is possible to compare the patterns before normalization and after normalization via MA plot to check the efficiency of TMM composition bias correction? For example, can I check the MA plot when only using the CPM or the MA plot with CPM and TMM together?
I found that after normalization with CPM and TMM composition bias correction, the line y=0 passes through the centre of the cloud. I just want to confirm that this is due to the TMM-based normalization rather than a property of the original data. I look forward to your reply when you have time.

Stay healthy!

Best, Lingling.

LTLA commented 3 years ago

I am wondering if it is possible to compare the patterns before normalization and after normalization via MA plot to check the efficiency of TMM composition bias correction? For example, can I check the MA plot when only using the CPM or the MA plot with CPM and TMM together?

Yes. Just create two MA plots, one using the CPMs computed with the standard library sizes, another with CPMs computed with the effective library sizes (i.e., multiplied by the TMM normalization factors).
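A rough sketch of the two sets of MA values in Python, assuming two samples and a handful of windows. The counts, library sizes, normalization factors and the `prior` added before taking logs are all hypothetical; the prior is a simplification to avoid log(0) and is not how edgeR handles prior counts.

```python
import math


def log_cpm(counts, lib_size, prior=0.5):
    """log2 counts-per-million for one sample (simplified prior count)."""
    per_million = lib_size / 1_000_000
    return [math.log2((c + prior) / per_million) for c in counts]


def ma_values(counts_a, counts_b, lib_a, lib_b):
    """Return (M, A): log-fold changes and average log-abundances."""
    la = log_cpm(counts_a, lib_a)
    lb = log_cpm(counts_b, lib_b)
    m = [x - y for x, y in zip(la, lb)]
    a = [(x + y) / 2 for x, y in zip(la, lb)]
    return m, a


# Hypothetical data for two samples over four windows.
counts_a = [10, 20, 40, 80]
counts_b = [20, 40, 80, 160]
lib_a, lib_b = 1_000_000, 1_000_000
nf_a, nf_b = 2 ** -0.5, 2 ** 0.5  # made-up normalization factors

# Plot 1: CPMs from the standard library sizes.
m_raw, a_raw = ma_values(counts_a, counts_b, lib_a, lib_b)

# Plot 2: CPMs from the effective library sizes (library size * factor).
m_eff, a_eff = ma_values(counts_a, counts_b, lib_a * nf_a, lib_b * nf_b)

print(m_raw)
print(m_eff)
```

Scaling normalization shifts every M value by the same amount, log2(nf_b / nf_a), so comparing the two plots shows whether the cloud is centred on zero only after the factors are applied.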

eggprincess917 commented 3 years ago

Dear Aaron, thank you for your kind reply. I still have questions about the normalization. I have used the TMM method to normalize my raw bigwig files (the fourth column) to eliminate composition bias. My data is ChIP-seq data, so the next step is to analyze differential binding peaks among different types of samples using edgeR. But I found that edgeR only accepted raw bam files as the input to generate the DGEList for the downstream analysis. I would prefer to use the following strategy to identify the differential binding peaks:

  1. I have got a list of peaks via the function "callpeak" in MACS2 and then I pool all the peaks from all samples(like all different strains and all replicates) of one specific type of IP, like H2A.Z.
  2. To get a count matrix for H2A.Z, I use countOverlaps in GenomicRanges to count the reads mapped to each peak.
  3. I then want to prepare the DGEList by hand, instead of generating it from the raw bam files in edgeR, since I want to keep the normalized bigwig files consistent with the count matrix of all H2A.Z peaks.
  4. Finally, I do the downstream analysis to get the differential binding peaks.

My confusion comes from step 3. As you know, there are several options when preparing the DGEList, for example the scaling factor and the library size. Since edgeR needs the raw count matrix, I have to use the raw bigwig files to generate the count matrix of peaks in the different samples for the H2A.Z IP. I also need to set the library size and scaling factor. For the library size, I am wondering should I use the library size generated when I normalized the raw bigwig files (based on TMM-based composition bias) or should I re-calculate the library size by colSums(count Matrix)? For the scaling factor, the question is the same: can I directly use the scaling factor obtained from calcNormFactors when normalizing the raw bigwig files, or should I use the constant value I mentioned above? Or should I re-calculate new scaling factors based on the raw count matrix imported into the DGEList?

I would appreciate it if you could answer the questions above.

Many thanks and best regards, Lingling.

LTLA commented 3 years ago

But I found that edgeR only accepted raw bam files as the input to generate the DGEList for the downstream analysis.

edgeR just needs counts, it doesn't care where they come from. Perhaps you're talking about csaw here.
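To make the point about counts concrete, here is a small Python sketch of building a peak-by-sample count matrix from intervals alone. It is a conceptual stand-in for overlap counting, not the GenomicRanges implementation; the peaks, reads and sample names are invented, and intervals are assumed to be (chrom, start, end) with inclusive coordinates.

```python
def count_overlaps(peaks, reads):
    """Count reads overlapping each peak by at least one base.

    Intervals are (chrom, start, end) tuples with inclusive coordinates.
    """
    counts = []
    for p_chrom, p_start, p_end in peaks:
        n = sum(
            1
            for r_chrom, r_start, r_end in reads
            if r_chrom == p_chrom and r_start <= p_end and r_end >= p_start
        )
        counts.append(n)
    return counts


# Hypothetical pooled peak set and per-sample read intervals.
peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
samples = {
    "rep1": [("chr1", 90, 120), ("chr1", 150, 180), ("chr1", 590, 640)],
    "rep2": [("chr1", 201, 250), ("chr1", 550, 560), ("chr2", 100, 150)],
}

# One column of raw counts per sample, one row per peak.
count_matrix = {name: count_overlaps(peaks, reads) for name, reads in samples.items()}
print(count_matrix)
```

The resulting raw counts are the kind of input a count-based method expects, regardless of which tool produced them.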

  • I have got a list of peaks via the function "callpeak" in MACS2 and then I pool all the peaks from all samples(like all different strains and all replicates) of one specific type of IP, like H2A.Z.

Consider pooling reads before peak calling, see https://pubmed.ncbi.nlm.nih.gov/24852250/.

For the library size, I am wondering should I use the library size generated when I normalized the raw bigwig files (based on TMM-based composition bias) or should I re-calculate the library size by colSums(count Matrix)?

See the distinction between normalizing for IP efficiency vs composition bias in http://bioconductor.org/books/3.13/csawBook/chap-norm.html#sec:normchoice.