Which is consider as peak enrichment in output

Genometric / MSPC

Using combined evidence from replicates to evaluate ChIP-seq peaks

https://genometric.github.io/MSPC/

GNU General Public License v3.0


Which is consider as peak enrichment in output #88

EllieDuan closed this issue 5 years ago

EllieDuan commented 5 years ago

Hi, I have a question about the output from MSPC. Should I use the p-value (I think this is -log10(p-value)?) or the value in xSquare as the enrichment score for peak intensity? I would like to determine the peak enrichment in a list of genes. Thank you!

Best, Ellie

VJalili commented 5 years ago

Hi Ellie,

A sample output should look like:

chr1    10  20  name    10.495  48.33   10.495  10.169

Column 1: chromosome;
Column 2: Start position;
Column 3: Stop position;
Column 4: Name;
Column 5: p-value in -log10 format. In all files except ConsensusPeaks, this is the original p-value of the peak as parsed from the source; in the ConsensusPeaks file, it is the right-tailed probability of the X^2 of the combined peaks;
Column 6: X^2 of the p-values of all the peaks overlapping this peak (including the peak itself), with 2k degrees of freedom, where k is the number of overlapping peaks;
Column 7: right-tailed probability of the X^2, in -log10 format;
Column 8: adjusted p-value (column 5) using the Benjamini-Hochberg procedure at the given alpha threshold, in -log10 format.

Hence if you're interested in the combined stringency (how stringent is a peak based on all the peaks overlapping with it?), you may use columns 6 or 7.
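To make columns 6 and 7 concrete: the description (X^2 of k p-values with 2k degrees of freedom) reads like Fisher's combined probability test. The sketch below assumes that is the case; the function name is made up for illustration and this is not MSPC's actual code. It uses only the standard library, relying on the closed-form right tail of the chi-square distribution for even degrees of freedom.

```python
import math

def fisher_combined(neg_log10_pvalues):
    """Combine p-values given as -log10(p), assuming Fisher's method.

    Returns (x_squared, neg_log10_right_tail), i.e. values comparable
    to columns 6 and 7 under the assumption stated above.
    """
    k = len(neg_log10_pvalues)
    # X^2 = -2 * sum(ln p_i) = 2 * ln(10) * sum(-log10 p_i)
    x_squared = 2.0 * math.log(10) * sum(neg_log10_pvalues)
    # For 2k degrees of freedom the right tail has a closed form:
    #   P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = x_squared / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    log_right_tail = -half + math.log(total)  # natural log, avoids underflow
    return x_squared, -log_right_tail / math.log(10)

# A single peak with -log10(p) = 10.495, as in the sample row
x2, combined = fisher_combined([10.495])
print(round(x2, 2), round(combined, 3))
```

With a single peak (k = 1) this gives X^2 of about 48.33 and a combined value of 10.495, which agrees with columns 6 and 7 of the sample row, as one would expect for a peak with no overlapping partners.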

EllieDuan commented 5 years ago

Great! Thank you for this information!! Very helpful!

EllieDuan commented 5 years ago

Sorry to bother you again.

I just realized these may not be the enrichment scores I'm looking for. After the MACS2 call, I piped the -log10(p-value) from the narrowPeak files of all replicates directly into MSPC to generate ConsensusPeaks.bed.

I'm wondering how I should obtain the log2FC of IP/input after MSPC as the enrichment score. Or do you suggest using these peaks only as a template and then calculating FPKM downstream?

I guess I'm still not sure whether a statistical score (column $6 or $7) can be used to plot protein binding at the TSSs of gene list 1 compared to gene list 2.

Thank you!

marziacremona commented 5 years ago

Hi Ellie,

MSPC only combines the peak p-values after the MACS2 call; it doesn't directly use the IP or input. Hence the MSPC output doesn't contain log2FC, only the p-value of the combined peaks (column $7).

Can you explain a bit more about the analysis you are performing? Why do you need to look at log2FC? To compare protein binding on two lists of TSSs, people often look only at the number of peaks that overlap the TSSs, or at their position with respect to the TSSs. In this case I would suggest considering the ConsensusPeaks file and analyzing the position of these peaks with respect to the two lists of genes.

EllieDuan commented 5 years ago

Hi Marzia, Thank you for the reply and suggestions!

I would just like to know whether protein binding is stronger in one of the two lists, e.g. with a box plot of the enrichment scores of the peaks that fall into these two lists.

I would also like to compare the enrichment of two proteins at the same gene list (we also did a differential analysis of these two IPs using csaw).

I thought of plotting this because the plot of peak count frequency and the plot of normalized read counts (this one needs input log2 normalization) give us opposite results:

One protein showed the most peak counts at gene TSSs but the lowest normalized read counts; the other protein is the opposite.

So you suggest using peak count frequency to represent the peak enrichment?

I know some tools like deepTools plot the average profile of normalized read counts; I'm just not sure which one (peak counts or read counts) could represent the protein binding.

Thank you very much!!

marziacremona commented 5 years ago

I would just like to know whether protein binding is stronger in one of the two lists, e.g. with a box plot of the enrichment scores of the peaks that fall into these two lists.

For this type of analysis I suggest using peak counts: plot the peak count at each nucleotide of a window around the TSSs in each list, and compare the two count plots.
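A minimal sketch of such a per-nucleotide peak count, assuming peaks are given as half-open BED-style intervals (for example read from the ConsensusPeaks file) and TSSs as (chromosome, position, strand) tuples. The function name and the `flank` parameter are hypothetical, and the nested loop is meant for illustration rather than speed.

```python
def peak_count_profile(tss_list, peaks, flank=1000):
    """Count, for each offset in [-flank, +flank] around the TSSs,
    how many peaks cover that nucleotide, summed over all TSSs.

    tss_list: iterable of (chrom, position, strand), strand "+" or "-"
    peaks:    iterable of (chrom, start, end), half-open as in BED
    """
    profile = [0] * (2 * flank + 1)
    # Group peaks by chromosome so each TSS only scans its own chromosome
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for chrom, tss, strand in tss_list:
        for start, end in by_chrom.get(chrom, []):
            # Clip the peak to the window around this TSS
            lo = max(start, tss - flank)
            hi = min(end, tss + flank + 1)
            for pos in range(lo, hi):
                offset = pos - tss
                if strand == "-":
                    offset = -offset  # keep upstream on the left for both strands
                profile[offset + flank] += 1
    return profile

# Toy example: one TSS at position 100, one peak covering 95..102
print(peak_count_profile([("chr1", 100, "+")], [("chr1", 95, 103)], flank=5))
```

When the two gene lists differ in size, dividing each profile by the number of TSSs in its list makes the two curves comparable.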

Since you say that peak count frequency and normalized read counts give you different results, one thing you can do is combine the reads of your replicates to compute read counts, using the ConsensusPeaks as peak positions.
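As a rough sketch of that idea, the code below pools reads from several replicates and counts how many overlap each consensus peak. Reads and peaks are plain (chromosome, start, end) tuples here; in practice the counting would normally be done on BAM files with a dedicated tool. The second function is an assumption added to connect this to the log2FC question earlier in the thread: it is one common way to turn IP and input counts into a log2 ratio (counts per million with a pseudocount), not something MSPC provides.

```python
import math

def count_reads_in_peaks(peaks, replicates):
    """Count reads overlapping each peak, pooled over all replicates.

    peaks:      list of (chrom, start, end), half-open
    replicates: list of read lists, each read a (chrom, start, end) tuple
    """
    counts = [0] * len(peaks)
    for reads in replicates:
        for r_chrom, r_start, r_end in reads:
            for i, (p_chrom, p_start, p_end) in enumerate(peaks):
                # Two half-open intervals overlap if each starts before the other ends
                if r_chrom == p_chrom and r_start < p_end and p_start < r_end:
                    counts[i] += 1
    return counts

def log2_enrichment(ip_count, input_count, ip_total, input_total, pseudocount=1.0):
    """log2(IP / input) on counts per million; the pseudocount avoids division by zero."""
    ip_cpm = (ip_count + pseudocount) / ip_total * 1e6
    input_cpm = (input_count + pseudocount) / input_total * 1e6
    return math.log2(ip_cpm / input_cpm)

# Toy example: one consensus peak, two replicates
peaks = [("chr1", 10, 20)]
rep1 = [("chr1", 5, 12), ("chr1", 25, 30)]
rep2 = [("chr1", 19, 40), ("chr2", 10, 20)]
print(count_reads_in_peaks(peaks, [rep1, rep2]))
```

Here `ip_total` and `input_total` stand for the total number of mapped reads in the pooled IP and input libraries, so that libraries of different depth are put on the same scale before taking the ratio.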

just not sure which one (peak counts or read counts) could represent the protein binding

My understanding is that every peak represents a protein binding event (if you exclude blacklist regions). Read counts (and peak shape) are related to the number of cells in the experiment that have the protein bound, and to the way the protein binds the DNA (direct/indirect binding, alone or in multiple copies).

In any case, I wouldn't use Chi-square values or MSPC p-values to measure protein binding strength... to me they are related to how much you can trust that the peak is present in that position.

EllieDuan commented 5 years ago

Great! Thank you very much!