Closed nzx9581 closed 2 years ago
Thanks for using mspc, @nzx9581!
Consensus peaks are constructed based on the true-positive peaks of all the replicates provided in the input. Accordingly, one peak in the consensus set represents 1 to many peaks in the true-positive sets; hence, its p-value does not necessarily correlate to the p-value of a peak in the input. It only correlates, iff it represents a single peak in the true-positive set. Please see the example on this page, it may better explain how these peaks are constructed.
If you are seeing more peaks with inf
in the output than you expect, it could be caused by either or both of the following:
-log10
format? if not, you can adjust the parser to read the peaks w.r.t. to the format of your data. Please check these and let me know if they resolve the issue, otherwise, I'm happy to help you debug the issue.
Thank you for your reply.
I checked my input files according to your suggestion, and finally I found the pvalue(-log10) is in the 8th column of MACS2 .narrowPeak files which I used as the input of MSPC before, however, MSPC only accepts the input files where the 5th column should be pvalue. Now I think I have got the right results. Thank you so much! I will close this issue.
Glad that was useful. As I mentioned above you should be able to adjust mspc to read the p-value from a different column that best matches your data.
hello
i'm using rmspc. and I had the same problem about inf value
.
my input is narrowPeak file from macs2
so the pvalue is column 7. the -log10(p-value) of first peak is 6.29862. Right?
Then I create a MSPC_configuration.json file like:
{ "Chr":0, "Left":1, "Right":2, "Name":3, "Strand": 5, "Value":7, "PValueFormat":1, "Culture":"en-US" }
my code is
results_WT <- mspc(input = WT, replicateType = "Bio", stringencyThreshold = 1e-8, weakThreshold = 1e-4, gamma = 1e-8, keep = TRUE, GRanges = TRUE, multipleIntersections = "Lowest", c = 2, alpha = 0.05, outputPath = output, degreeOfParallelism = 8, inputParserConfiguration = "D:/1Data/genome/MSPC_configuration.json")
But there's still a lot inf value
in my results.
How to solve it? thank you
Could you please share a reproducible example? e.g., a few overlapping peaks across your replicates that, when passed to mspc, it returns inf
. One way to get this would be to start from one peak in the output with p-value inf
, then find all the peaks in your input replicates that overlap that output peak.
Could you please share a reproducible example? e.g., a few overlapping peaks across your replicates that, when passed to mspc, it returns
inf
. One way to get this would be to start from one peak in the output with p-valueinf
, then find all the peaks in your input replicates that overlap that output peak.
Thanks for your reply.
I check my ConsensusPeak.bed file to find an example
For example, consensusPeak file in output like this
chr1 4495964 4497626 MSPC_Peak_54519 inf
and I check the position of this peak, from 4495964 to 4497626. Then I find peaks in input narrowPeak file according to the position.
repeat 1
chr1 4495964 4497620 GSM4871342_peak_5 1111 . 18.8488 113.574 111.144 544
repeat 2
chr1 4495979 4497626 GSM4871344_peak_3 1618 . 23.9207 164.41 161.897 519
repeat 3
chr1 4496010 4497467 GSM4871346_peak_4 921 . 18.4848 94.5751 92.1984 520
repeat 4
chr1 4496010 4497393 GSM4871348_peak_2 1023 . 18.7726 104.766 102.365 509
consensus peaks seem to take the starting point repeat1 and the end point repeat. Do you need more information? By the way, I wonder if this problem is because I'm not taking advantage of GeUtilities. I just created the MSPC_configuration.json file and specified it in code. but in this page, you said MSPC uses GeUtilities to parse BED files. But I don't know how to use GeUtilities Thanks again
Hello, I compared the results using inputParserConfiguration = "D:/1Data/genome/MSPC_configuration.json" or without this parameter. Their results are exactly the same. so I check the json file again. I've confirmed that it uses Unix and UTF8 formats, and it looks fine. Do you know of any possible causes?
Thanks!
For the four peaks you provided, mspc computes the chi-squared for the consensus peak to equal 2198.16
. It then calculates the right-tailed probably of 2198.16
with 8
degrees of freedom (two times the number of samples, as this), which results in 0
right-tailed probability. It then converts the right-tailed probability to Phred scale (i.e., -Log10
format), and returns inf
for the right-tailed probably of 0
in Phred scale (see https://github.com/Genometric/MSPC/issues/131 discuss for details).
You can check these numbers in the *_mspc_peaks.txt
files.
Please double-check if the 8th column contains the p-values. The values in the samples you shared above are ~2.6e-114
, ~3.8e-165
, ~2.6e-95
, and ~1.7e-105
. Based on your experiment, do you expect such significant p-values? if not, maybe your data is in -10 Log10
format (you can set the format as documented here).
I just created the MSPC_configuration.json file and specified it in code. but in this page, you said MSPC uses GeUtilities to parse BED files.
MSPC uses GeUtilities internally, and you do not need to modify anything.
Thank you for your reply.
I get narrowpeaks from macs2. and i confirm the 8th column is -log10(P-value), and the 5th column is -10log10(P-value). i found that when it has inf value, the P value is going to be huge, just like https://github.com/Genometric/MSPC/issues/131. So my narrowPeaks don't have question.
so it looks like MSPC_configuration.json file had worked when i use it.
{ "Chr":0, "Left":1, "Right":2, "Name":3, "Strand": 5, "Value":7, "PValueFormat":1, "Culture":"en-US" }
After reading the discuss (https://github.com/Genometric/MSPC/issues/131 ). I want to know:
1. The inf number means peak repeatability is good, right?
2. Then I would like to know, when I use this bed file for subsequent analysis, such as looking for differential peaks analysis, motif and footprint. inf values, whether it will make a difference (I don't know, because the letters may be wrong in the calculation). Or it may not be used at all in subsequent analyses (as I mentioned above).
3. which i should use between ConsensusPeaks.bed and ConsensusPeaks_mspc_peaks.txt (it seems to have only a few more columns). Can you give me a link to the intermediate output file about explain them?
Thanks again for your great job !!!
i confirm the 8th column is -log10(P-value), and the 5th column is -10log10(P-value)
You might be better off using the 5th column IMO.
- The inf number means peak repeatability is good, right?
It means the p-value is 0.0
and cannot be written to the output in the Phred scale (i.e., -Log 10
). I think it would be safe to say p-value 0
is considered very significant.
- Then I would like to know, when I use this bed file for subsequent analysis, such as looking for differential peaks analysis, motif and footprint. inf values, whether it will make a difference (I don't know, because the letters may be wrong in the calculation). Or it may not be used at all in subsequent analyses (as I mentioned above).
I think most tools in your downstream analysis will either drop peaks with inf
p-value or will require that you manually remove them.
If you have many peaks with an inf
p-value, I recommend you double-check that the data you are using is the correct data, and in the correct format (e.g., -Log10
vs. -10 Log10
).
- which i should use between ConsensusPeaks.bed and ConsensusPeaks_mspc_peaks.txt (it seems to have only a few more columns). Can you give me a link to the intermediate output file about explain them?
ConsensusPeaks_mspc_peaks.txt
contains all the information in the ConsensusPeaks.bed
plus some additional information (e.g., chi-squared
and the right-tailed probability
, and Benjamini-Hochberg corrected p-value). These are separated because ConsensusPeaks.bed
is in standard BED format.
Dear developer,
Thank you for providing this brilliant package to us. I have a question about p-value in
ConsensusPeaks.bed
file. The fifth column is -log(pvalue), and I got manyinf
value in this column, which means the peaks are extremely significant? (If I am right). I show these values to my collaborator, and they want the original p-value of these peaks. And I'd like to ask is it possible? Thank you in advance, and have a good day.