Is it possible to get original pvalue in consensus peak file?

nzx9581 commented 2 years ago

Dear developer,

Thank you for providing this brilliant package to us. I have a question about p-value in ConsensusPeaks.bed file. The fifth column is -log(pvalue), and I got many inf value in this column, which means the peaks are extremely significant? (If I am right). I show these values to my collaborator, and they want the original p-value of these peaks. And I'd like to ask is it possible? Thank you in advance, and have a good day.

VJalili commented 2 years ago

Thanks for using mspc, @nzx9581!

Consensus peaks are constructed based on the true-positive peaks of all the replicates provided in the input. Accordingly, one peak in the consensus set represents 1 to many peaks in the true-positive sets; hence, its p-value does not necessarily correlate to the p-value of a peak in the input. It only correlates, iff it represents a single peak in the true-positive set. Please see the example on this page, it may better explain how these peaks are constructed.

If you are seeing more peaks with inf in the output than you expect, it could be caused by either or both of the following:

Are you sure your input data is in the default format MSPC expects? where the p-value is on the 5-th column. If your inputs are not in this format, you can adjust MPSC to read your data properly using parser configuration;
Are you sure your p-values are in the -log10 format? if not, you can adjust the parser to read the peaks w.r.t. to the format of your data.

Please check these and let me know if they resolve the issue, otherwise, I'm happy to help you debug the issue.

nzx9581 commented 2 years ago

Thank you for your reply.

I checked my input files according to your suggestion, and finally I found the pvalue(-log10) is in the 8th column of MACS2 .narrowPeak files which I used as the input of MSPC before, however, MSPC only accepts the input files where the 5th column should be pvalue. Now I think I have got the right results. Thank you so much! I will close this issue.

VJalili commented 2 years ago

Glad that was useful. As I mentioned above you should be able to adjust mspc to read the p-value from a different column that best matches your data.

li1311139481 commented 1 year ago

hello i'm using rmspc. and I had the same problem about inf value .

my input is narrowPeak file from macs2

so the pvalue is column 7. the -log10(p-value) of first peak is 6.29862. Right?

Then I create a MSPC_configuration.json file like: { "Chr":0, "Left":1, "Right":2, "Name":3, "Strand": 5, "Value":7, "PValueFormat":1, "Culture":"en-US" } my code is results_WT <- mspc(input = WT, replicateType = "Bio", stringencyThreshold = 1e-8, weakThreshold = 1e-4, gamma = 1e-8, keep = TRUE, GRanges = TRUE, multipleIntersections = "Lowest", c = 2, alpha = 0.05, outputPath = output, degreeOfParallelism = 8, inputParserConfiguration = "D:/1Data/genome/MSPC_configuration.json")

But there's still a lot inf value in my results.
How to solve it? thank you

VJalili commented 1 year ago

Could you please share a reproducible example? e.g., a few overlapping peaks across your replicates that, when passed to mspc, it returns inf. One way to get this would be to start from one peak in the output with p-value inf, then find all the peaks in your input replicates that overlap that output peak.

li1311139481 commented 1 year ago

Could you please share a reproducible example? e.g., a few overlapping peaks across your replicates that, when passed to mspc, it returns inf. One way to get this would be to start from one peak in the output with p-value inf, then find all the peaks in your input replicates that overlap that output peak.

Thanks for your reply. I check my ConsensusPeak.bed file to find an example For example, consensusPeak file in output like this chr1 4495964 4497626 MSPC_Peak_54519 inf and I check the position of this peak, from 4495964 to 4497626. Then I find peaks in input narrowPeak file according to the position. repeat 1 chr1 4495964 4497620 GSM4871342_peak_5 1111 . 18.8488 113.574 111.144 544 repeat 2 chr1 4495979 4497626 GSM4871344_peak_3 1618 . 23.9207 164.41 161.897 519 repeat 3 chr1 4496010 4497467 GSM4871346_peak_4 921 . 18.4848 94.5751 92.1984 520 repeat 4 chr1 4496010 4497393 GSM4871348_peak_2 1023 . 18.7726 104.766 102.365 509

consensus peaks seem to take the starting point repeat1 and the end point repeat. Do you need more information? By the way, I wonder if this problem is because I'm not taking advantage of GeUtilities. I just created the MSPC_configuration.json file and specified it in code. but in this page, you said MSPC uses GeUtilities to parse BED files. But I don't know how to use GeUtilities Thanks again

li1311139481 commented 1 year ago

Hello, I compared the results using inputParserConfiguration = "D:/1Data/genome/MSPC_configuration.json" or without this parameter. Their results are exactly the same. so I check the json file again. I've confirmed that it uses Unix and UTF8 formats, and it looks fine. Do you know of any possible causes?

VJalili commented 1 year ago

Thanks!

For the four peaks you provided, mspc computes the chi-squared for the consensus peak to equal 2198.16. It then calculates the right-tailed probably of 2198.16 with 8 degrees of freedom (two times the number of samples, as this), which results in 0 right-tailed probability. It then converts the right-tailed probability to Phred scale (i.e., -Log10 format), and returns inf for the right-tailed probably of 0 in Phred scale (see https://github.com/Genometric/MSPC/issues/131 discuss for details).

You can check these numbers in the *_mspc_peaks.txt files.

Please double-check if the 8th column contains the p-values. The values in the samples you shared above are ~2.6e-114, ~3.8e-165, ~2.6e-95, and ~1.7e-105. Based on your experiment, do you expect such significant p-values? if not, maybe your data is in -10 Log10 format (you can set the format as documented here).

I just created the MSPC_configuration.json file and specified it in code. but in this page, you said MSPC uses GeUtilities to parse BED files.

MSPC uses GeUtilities internally, and you do not need to modify anything.

li1311139481 commented 1 year ago

Thank you for your reply. I get narrowpeaks from macs2. and i confirm the 8th column is -log10(P-value), and the 5th column is -10log10(P-value). i found that when it has inf value, the P value is going to be huge, just like https://github.com/Genometric/MSPC/issues/131. So my narrowPeaks don't have question. so it looks like MSPC_configuration.json file had worked when i use it. { "Chr":0, "Left":1, "Right":2, "Name":3, "Strand": 5, "Value":7, "PValueFormat":1, "Culture":"en-US" } After reading the discuss (https://github.com/Genometric/MSPC/issues/131 ). I want to know: 1. The inf number means peak repeatability is good, right? 2. Then I would like to know, when I use this bed file for subsequent analysis, such as looking for differential peaks analysis, motif and footprint. inf values, whether it will make a difference (I don't know, because the letters may be wrong in the calculation). Or it may not be used at all in subsequent analyses (as I mentioned above). 3. which i should use between ConsensusPeaks.bed and ConsensusPeaks_mspc_peaks.txt (it seems to have only a few more columns). Can you give me a link to the intermediate output file about explain them?

Thanks again for your great job !!!

VJalili commented 1 year ago

i confirm the 8th column is -log10(P-value), and the 5th column is -10log10(P-value)

You might be better off using the 5th column IMO.

The inf number means peak repeatability is good, right?

It means the p-value is 0.0 and cannot be written to the output in the Phred scale (i.e., -Log 10). I think it would be safe to say p-value 0 is considered very significant.

Then I would like to know, when I use this bed file for subsequent analysis, such as looking for differential peaks analysis, motif and footprint. inf values, whether it will make a difference (I don't know, because the letters may be wrong in the calculation). Or it may not be used at all in subsequent analyses (as I mentioned above).

I think most tools in your downstream analysis will either drop peaks with inf p-value or will require that you manually remove them.

If you have many peaks with an inf p-value, I recommend you double-check that the data you are using is the correct data, and in the correct format (e.g., -Log10 vs. -10 Log10).

which i should use between ConsensusPeaks.bed and ConsensusPeaks_mspc_peaks.txt (it seems to have only a few more columns). Can you give me a link to the intermediate output file about explain them?

ConsensusPeaks_mspc_peaks.txt contains all the information in the ConsensusPeaks.bed plus some additional information (e.g., chi-squared and the right-tailed probability, and Benjamini-Hochberg corrected p-value). These are separated because ConsensusPeaks.bed is in standard BED format.

Genometric / MSPC

Is it possible to get original pvalue in consensus peak file? #178