Intriguing behavior of plotCorrelation with mixture of single-end and paired-end datasets

AlexBlais74 commented 4 years ago

Hello, I am using deepTools 3.3.2 with Python 3.7.0.

I noticed something intriguing when using plotCorrelation (after multiBigwigSummary), specifically when the bigWigs are a mixture of single-end and paired-end sequencing datasets: the dots in the scatter plots vary in size. See below; the datasets marked "S" are single-end, those marked "P" are paired-end.

You can see that the size of the dots varies depending on whether the data are single-end or paired-end. If I run multiBigwigSummary with just the paired-end data, and then plotCorrelation using the exact same plotting parameters, I get the following (note that the correlation coefficient has changed very slightly):

With only the single-end data, I get this:

The command used to generate all the bigWigs included the following options:

--normalizeUsing CPM \
--binSize 10 \
--extendReads 150 \

Note that the estimation of insert sizes for the paired-end sets gave a peak near 150 bp (with SSP) and a median of 160 bp (with deepTools). I should specify that all 4 datasets are from ChIP-seq for a histone mark on MNase-treated chromatin.

The command for multiBigwigSummary has the following parameters:

multiBigwigSummary bins \
--binSize 2500 \
--distanceBetweenBins 0 \

The command for plotCorrelation has the following options:

--corMethod pearson \
--whatToPlot scatterplot \
--removeOutliers \
--xRange 0 1 \
--yRange 0 1 \

Inspection of the bigwig files on the IGV browser shows nothing abnormal at first glance:

I would be interested in knowing what is going on here. Am I making a mistake by comparing single- and paired-end data? Otherwise, perhaps there is a way for me to fix this issue?

Thanks for your help.

Alex

LeilyR commented 4 years ago

Hi, I am wondering if this is a result of having fewer non-zero bins in your paired-end data. Could you please check the raw output of multiBigwigSummary and see if there are more NaN or zero values in your paired-end data?
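A minimal sketch of such a check, assuming the raw counts were written with --outRawCounts as a tab-separated table whose first three columns are chr, start and end, followed by one column per bigWig, with a single header line. The function name and the header handling are illustrative assumptions, not part of deepTools.

```python
import csv
import math
from collections import Counter


def count_zero_and_nan(lines):
    """Count exact zeros and NaNs per sample column.

    Assumed layout: one header line, then tab-separated rows of
    chr, start, end, followed by one value per sample.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    # Strip the quoting and leading '#' that the header may carry (assumption).
    samples = [name.strip("#'") for name in header[3:]]
    zeros, nans = Counter(), Counter()
    for row in reader:
        for name, field in zip(samples, row[3:]):
            value = float(field)  # float() also parses the string "nan"
            if math.isnan(value):
                nans[name] += 1
            elif value == 0.0:
                zeros[name] += 1
    # Map each sample to a (zero_count, nan_count) pair.
    return {name: (zeros[name], nans[name]) for name in samples}
```

The function accepts any iterable of lines, so an open file handle for the raw counts table can be passed directly.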

AlexBlais74 commented 4 years ago

Hi, thanks for your reply. Here are some numbers from examining the counts table:

single_1 number of '0.0': 97792
single_2 number of '0.0': 96358
paired_1 number of '0.0': 81268
paired_2 number of '0.0': 82345

Rows with 0.0 in both single_1 and paired_1: 79050
Rows with 0.0 in single_1 but NOT in paired_1: 18742
Rows with 0.0 in paired_1 but NOT in single_1: 2218
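These pairwise counts are consistent with the per-sample totals: 79050 + 18742 = 97792 zeros for single_1, and 79050 + 2218 = 81268 zeros for paired_1. A rough sketch of how such a pairwise tally could be computed from two columns of the counts table is shown here; the function name and the choice to skip NaN bins are assumptions made for illustration.

```python
import math


def zero_overlap(values_a, values_b):
    """Tally bins by which of two samples has an exact zero."""
    counts = {"both_zero": 0, "only_a_zero": 0, "only_b_zero": 0}
    for a, b in zip(values_a, values_b):
        if math.isnan(a) or math.isnan(b):
            continue  # skip bins that have no data in at least one sample
        if a == 0.0 and b == 0.0:
            counts["both_zero"] += 1
        elif a == 0.0:
            counts["only_a_zero"] += 1
        elif b == 0.0:
            counts["only_b_zero"] += 1
    return counts
```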

There are 1404 rows with "nan" values and they have that value in all 4 samples (most likely from blacklisting when the bw files were created?).

EDIT: total of 1090323 rows in the file.

Alex

LeilyR commented 4 years ago

I see, this is not what I expected. But from your figures, it really seems that you have fewer bins to plot for the paired-end data, which makes me think that there might be neighbouring bins of the same value which got merged in the paired-end data. Could that be the case? I don't think it has to do with plotting your paired-end and single-end data together: I can see the difference between the single-end and paired-end data even when they are plotted separately.

AlexBlais74 commented 4 years ago

Hello, you wrote "there might be neighbouring bins of the same value which got merged on the paired data, could that be the case?" Could you suggest a way for me to check that?
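One possible way to look into this, sketched under assumptions: read the stored intervals of each bigWig as (start, end, value) tuples (for example with pyBigWig's intervals() method, assumed to be available here) and count how many of them are wider than the 10 bp bin size used when the files were created. The helper below is hypothetical and only summarizes interval widths; it shows whether merging occurred, not whether merging explains the dot sizes.

```python
def summarize_interval_widths(intervals, bin_size=10):
    """Summarize how many stored intervals span more than one bin.

    intervals: iterable of (start, end, value) tuples.
    bin_size: bin size used when the bigWig was created (10 in this thread).
    """
    total = 0
    wider = 0
    bins_covered = 0
    for start, end, _value in intervals:
        width = end - start
        total += 1
        bins_covered += -(-width // bin_size)  # ceiling division
        if width > bin_size:
            wider += 1  # interval spans more than one original bin
    return {
        "intervals": total,
        "wider_than_bin": wider,
        "bins_covered": bins_covered,
    }
```

Comparing the share of intervals wider than one bin between the single-end and paired-end files would indicate whether merging is more frequent in one group.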

deeptools / deepTools, issue #954