deeptools / deepTools

Tools to process and analyze deep sequencing data.
Other
683 stars 211 forks source link

--averageTypeBins issues #443

Closed annaquaglieri16 closed 7 years ago

annaquaglieri16 commented 8 years ago

Hello,

I am having troubles with the --averageTypeBins in the computeMatrix function. I need to plot the coverage of bigWig files over two different sets of regions +- 2kb by aggregating the coverage at every position using the median across regions.
I use --averageTypeBins median but the output is always plotting the mean at every position. I am using the following code:

Near Far - Median

computeMatrix scale-regions \ -S ${inputdir}/merged.H3K14ac.CN.gc_scaled.bigWig \ ${inputdir}/merged.HBO1.CN.gc_scaled.bigWig \ -R ${genedir}/IS_over_TSS_CN.bed ${genedir}/IS_not_over_TSS_CN.bed \ --beforeRegionStartLength 2000 \ --afterRegionStartLength 2000 \ --averageTypeBins median \ -p $(nproc) \ -o ${outdir}/matrix_CN_gc_IS_near_far_median.mat.gz

and

Near Far - Mean

computeMatrix scale-regions \ -S ${inputdir}/merged.H3K14ac.CN.gc_scaled.bigWig \ ${inputdir}/merged.HBO1.CN.gc_scaled.bigWig \ -R ${genedir}/IS_over_TSS_CN.bed ${genedir}/IS_not_over_TSS_CN.bed \ --beforeRegionStartLength 2000 \ --afterRegionStartLength 2000 \ --averageTypeBins mean \ -p $(nproc) \ -o ${outdir}/matrix_CN_gc_IS_near_far_median.mat.gz

and the output looks the same:

MEAN

profile_cn_gc_is_near_far_mean

MEDIAN

profile_cn_gc_is_near_far_median

I am pretty sure that it is plotting the mean because when I use the multiBigwigSummary with the same input bigWig files I obtain the following statistics (where the mean of the far TSS group is around 40):

label   median            max            mean
(chr)    (dbl)       (dbl)    (dbl)

1 FarTSS 1.323529 394945.5384 37.85658 2 NearTSS 1.024173 329.2038 1.32402

Am I using the argument in the wrong way? I also tried to use the argument --binSize 10 in the computeMatrix function even though the data have already been binned into 100 bp non overlapping bins.

Thanks for your help.

dpryan79 commented 8 years ago

Can you post the bigWig and BED files somewhere so I can use them for testing? You could upload them to the deeptools Galaxy server and let me know the email address you used (I'll be able to find the files then).

annaquaglieri16 commented 8 years ago

Do you mean that I should upload the bigwig file here? (screenshot below). I have never use this before and they are not my data and they are not published. I am not sure I can upload them, I need to wait for permission. However, it is a simple bigWig file created with the following code:

bamCoverage \ --bam ${inputdir}/merged.H3K14ac.CN.gc.bam \ --outFileName ${outdir}/merged.H3K14ac.CN.gc_scaled.bigWig \ --outFileFormat bigwig \ --binSize 100 \ --numberOfProcessors $(nproc) \ --normalizeTo1x 2305000000

and this is the format of the .bed files containing the regions across which I need the ChIP-Seq profiles. Does it look in the correct format?

chr1 3670635 3670729 IS2 Other

chr1 3670747 3671112 IS3 Other

chr1 3671222 3671419 IS4 Other

chr1 3671919 3672158 IS5 Other

chr1 3672167 3672277 IS6 Other

chr1 4784974 4785099 IS15 Other

and the other one like this

chr1 3445624 3445756 IS1

chr1 3921316 3921425 IS7

chr1 4454971 4455230 IS8

chr1 4492199 4492619 IS9

chr1 4492991 4493276 IS10

chr1 4503763 4504045 IS11

I tried a similar thing on some test datasets available on the deepTools Galaxy server. I used H3K9Me3.bigWig and genes.bed and I split the bed file in two. I attached my script and I got slightly different results, as expected, using mean,median,max (figures starting with test_genes* attached here). Therefore, I think it should be fine but when I try it with my data, same script, I get no difference in the median,max,mean plots even though the counts on my features are way more skewed that the ones in the test dataset that I used. Do you think that it could be a wrong specification of the bed file?

----- Original Message -----

From: "Devon Ryan" < notifications@github.com > To: "fidelram/deepTools" < deepTools@noreply.github.com > Cc: "Anna Quaglieri" < quaglieri.a@wehi.edu.au >, "Author" < author@noreply.github.com > Sent: Tuesday , 8 November, 2016 6:16:47 PM Subject: Re: [fidelram/deepTools] --averageTypeBins issues (#443)

Can you post the bigWig and BED files somewhere so I can use them for testing? You could upload them to the deeptools Galaxy server and let me know the email address you used (I'll be able to find the files then).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub , or mute the thread .

Anna Quaglieri PhD student Division of Bioinformatics - Speed lab

Walter & Eliza Hall Institute of Medical Research 1G Royal Parade, Parkville VIC 3052, Australia

Contact: +614 6892 5003

---I'm an Athena SWAN Advocate. I call out sexism and promote equality.---


The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender.


dpryan79 commented 8 years ago

Yes, please upload the bigWig file somewhere so I can use it for debugging. I don't see anything wrong in the BED file or your bamCoverage command.

fidelram commented 8 years ago

The line where the average is computed is this:

https://github.com/fidelram/deepTools/blob/master/deeptools/heatmapper_utilities.py#L64

and it seems that the arguments are being passed correctly to the function.

Could it be that for that set of data the mean is quite similar to the median. Looking carefully at the two plots I can see some tiny differences.

Can you compare the output of --outFileNameData for the mean an median cases?

annaquaglieri16 commented 8 years ago

Hi Fidel,

thanks for your time. I attached here a really brief summary to show why I am puzzled. I included a plot comparing max-median-mean obtained from the --outFileNameData (also attached here). I see that there is a slight difference between the tree curves, but I am really puzzled that the difference is so tiny. The reason why I am puzzled is that when I use the multiBigwigSummary function to summarise the counts on all the features I obtained the distribution in Figure 1 in the word document.

I had a look at the code. Is there any way to output the 'ma' matrix before it is summarised.

Thanks for your help, Anna

----- Original Message -----

From: "Fidel Ramirez" notifications@github.com To: "fidelram/deepTools" deepTools@noreply.github.com Cc: "Anna Quaglieri" quaglieri.a@wehi.edu.au, "Author" author@noreply.github.com Sent: Thursday, 10 November, 2016 1:21:33 AM Subject: Re: [fidelram/deepTools] --averageTypeBins issues (#443)

The line where the average is computed is this:

https://github.com/fidelram/deepTools/blob/master/deeptools/heatmapper_utilities.py#L64

and it seems that the arguments are being passed correctly to the function.

Could it be that for that set of data the mean is quite similar to the median. Looking carefully at the two plots I can see some tiny differences.

Can you compare the output of --outFileNameData for the mean an median cases?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub , or mute the thread .

Anna Quaglieri PhD student Division of Bioinformatics - Speed lab

Walter & Eliza Hall Institute of Medical Research 1G Royal Parade, Parkville VIC 3052, Australia

Contact: +614 6892 5003

---I'm an Athena SWAN Advocate. I call out sexism and promote equality.---


The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender.


fidelram commented 8 years ago

I don't see the attachment, but you can get the 'ma' matrix using heatmapper option --outFileNameMatrix

annaquaglieri16 commented 8 years ago

Hi, thanks for this. I still struggle to understand where is the issue. I did a test with some test data that I took from the deepTools Galaxy server, so that you could access as well. I downloaded H3K9Me3.bigWig and genes.bed as ranges. I only used a subset of the initial set of genes, but that won't change the conclusions. Here I attached a plot showing:

R

ma_test <- read.table(file.path(dir,"matrix_TEST"),header=F,skip=1)

# apply statistics to the column of ma_test
ma_median_test <- apply(ma_test,2,median)
ma_mean_test <- apply(ma_test,2,mean)
ma_max_test <- apply(ma_test,2,max)

par(mfrow=c(2,1))
plot(as.numeric(meanp[1,3:ncol(meanp)]),as.numeric(meanp[2,3:ncol(meanp)]),type="l",main="deepTools OutfileNameData",ylim=c(0.4,3))
points(as.numeric(medianp[1,3:ncol(medianp)]),as.numeric(medianp[2,3:ncol(medianp)]),type="l",col="red")
points(as.numeric(maxp[1,3:ncol(maxp)]),as.numeric(maxp[2,3:ncol(maxp)]),type="l",col="purple")
legend("topleft",legend=c("Mean","Median","Max"),col=c("black","red","purple"),pch=c(16,16,16))

plot(1:length(ma_median_test),ma_median_test,type="l",main="Test data with R function",ylim=c(0,10))
lines(1:length(ma_mean_test),ma_mean_test,col="red")
lines(1:length(ma_max_test),ma_max_test,col="purple")
legend("topleft",legend=c("Mean","Median","Max"),col=c("black","red","purple"),pch=c(16,16,16))

test_data_compare.pdf

Is there any other steps, other than this summary stats for each column, that is applied to the 'ma' matrix? I am not a python expert but did not see that into the code.

Below is is what I get using the same code for my data. Looking at median does lead to a very different conclusion there are few outliers in those regions.

my_data_compare.pdf

Thanks for your time, Anna

steffenheyne commented 8 years ago

maybe 0 or NA's in the bigwig/matrix handled differently, e.g considered and converted to 0 in one case and removed in the other before computing the median!?

steffen

On 11.11.2016 05:50, Anna Quaglieri wrote:

Hi, thanks for this. I still struggle to understand where is the issue. I did a test with some test data that I took from the deepTools Galaxy server, so that you could access as well. I downloaded H3K9Me3.bigWig and genes.bed as ranges. I only used a subset of the initial set of genes, but that won't change the conclusions. Here I attached a plot showing:

  • (top plot) the profile plots plotted in R using the --outFileNameData in plotProfile for three different output from computeMatrix using mean, median and max
  • (bottom plot) is the same plot but I computed the summary stats with R. I output the matrix with --outFileNameMatrix and in R I ran this simple code:

R ma_test <- read.table(file.path(dir,"matrix_TEST"),header=F,skip=1)

apply statistics to the column of ma_test

ma_median_test <- apply(ma_test,2,median) ma_mean_test <- apply(ma_test,2,mean) ma_max_test <- apply(ma_test,2,max)

par(mfrow=c(2,1)) plot(as.numeric(meanp[1,3:ncol(meanp)]),as.numeric(meanp[2,3:ncol(meanp)]),type="l",main="deepTools OutfileNameData",ylim=c(0.4,3)) points(as.numeric(medianp[1,3:ncol(medianp)]),as.numeric(medianp[2,3:ncol(medianp)]),type="l",col="red") points(as.numeric(maxp[1,3:ncol(maxp)]),as.numeric(maxp[2,3:ncol(maxp)]),type="l",col="purple") legend("topleft",legend=c("Mean","Median","Max"),col=c("black","red","purple"),pch=c(16,16,16))

plot(1:length(ma_median_test),ma_median_test,type="l",main="Test data with R function",ylim=c(0,10)) lines(1:length(ma_mean_test),ma_mean_test,col="red") lines(1:length(ma_max_test),ma_max_test,col="purple") legend("topleft",legend=c("Mean","Median","Max"),col=c("black","red","purple"),pch=c(16,16,16))

test_data_compare.pdf https://github.com/fidelram/deepTools/files/585030/test_data_compare.pdf

Is there any other steps, other than this summary stats for each column, that is applied to the 'ma' matrix? I am not a python expert but did not see that into the code.

Below is is what I get using the same code for my data. Looking at median does lead to a very different conclusion there are few outliers in those regions.

my_data_compare.pdf https://github.com/fidelram/deepTools/files/585031/my_data_compare.pdf

Thanks for your time, Anna

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/fidelram/deepTools/issues/443#issuecomment-259880737, or mute the thread https://github.com/notifications/unsubscribe-auth/AKKL8hiw-q244IKcFVEf16UTOoT08xoaks5q8_QLgaJpZM4KsFqi.

annaquaglieri16 commented 7 years ago

Hi,

I understood what is the issue. If you only specify --averageTypeBins 'median' in the computeMatrix function it does not work properly, you also need to re-specify --averageType 'median' in the plotProfile.

When I specify it only in the plotProfile function I get this

matrix_cn_gc_is_near_far_median

when I specify it in both the computeMatrix matrix and the plotProfile I get this

matrix_cn_gc_is_near_far_median1

They are pretty much the same, only slightly different.

Do you have any comment about why it needs to be specified twice, and what does it change? Does it compute the statistics a second time using another binning?

Thanks for your support, Anna

dpryan79 commented 7 years ago

This would have been more apparent had you used plotHeatmap instead. The output of computeMatrix is a large matrix with columns samples and rows genes/transcripts/features. When you tell computeMatrix to use the median, then for each bin in each row that it outputs, it stores the median value for a given gene/transcript/feature in a given sample. You then additionally wanted the median of all bins in a given column of a each sample in your profile, which necessitates specifying the median again (since otherwise you might have wanted the max or min or mean). That's why you need to specify it twice, there are two levels of summarization.

Anyway, thanks for replying that this is what solved your problem!

annaquaglieri16 commented 7 years ago

You're welcome, thanks for your help! Anna