Closed annaquaglieri16 closed 7 years ago
Can you post the bigWig and BED files somewhere so I can use them for testing? You could upload them to the deeptools Galaxy server and let me know the email address you used (I'll be able to find the files then).
Do you mean that I should upload the bigwig file here? (screenshot below). I have never use this before and they are not my data and they are not published. I am not sure I can upload them, I need to wait for permission. However, it is a simple bigWig file created with the following code:
bamCoverage \ --bam ${inputdir}/merged.H3K14ac.CN.gc.bam \ --outFileName ${outdir}/merged.H3K14ac.CN.gc_scaled.bigWig \ --outFileFormat bigwig \ --binSize 100 \ --numberOfProcessors $(nproc) \ --normalizeTo1x 2305000000
and this is the format of the .bed files containing the regions across which I need the ChIP-Seq profiles. Does it look in the correct format?
chr1 3670635 3670729 IS2 Other
chr1 3670747 3671112 IS3 Other
chr1 3671222 3671419 IS4 Other
chr1 3671919 3672158 IS5 Other
chr1 3672167 3672277 IS6 Other
chr1 4784974 4785099 IS15 Other
and the other one like this
chr1 3445624 3445756 IS1
chr1 3921316 3921425 IS7
chr1 4454971 4455230 IS8
chr1 4492199 4492619 IS9
chr1 4492991 4493276 IS10
chr1 4503763 4504045 IS11
I tried a similar thing on some test datasets available on the deepTools Galaxy server. I used H3K9Me3.bigWig and genes.bed and I split the bed file in two. I attached my script and I got slightly different results, as expected, using mean,median,max (figures starting with test_genes* attached here). Therefore, I think it should be fine but when I try it with my data, same script, I get no difference in the median,max,mean plots even though the counts on my features are way more skewed that the ones in the test dataset that I used. Do you think that it could be a wrong specification of the bed file?
----- Original Message -----
From: "Devon Ryan" < notifications@github.com > To: "fidelram/deepTools" < deepTools@noreply.github.com > Cc: "Anna Quaglieri" < quaglieri.a@wehi.edu.au >, "Author" < author@noreply.github.com > Sent: Tuesday , 8 November, 2016 6:16:47 PM Subject: Re: [fidelram/deepTools] --averageTypeBins issues (#443)
Can you post the bigWig and BED files somewhere so I can use them for testing? You could upload them to the deeptools Galaxy server and let me know the email address you used (I'll be able to find the files then).
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub , or mute the thread .
Anna Quaglieri PhD student Division of Bioinformatics - Speed lab
Walter & Eliza Hall Institute of Medical Research 1G Royal Parade, Parkville VIC 3052, Australia
Contact: +614 6892 5003
---I'm an Athena SWAN Advocate. I call out sexism and promote equality.---
The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender.
Yes, please upload the bigWig file somewhere so I can use it for debugging. I don't see anything wrong in the BED file or your bamCoverage command.
The line where the average is computed is this:
https://github.com/fidelram/deepTools/blob/master/deeptools/heatmapper_utilities.py#L64
and it seems that the arguments are being passed correctly to the function.
Could it be that for that set of data the mean is quite similar to the median. Looking carefully at the two plots I can see some tiny differences.
Can you compare the output of --outFileNameData
for the mean an median cases?
Hi Fidel,
thanks for your time. I attached here a really brief summary to show why I am puzzled. I included a plot comparing max-median-mean obtained from the --outFileNameData (also attached here). I see that there is a slight difference between the tree curves, but I am really puzzled that the difference is so tiny. The reason why I am puzzled is that when I use the multiBigwigSummary function to summarise the counts on all the features I obtained the distribution in Figure 1 in the word document.
I had a look at the code. Is there any way to output the 'ma' matrix before it is summarised.
Thanks for your help, Anna
----- Original Message -----
From: "Fidel Ramirez" notifications@github.com To: "fidelram/deepTools" deepTools@noreply.github.com Cc: "Anna Quaglieri" quaglieri.a@wehi.edu.au, "Author" author@noreply.github.com Sent: Thursday, 10 November, 2016 1:21:33 AM Subject: Re: [fidelram/deepTools] --averageTypeBins issues (#443)
The line where the average is computed is this:
https://github.com/fidelram/deepTools/blob/master/deeptools/heatmapper_utilities.py#L64
and it seems that the arguments are being passed correctly to the function.
Could it be that for that set of data the mean is quite similar to the median. Looking carefully at the two plots I can see some tiny differences.
Can you compare the output of --outFileNameData for the mean an median cases?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub , or mute the thread .
Anna Quaglieri PhD student Division of Bioinformatics - Speed lab
Walter & Eliza Hall Institute of Medical Research 1G Royal Parade, Parkville VIC 3052, Australia
Contact: +614 6892 5003
---I'm an Athena SWAN Advocate. I call out sexism and promote equality.---
The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender.
I don't see the attachment, but you can get the 'ma' matrix using heatmapper option --outFileNameMatrix
Hi, thanks for this. I still struggle to understand where is the issue. I did a test with some test data that I took from the deepTools Galaxy server, so that you could access as well. I downloaded H3K9Me3.bigWig and genes.bed as ranges. I only used a subset of the initial set of genes, but that won't change the conclusions. Here I attached a plot showing:
R
ma_test <- read.table(file.path(dir,"matrix_TEST"),header=F,skip=1)
# apply statistics to the column of ma_test
ma_median_test <- apply(ma_test,2,median)
ma_mean_test <- apply(ma_test,2,mean)
ma_max_test <- apply(ma_test,2,max)
par(mfrow=c(2,1))
plot(as.numeric(meanp[1,3:ncol(meanp)]),as.numeric(meanp[2,3:ncol(meanp)]),type="l",main="deepTools OutfileNameData",ylim=c(0.4,3))
points(as.numeric(medianp[1,3:ncol(medianp)]),as.numeric(medianp[2,3:ncol(medianp)]),type="l",col="red")
points(as.numeric(maxp[1,3:ncol(maxp)]),as.numeric(maxp[2,3:ncol(maxp)]),type="l",col="purple")
legend("topleft",legend=c("Mean","Median","Max"),col=c("black","red","purple"),pch=c(16,16,16))
plot(1:length(ma_median_test),ma_median_test,type="l",main="Test data with R function",ylim=c(0,10))
lines(1:length(ma_mean_test),ma_mean_test,col="red")
lines(1:length(ma_max_test),ma_max_test,col="purple")
legend("topleft",legend=c("Mean","Median","Max"),col=c("black","red","purple"),pch=c(16,16,16))
Is there any other steps, other than this summary stats for each column, that is applied to the 'ma' matrix? I am not a python expert but did not see that into the code.
Below is is what I get using the same code for my data. Looking at median does lead to a very different conclusion there are few outliers in those regions.
Thanks for your time, Anna
maybe 0 or NA's in the bigwig/matrix handled differently, e.g considered and converted to 0 in one case and removed in the other before computing the median!?
steffen
On 11.11.2016 05:50, Anna Quaglieri wrote:
Hi, thanks for this. I still struggle to understand where is the issue. I did a test with some test data that I took from the deepTools Galaxy server, so that you could access as well. I downloaded H3K9Me3.bigWig and genes.bed as ranges. I only used a subset of the initial set of genes, but that won't change the conclusions. Here I attached a plot showing:
- (top plot) the profile plots plotted in R using the --outFileNameData in plotProfile for three different output from computeMatrix using mean, median and max
- (bottom plot) is the same plot but I computed the summary stats with R. I output the matrix with --outFileNameMatrix and in R I ran this simple code:
R ma_test <- read.table(file.path(dir,"matrix_TEST"),header=F,skip=1)
apply statistics to the column of ma_test
ma_median_test <- apply(ma_test,2,median) ma_mean_test <- apply(ma_test,2,mean) ma_max_test <- apply(ma_test,2,max)
par(mfrow=c(2,1)) plot(as.numeric(meanp[1,3:ncol(meanp)]),as.numeric(meanp[2,3:ncol(meanp)]),type="l",main="deepTools OutfileNameData",ylim=c(0.4,3)) points(as.numeric(medianp[1,3:ncol(medianp)]),as.numeric(medianp[2,3:ncol(medianp)]),type="l",col="red") points(as.numeric(maxp[1,3:ncol(maxp)]),as.numeric(maxp[2,3:ncol(maxp)]),type="l",col="purple") legend("topleft",legend=c("Mean","Median","Max"),col=c("black","red","purple"),pch=c(16,16,16))
plot(1:length(ma_median_test),ma_median_test,type="l",main="Test data with R function",ylim=c(0,10)) lines(1:length(ma_mean_test),ma_mean_test,col="red") lines(1:length(ma_max_test),ma_max_test,col="purple") legend("topleft",legend=c("Mean","Median","Max"),col=c("black","red","purple"),pch=c(16,16,16))
test_data_compare.pdf https://github.com/fidelram/deepTools/files/585030/test_data_compare.pdf
Is there any other steps, other than this summary stats for each column, that is applied to the 'ma' matrix? I am not a python expert but did not see that into the code.
Below is is what I get using the same code for my data. Looking at median does lead to a very different conclusion there are few outliers in those regions.
my_data_compare.pdf https://github.com/fidelram/deepTools/files/585031/my_data_compare.pdf
Thanks for your time, Anna
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/fidelram/deepTools/issues/443#issuecomment-259880737, or mute the thread https://github.com/notifications/unsubscribe-auth/AKKL8hiw-q244IKcFVEf16UTOoT08xoaks5q8_QLgaJpZM4KsFqi.
Hi,
I understood what is the issue. If you only specify --averageTypeBins 'median' in the computeMatrix function it does not work properly, you also need to re-specify --averageType 'median' in the plotProfile.
When I specify it only in the plotProfile function I get this
when I specify it in both the computeMatrix matrix and the plotProfile I get this
They are pretty much the same, only slightly different.
Do you have any comment about why it needs to be specified twice, and what does it change? Does it compute the statistics a second time using another binning?
Thanks for your support, Anna
This would have been more apparent had you used plotHeatmap instead. The output of computeMatrix is a large matrix with columns samples and rows genes/transcripts/features. When you tell computeMatrix to use the median, then for each bin in each row that it outputs, it stores the median value for a given gene/transcript/feature in a given sample. You then additionally wanted the median of all bins in a given column of a each sample in your profile, which necessitates specifying the median again (since otherwise you might have wanted the max or min or mean). That's why you need to specify it twice, there are two levels of summarization.
Anyway, thanks for replying that this is what solved your problem!
You're welcome, thanks for your help! Anna
Hello,
I am having troubles with the --averageTypeBins in the computeMatrix function. I need to plot the coverage of bigWig files over two different sets of regions +- 2kb by aggregating the coverage at every position using the median across regions.
I use --averageTypeBins median but the output is always plotting the mean at every position. I am using the following code:
Near Far - Median
computeMatrix scale-regions \ -S ${inputdir}/merged.H3K14ac.CN.gc_scaled.bigWig \ ${inputdir}/merged.HBO1.CN.gc_scaled.bigWig \ -R ${genedir}/IS_over_TSS_CN.bed ${genedir}/IS_not_over_TSS_CN.bed \ --beforeRegionStartLength 2000 \ --afterRegionStartLength 2000 \ --averageTypeBins median \ -p $(nproc) \ -o ${outdir}/matrix_CN_gc_IS_near_far_median.mat.gz
and
Near Far - Mean
computeMatrix scale-regions \ -S ${inputdir}/merged.H3K14ac.CN.gc_scaled.bigWig \ ${inputdir}/merged.HBO1.CN.gc_scaled.bigWig \ -R ${genedir}/IS_over_TSS_CN.bed ${genedir}/IS_not_over_TSS_CN.bed \ --beforeRegionStartLength 2000 \ --afterRegionStartLength 2000 \ --averageTypeBins mean \ -p $(nproc) \ -o ${outdir}/matrix_CN_gc_IS_near_far_median.mat.gz
and the output looks the same:
MEAN
MEDIAN
I am pretty sure that it is plotting the mean because when I use the multiBigwigSummary with the same input bigWig files I obtain the following statistics (where the mean of the far TSS group is around 40):
1 FarTSS 1.323529 394945.5384 37.85658 2 NearTSS 1.024173 329.2038 1.32402
Am I using the argument in the wrong way? I also tried to use the argument --binSize 10 in the computeMatrix function even though the data have already been binned into 100 bp non overlapping bins.
Thanks for your help.