Closed rojasp closed 7 years ago
I don't think there's a way to directly calculate and visualize the TR, unless there's some computeMatrixOperations magic that @dpryan79 may know more about.
If I understand correctly, you need to obtain two values per gene: (1) the promoter coverage and (2) the gene body coverage, and then you will have to calculate the ratio yourself. I haven't really played this through, but if you wanted to stay with deepTools for this, I'd be tempted to use multiBigWigSummary BED-file instead of computeMatrix:
- generate two BED files, one with the promoter regions and one with the gene body regions, e.g. with awk or bedtools
- run multiBigWigSummary BED-file with each of the BED files; this will give you a coverage score for each gene region (make sure you use the option --outRawCounts)
- paste promoterCounts.tab geneBodyCounts.tab | awk '{print $1/$2}' > TRs.tab (there may be some header lines you need to get rid of, and the column numbers will need adjusting to point at the score columns; I haven't actually tested this)
I've done something similar to this, but I've used a mixture of computeMatrix and the deepTools API.
Something along the following lines should work with your computeMatrix output (this is mostly copy-pasted from a Snakefile of mine, so you'll need to change things like input[0]):
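from deeptools import heatmapper
import numpy as np

# input[0] and output[0] come from the Snakefile; replace them with your own paths
hm = heatmapper.heatmapper()
hm.read_matrix_file(input[0])


def doCalc(x, sampleWidth=0):
    # x is one row of the matrix, i.e. all bins of one region (sampleWidth is unused here)
    o = [np.nan, np.nan]
    s1 = 0
    e1 = 12  # end of the promoter-proximal bins (12 bins of 30 bp = 30 bp upstream + 330 bp unscaled 5')
    o[0] = np.nansum(x[s1:])  # total signal over the whole region
    # promoter-proximal signal divided by gene-body signal
    o[1] = np.nansum(x[s1:e1]).astype('float') / np.nansum(x[e1:]).astype('float')
    return o


out = np.apply_along_axis(doCalc, 1, hm.matrix.matrix)

of = open(output[0], "w")
# header: BED6 columns, then a Sum and a Ratio column per sample label
of.write("chrom\tstart\tend\tname\tscore\tstrand")
for label in hm.parameters['sample_labels']:
    of.write("\t{}_Sum\t{}_Ratio".format(label, label))
of.write("\n")
# one line per region: its coordinates followed by the computed values
for reg, val in zip(hm.matrix.regions, out):
    of.write("{}\t{}\t{}\t{}\t{}\t{}\t".format(reg["chrom"], reg["start"], reg["end"], reg["name"], reg["score"], reg["strand"]))
    of.write("\t".join(["{}".format(x) for x in val]))
    of.write("\n")
of.close()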
I used a more complicated version of that to calculate "5' loading" from GROseq data for one of our groups. You'll have to double check that what I wrote is correct, since I had to change it a bit to match what you're doing.
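If the matrix holds several samples side by side, the same sum-based ratio could be computed per sample rather than over the whole row. The following is only a sketch on a plain NumPy array: the function name, the `boundaries` argument (column offsets delimiting each sample) and the `promoter_bins` default are assumptions for illustration, not part of the deepTools API.

```python
import numpy as np


def traveling_ratio_per_sample(matrix, boundaries, promoter_bins=12):
    """Return an (n_regions, n_samples) array of promoter / gene-body ratios.

    matrix: 2D array, one row per region, with the bins of each sample side by side.
    boundaries: column offsets delimiting the samples, e.g. [0, 112, 224] (hypothetical input).
    promoter_bins: number of leading bins per sample that count as promoter-proximal.
    """
    ratios = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        block = matrix[:, start:end]
        promoter = np.nansum(block[:, :promoter_bins], axis=1)
        body = np.nansum(block[:, promoter_bins:], axis=1)
        # avoid division warnings; regions without gene-body signal become NaN
        with np.errstate(divide="ignore", invalid="ignore"):
            ratios.append(np.where(body > 0, promoter / body, np.nan))
    return np.column_stack(ratios)
```

Like doCalc above, this divides summed bins; swapping np.nansum for np.nanmean would give a ratio of average densities instead.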
BTW, what @friedue suggested is another nice option. Essentially you produce two matrices with one of the multi*Summary tools and then divide them in python (they're numpy matrices). Note that you'll need to normalize the gene body matrix yourself.
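A rough sketch of that division step, under stated assumptions: both tables are tab-separated with three coordinate columns followed by one score column per sample, header lines start with '#', and the rows of the two tables refer to the same genes in the same order. The helper names read_scores and ratio_matrix are made up for this example, and the inline data is invented.

```python
import io
import numpy as np


def read_scores(handle):
    """Read a tab-separated table and return only the score columns as floats.

    Assumes three coordinate columns (chrom, start, end) followed by one score
    column per sample, and '#' at the start of header lines. Check this against
    your own files.
    """
    table = np.loadtxt(handle, dtype=str, delimiter="\t", comments="#", ndmin=2)
    return table[:, 3:].astype(float)


def ratio_matrix(promoter, gene_body):
    """Element-wise promoter / gene-body ratio; NaN where the gene-body score is not > 0."""
    out = np.full(promoter.shape, np.nan)
    np.divide(promoter, gene_body, out=out, where=gene_body > 0)
    return out


# Invented example data standing in for the two summary tables
promoter = read_scores(io.StringIO(
    "#chr\tstart\tend\ts1\nchr1\t970\t1300\t4.0\nchr1\t8970\t9300\t1.5\n"))
gene_body = read_scores(io.StringIO(
    "#chr\tstart\tend\ts1\nchr1\t1300\t5000\t2.0\nchr1\t9300\t12000\t0.0\n"))
print(ratio_matrix(promoter, gene_body))
```

Because promoter and gene-body intervals have different coordinates, the rows cannot be matched by position in the genome; it is worth verifying that both tables list the genes in the same order before dividing.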
what do you mean by "normalizing the gene body matrix"?
My presumption is that the "transcribed region" should be normalized to some given length (~2.7kb in the original post) so that different transcripts are comparable. Otherwise you get a bias by transcript length.
but multiBigWigSummary should return the average coverage, not the sum? so that shouldn't be too prone to length-dependent artifacts? haven't given this a whole lot of thought though
multiBigWigSummary will produce the average, multiBamSummary the total number. The former would be fine then.
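To make the length point concrete, here is a small illustration with invented numbers: if only total signal per region is available, dividing by the region length first gives a density, and the ratio of densities no longer depends on how long the gene body is. The function names are hypothetical.

```python
import numpy as np


def density(total_signal, start, end):
    """Average signal per base pair for regions given as BED-style start/end."""
    length = np.asarray(end) - np.asarray(start)
    return np.asarray(total_signal, dtype=float) / length


def traveling_ratio(prom_total, prom_start, prom_end, body_total, body_start, body_end):
    """Promoter density divided by gene-body density, one value per gene."""
    return density(prom_total, prom_start, prom_end) / density(body_total, body_start, body_end)


# Two made-up genes with identical densities but a 10-fold difference in gene-body length
tr = traveling_ratio(
    prom_total=[660, 660], prom_start=[970, 970], prom_end=[1300, 1300],
    body_total=[3000, 30000], body_start=[1300, 1300], body_end=[4300, 31300],
)
raw = np.array([660, 660]) / np.array([3000, 30000])
print(tr)   # ratio of densities
print(raw)  # ratio of totals, which shrinks as the gene gets longer
```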
that's an interesting distinction between the two tools that I was only subconsciously aware of
Hi guys, thank you very much for all these comments. @dpryan79, as I don't have any experience with Snakemake, I'm going to start as @friedue suggests and I will let you know. The idea is to be able to create a plot as follows: [image]
I will take into account all the considerations that you highlighted.
Hi, I'm trying to carry out a Pol II traveling ratio analysis ("The promoter-proximal bin is defined using a fixed window from -30 bp to +300 bp around the annotated start site. The transcribed region (gene body) bin is from +300 bp to the annotated end. The TR is the ratio of Pol II density in the promoter-proximal bin to the Pol II density in the transcribed region bin"). To do that, I normalized my BAM file with
bamCoverage -b file.BAM --normalizeUsingRPKM -o fileRPKM -of bigwig
Then I ran
computeMatrix scale-regions --beforeRegionStartLength 30 --unscaled5prime 330 -m 3000 -bs 30 -R tss -S fileRPKM -bl DACblacklist.bed.gz --skipZeros -o matrix_TR.gz
plotProfile --matrixFile matrix_TR.gz --outFileSortedRegions sort_TR --averageType mean --yAxisLabel coverage --perGroup --outFileName Traveling_ratio --outFileNameData Traveling_ratio.tsv
But I am a bit lost on how to perform the further analysis. Could you give me any tips?
Thanks
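The bin definitions quoted in the question could be turned into BED-style intervals roughly as follows. This is a minimal sketch, assuming 0-based half-open coordinates and treating the gene start (or the gene end on the minus strand) as the TSS; the function name and its parameters are hypothetical, and the exact off-by-one convention should be checked against your annotation.

```python
def tr_windows(start, end, strand, upstream=30, downstream=300):
    """Return (promoter, gene_body) intervals for one gene.

    promoter covers -upstream to +downstream around the TSS;
    gene_body runs from +downstream to the annotated end.
    gene_body is None when the gene is too short to have one.
    """
    if strand == "-":
        tss = end  # transcription runs towards smaller coordinates
        promoter = (max(0, tss - downstream), tss + upstream)
        body = (start, tss - downstream)
    else:
        tss = start
        promoter = (max(0, tss - upstream), tss + downstream)
        body = (tss + downstream, end)
    if body[0] >= body[1]:
        body = None
    return promoter, body


print(tr_windows(1000, 5000, "+"))
print(tr_windows(1000, 5000, "-"))
```

Writing the two interval sets to separate BED files, in the same gene order, would give the inputs for the two-file approach discussed in the thread.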