deeptools / deepTools

Tools to process and analyze deep sequencing data.

question about scale-region --unscaled5prime for traveling ratios #483

Closed rojasp closed 7 years ago

rojasp commented 7 years ago

Hi, I'm trying to carry out a Pol II traveling ratio analysis ("The promoter-proximal bin is defined using a fixed window from −30 bp to +300 bp around the annotated start site. The transcribed region (gene body) bin is from +300 bp to the annotated end. The TR is the ratio of Pol II density in the promoter-proximal bin to the Pol II density in the transcribed region bin"). To do that, I normalized my BAM file with bamCoverage -b file.BAM --normalizeUsingRPKM -o fileRPKM -of bigwig

Then I run computeMatrix scale-regions --beforeRegionStartLength 30 --unscaled5prime 330 -m 3000 -bs 30 -R tss -S fileRPKM -bl DACblacklist.bed.gz --skipZeros -o matrix_TR.gz

plotProfile --matrixFile matrix_TR.gz --outFileSortedRegions sort_TR --averageType mean --yAxisLabel coverage --perGroup --outFileName Traveling_ratio --outFileNameData Traveling_ratio.tsv

But I am a bit lost on how to perform the downstream analysis. Could you give me any tips?

Thanks

friedue commented 7 years ago

I don't think there's a way to directly calculate and visualize the TR, unless there's some computeMatrixOperations magic that @dpryan79 may know more about.

If I understand correctly, you need to obtain two values per gene: (1) the promoter coverage and (2) the gene body coverage - and then you will have to calculate the ratio yourself. I haven't really played this through, but if you wanted to stay with deepTools for this, I'd be tempted to use multiBigWigSummary BED-file instead of computeMatrix.

  1. Generate two BED files, one for the promoter regions and one for the gene bodies. This could be done with either awk or bedtools.
  2. Use multiBigWigSummary BED-file with each BED file; this will give you a coverage score for each gene region (make sure you use the option --outRawCounts).
  3. If the order of the two BED files is the same, you could simply do paste promoterCounts.tab geneBodyCounts.tab | awk '{print $4/$8}' > TRs.tab (the first three columns of each table are chrom/start/end, so the scores are in columns 4 and 8; there may be some header lines you need to get rid of, etc. I haven't actually tested this.)
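The same three steps can be sketched in Python instead of paste/awk, which avoids the column-counting pitfalls. This is a minimal sketch, assuming `promoterCounts.tab` and `geneBodyCounts.tab` are the hypothetical `--outRawCounts` outputs from step 2 (tab-separated, one `#`-prefixed header line, columns chrom/start/end/score, regions in the same order in both files):

```python
import numpy as np

def traveling_ratios(promoter_tab, body_tab):
    """Compute per-region promoter/gene-body ratios from two
    multiBigWigSummary --outRawCounts tables.

    Assumes both tables have columns chrom, start, end, score
    (first sample) and list the regions in the same order.
    """
    # Column index 3 (0-based) holds the first sample's score;
    # lines starting with '#' (the header) are skipped.
    prom = np.loadtxt(promoter_tab, usecols=3, comments="#")
    body = np.loadtxt(body_tab, usecols=3, comments="#")
    assert prom.shape == body.shape, "region lists differ in length"
    # Allow NaN/inf where the gene-body score is zero instead of raising
    with np.errstate(divide="ignore", invalid="ignore"):
        return prom / body
```

The returned array is in the same region order as the input BED files, so it can be pasted back onto either table for plotting.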

dpryan79 commented 7 years ago

I've done something similar to this, but I've used a mixture of computeMatrix and the deepTools API.

Something along the following lines should work with your computeMatrix output (this is mostly copy-pasted from a Snakefile of mine, so you'll need to change things like input[0]):

# Mostly copy-pasted from a Snakefile; replace input[0] with your
# matrix file (e.g., "matrix_TR.gz") and output[0] with an output path.
from deeptools import heatmapper
import numpy as np

hm = heatmapper.heatmapper()
hm.read_matrix_file(input[0])

def doCalc(x, sampleWidth=0):
    # With --beforeRegionStartLength 30, --unscaled5prime 330, and
    # -bs 30, the first 12 bins ((30 + 330) / 30) cover the upstream
    # plus promoter-proximal region; the remaining bins are the
    # scaled gene body.
    o = [np.nan, np.nan]
    s1 = 0
    e1 = 12
    o[0] = np.nansum(x[s1:])  # total signal over the whole region
    # promoter-proximal signal divided by gene-body signal
    o[1] = np.nansum(x[s1:e1]).astype('float') / np.nansum(x[e1:]).astype('float')
    return o

# Apply doCalc to each region (one row of the matrix per region)
out = np.apply_along_axis(doCalc, 1, hm.matrix.matrix)

# Write a BED-like table with a Sum and Ratio column per sample
of = open(output[0], "w")
of.write("chrom\tstart\tend\tname\tscore\tstrand")
for label in hm.parameters['sample_labels']:
    of.write("\t{}_Sum\t{}_Ratio".format(label, label))
of.write("\n")

for reg, val in zip(hm.matrix.regions, out):
    of.write("{}\t{}\t{}\t{}\t{}\t{}\t".format(reg["chrom"], reg["start"], reg["end"], reg["name"], reg["score"], reg["strand"]))
    of.write("\t".join(["{}".format(x) for x in val]))
    of.write("\n")
of.close()

I used a more complicated version of that to calculate "5' loading" from GRO-seq data for one of our groups. You'll have to double-check that what I wrote is correct, since I had to change it a bit to match what you're doing.

dpryan79 commented 7 years ago

BTW, what @friedue suggested is another nice option. Essentially you produce two matrices with one of the multi*Summary tools and then divide them in python (they're numpy matrices). Note that you'll need to normalize the gene body matrix yourself.

friedue commented 7 years ago

what do you mean by "normalizing the gene body matrix"?

dpryan79 commented 7 years ago

My presumption is that the "transcribed region" signal should be normalized to some given length (~2.7 kb in the original post) so that different transcripts are comparable. Otherwise you get a bias from transcript length.
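The normalization being described is just dividing a summed count by the region length to get a per-base-pair density. A minimal sketch (not a deepTools function; the name `density` is made up here):

```python
def density(counts, start, end):
    """Convert a summed coverage count over a region into a
    per-base-pair density, so that gene bodies of different
    lengths become comparable."""
    length = end - start
    if length <= 0:
        raise ValueError("region has non-positive length")
    return counts / length
```

A region from +300 to +3000 with 2700 summed counts and one from +300 to +1000 with 700 summed counts both have a density of 1 count/bp, even though their raw sums differ almost fourfold.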

friedue commented 7 years ago

but multiBigWigSummary should return the average coverage, not the sum? so that shouldn't be too prone to length-dependent artifacts? haven't given this a whole lot of thought though

dpryan79 commented 7 years ago

multiBigWigSummary will produce the average, multiBamSummary the total number. The former would be fine then.

friedue commented 7 years ago

that's an interesting distinction between the two tools that I was only subconsciously aware of

rojasp commented 7 years ago

Hi guys, thank you very much for all these comments. @dpryan79, as I don't have any experience with Snakemake, I'm going to start as @friedue suggests and I will let you know. The idea is to be able to create a plot like the following:

TR.pdf

I will take into account all the considerations that you highlighted.