deeptools / deepTools

Tools to process and analyze deep sequencing data.

question about scale-region --unscaled5prime for traveling ratios #483

Closed rojasp closed 7 years ago

rojasp commented 7 years ago

Hi, I'm trying to carry out a Pol II traveling ratio analysis ("The promoter-proximal bin is defined using a fixed window from −30 bp to +300 bp around the annotated start site. The transcribed region (gene body) bin is from +300 bp to the annotated end. The TR is the ratio of Pol II density in the promoter-proximal bin to the Pol II density in the transcribed region bin"). To do that, I normalized my BAM file with bamCoverage -b file.BAM --normalizeUsingRPKM -o fileRPKM -of bigwig

Then I run computeMatrix scale-regions --beforeRegionStartLength 30 --unscaled5prime 330 -m 3000 -bs 30 -R tss -S fileRPKM -bl DACblacklist.bed.gz --skipZeros -o matrix_TR.gz

plotProfile --matrixFile matrix_TR.gz --outFileSortedRegions sort_TR --averageType mean --yAxisLabel coverage --perGroup --outFileName Traveling_ratio --outFileNameData Traveling_ratio.tsv

But I am a bit lost on how to perform the downstream analysis. Could you give me any tips?

Thanks

friedue commented 7 years ago

I don't think there's a way to directly calculate and visualize the TR, unless there's some computeMatrixOperations magic that @dpryan79 may know more about.

If I understand correctly, you need to obtain two values per gene: (1) the promoter coverage and (2) the gene body coverage - and then you will have to calculate the ratio yourself. I haven't really played this through, but if you wanted to stay with deepTools for this, I'd be tempted to use multiBigWigSummary BED-file instead of computeMatrix.

  1. Generate two BED files, one for the promoter regions and one for the gene bodies. This could be done with either awk or bedtools.
  2. Use multiBigWigSummary BED-file with each BED file; this will give you a coverage score for each gene region (make sure you use the option --outRawCounts).
  3. If the order of the two BED files is the same, you could simply do paste promoterCounts.tab geneBodyCounts.tab | awk '{print $4/$8}' > TRs.tab (the first three columns of each table are chrom/start/end, so the scores are in columns 4 and 8; there may be some header lines you need to get rid of, etc. I haven't actually tested this.)
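The same three steps can be sketched in Python instead of paste/awk, which avoids the column-counting pitfalls. This is a minimal sketch, assuming `promoterCounts.tab` and `geneBodyCounts.tab` are the hypothetical `--outRawCounts` outputs from step 2 (tab-separated, one `#`-prefixed header line, columns chrom/start/end/score, regions in the same order in both files):

```python
import numpy as np

def traveling_ratios(promoter_tab, body_tab):
    """Compute per-region promoter/gene-body ratios from two
    multiBigWigSummary --outRawCounts tables.

    Assumes both tables have columns chrom, start, end, score
    (first sample) and list the regions in the same order.
    """
    # Column index 3 (0-based) holds the first sample's score;
    # lines starting with '#' (the header) are skipped.
    prom = np.loadtxt(promoter_tab, usecols=3, comments="#")
    body = np.loadtxt(body_tab, usecols=3, comments="#")
    assert prom.shape == body.shape, "region lists differ in length"
    # Allow NaN/inf where the gene-body score is zero instead of raising
    with np.errstate(divide="ignore", invalid="ignore"):
        return prom / body
```

The returned array is in the same region order as the input BED files, so it can be pasted back onto either table for plotting.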

dpryan79 commented 7 years ago

I've done something similar to this, but I've used a mixture of computeMatrix and the deepTools API.

Something along the following lines should work with your computeMatrix output (this is mostly copy-pasted from a Snakefile of mine, so you'll need to change things like input[0]):

# Mostly copy-pasted from a Snakefile; replace input[0] with your
# matrix file (e.g., "matrix_TR.gz") and output[0] with an output path.
from deeptools import heatmapper
import numpy as np

hm = heatmapper.heatmapper()
hm.read_matrix_file(input[0])

def doCalc(x, sampleWidth=0):
    # With --beforeRegionStartLength 30, --unscaled5prime 330, and
    # -bs 30, the first 12 bins ((30 + 330) / 30) cover the upstream
    # plus promoter-proximal region; the remaining bins are the
    # scaled gene body.
    o = [np.nan, np.nan]
    s1 = 0
    e1 = 12
    o[0] = np.nansum(x[s1:])  # total signal over the whole region
    # promoter-proximal signal divided by gene-body signal
    o[1] = np.nansum(x[s1:e1]).astype('float') / np.nansum(x[e1:]).astype('float')
    return o

# Apply doCalc to each region (one row of the matrix per region)
out = np.apply_along_axis(doCalc, 1, hm.matrix.matrix)

# Write a BED-like table with a Sum and Ratio column per sample
of = open(output[0], "w")
of.write("chrom\tstart\tend\tname\tscore\tstrand")
for label in hm.parameters['sample_labels']:
    of.write("\t{}_Sum\t{}_Ratio".format(label, label))
of.write("\n")

for reg, val in zip(hm.matrix.regions, out):
    of.write("{}\t{}\t{}\t{}\t{}\t{}\t".format(reg["chrom"], reg["start"], reg["end"], reg["name"], reg["score"], reg["strand"]))
    of.write("\t".join(["{}".format(x) for x in val]))
    of.write("\n")
of.close()

I used a more complicated version of that to calculate "5' loading" from GRO-seq data for one of our groups. You'll have to double-check that what I wrote is correct, since I had to change it a bit to match what you're doing.

dpryan79 commented 7 years ago

BTW, what @friedue suggested is another nice option. Essentially you produce two matrices with one of the multi*Summary tools and then divide them in python (they're numpy matrices). Note that you'll need to normalize the gene body matrix yourself.

friedue commented 7 years ago

what do you mean by "normalizing the gene body matrix"?

dpryan79 commented 7 years ago

My presumption is that the "transcribed region" signal should be normalized to some given length (~2.7 kb in the original post) so that different transcripts are comparable. Otherwise you get a bias from transcript length.
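The normalization being described is just dividing a summed count by the region length to get a per-base-pair density. A minimal sketch (not a deepTools function; the name `density` is made up here):

```python
def density(counts, start, end):
    """Convert a summed coverage count over a region into a
    per-base-pair density, so that gene bodies of different
    lengths become comparable."""
    length = end - start
    if length <= 0:
        raise ValueError("region has non-positive length")
    return counts / length
```

A region from +300 to +3000 with 2700 summed counts and one from +300 to +1000 with 700 summed counts both have a density of 1 count/bp, even though their raw sums differ almost fourfold.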

friedue commented 7 years ago

but multiBigWigSummary should return the average coverage, not the sum? so that shouldn't be too prone to length-dependent artifacts? haven't given this a whole lot of thought though

dpryan79 commented 7 years ago

multiBigWigSummary will produce the average, multiBamSummary the total number. The former would be fine then.

friedue commented 7 years ago

that's an interesting distinction between the two tools that I was only subconsciously aware of

rojasp commented 7 years ago

Hi guys, thank you very much for all these comments. @dpryan79, as I don't have any experience with Snakemake, I'm going to start as @friedue suggests and I will let you know. The idea is to be able to create a plot like the following:

TR.pdf

I will take into account all the considerations that you highlighted.