deeptools / deepTools

Tools to process and analyze deep sequencing data.

computeMatrix with BED instead of bigWig file #406

Closed igordot closed 8 years ago

igordot commented 8 years ago

Is it possible to run computeMatrix with a BED file as the score file (--scoreFileName) instead of a bigWig file? For example, I want to see how my ChIP-seq peaks are distributed around TSSs. It should make the calculation a lot quicker.

friedue commented 8 years ago

Why don't you simply turn your BED file into a bigWig file?

  1. make a four-column bedGraph file from your peaks' BED file by extracting the chromosome, start, end, and score (e.g., -10xlog(p-value)) columns [you could do this on the command line using cut or awk, or in Excel]
  2. use the UCSC tool bedGraphToBigWig
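
A minimal Python sketch of step 1, for anyone who prefers it to cut or awk. The function names and the example peaks are made up for illustration, and it assumes the score sits in the fifth BED column (0-based index 4), which depends on the peak caller. Rows are sorted by chromosome and start because bedGraphToBigWig expects sorted input; overlapping peaks are not handled here and would probably need merging first.

```python
import csv
import io


def bed_to_bedgraph(bed_lines, score_column=4):
    """Return sorted bedGraph rows (chrom, start, end, score) from BED lines.

    score_column is the 0-based index of the score field. 4 is the usual
    BED score column, but that is an assumption about your peak file.
    """
    rows = []
    reader = csv.reader(bed_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip blank lines and track/browser/comment header lines.
        if not fields or fields[0].startswith(("track", "browser", "#")):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        rows.append((chrom, start, end, float(fields[score_column])))
    # Sort by chromosome name, then by start coordinate.
    rows.sort(key=lambda row: (row[0], row[1]))
    return rows


def write_bedgraph(rows, handle):
    """Write rows as tab-separated bedGraph lines to an open text handle."""
    for chrom, start, end, score in rows:
        handle.write(f"{chrom}\t{start}\t{end}\t{score:g}\n")


# Hypothetical peaks, deliberately out of order.
example_bed = [
    "chr2\t500\t800\tpeak3\t12.5\t.\n",
    "chr1\t1000\t1200\tpeak1\t40\t.\n",
    "chr1\t100\t300\tpeak2\t7\t.\n",
]
out = io.StringIO()
write_bedgraph(bed_to_bedgraph(example_bed), out)
print(out.getvalue(), end="")
```

The resulting file can then go through step 2, typically as `bedGraphToBigWig in.bedGraph chrom.sizes out.bw`, where chrom.sizes is a two-column file of chromosome names and lengths for your genome.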

dpryan79 commented 8 years ago

BED files don't lend themselves to random querying, which computeMatrix needs (there are ways around this, but getting everyone to tabix index their files is probably a non-starter). If, for some reason, you don't want to use bedGraphToBigWig, I can show you a few lines of python that will perform the conversion from the BED file (you already have the prerequisite python modules installed, since deepTools uses them as well).

igordot commented 8 years ago

I could convert to bedGraph and then bigWig, but that's two extra steps. Yes, they are simple, but I was hoping there would be a more elegant solution.

Regarding random querying of BED files, is it that big of an issue? Usually BED files are relatively small and can be easily loaded into memory.

dpryan79 commented 8 years ago

Essentially all parts of deepTools rely on random querying of files to work, so getting around that would require a fair bit of effort (and the accompanying maintenance overhead). Having said that, we could presumably use the deeptoolsintervals module to read the whole file in and allow random querying (I think I'm storing the score already). I already added a special "remote wig/bedGraph files on deepBlue" method in version 2.4, so I suppose BED would be doable too. I'll think about this more tomorrow.
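
To illustrate what "read the whole file in and allow random querying" could look like, here is a rough sketch using only the Python standard library. It is not the deeptoolsintervals API and not how deepTools works internally; the class name `InMemoryBed` and its `query` method are hypothetical. It keeps each chromosome's intervals sorted by start and uses binary search to narrow down the candidates for a region.

```python
import bisect
from collections import defaultdict


class InMemoryBed:
    """Hypothetical helper: holds BED intervals in memory for region queries."""

    def __init__(self, intervals):
        # intervals: iterable of (chrom, start, end, score) tuples.
        by_chrom = defaultdict(list)
        for chrom, start, end, score in intervals:
            by_chrom[chrom].append((start, end, score))
        self._intervals = {}
        self._starts = {}
        self._max_len = {}
        for chrom, items in by_chrom.items():
            items.sort()
            self._intervals[chrom] = items
            self._starts[chrom] = [start for start, _, _ in items]
            self._max_len[chrom] = max(end - start for start, end, _ in items)

    def query(self, chrom, start, end):
        """Return intervals overlapping the half-open region [start, end)."""
        items = self._intervals.get(chrom)
        if not items:
            return []
        starts = self._starts[chrom]
        # An overlapping interval starts before `end`. Because no interval is
        # longer than max_len, it must also start after start - max_len.
        lo = bisect.bisect_right(starts, start - self._max_len[chrom])
        hi = bisect.bisect_left(starts, end)
        # Keep only candidates that actually extend past `start`.
        return [item for item in items[lo:hi] if item[1] > start]


# Hypothetical peaks.
peaks = InMemoryBed([
    ("chr1", 100, 300, 7.0),
    ("chr1", 1000, 1200, 40.0),
    ("chr2", 500, 800, 12.5),
])
print(peaks.query("chr1", 250, 1100))
print(peaks.query("chr1", 300, 1000))
```

Note that the sketch only knows about chromosomes that appear in the BED file and has no notion of chromosome lengths.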

igordot commented 8 years ago

Thank you for the prompt feedback. I tried converting BEDs to bigWig. After I added --missingDataAsZero to the computeMatrix step, the resulting TSS profile plots for full genome bigWigs and BED-based bigWigs look very similar.

Although the input bigWigs are now much smaller, processing is not much faster. I guess most of the computation happens at a different stage.

dpryan79 commented 8 years ago

Glad that worked. For what it's worth, the time needed by computeMatrix is a function of the genome size. There's not much of a speed benefit from simply having less data.

igordot commented 8 years ago

Thanks for the clarification!

dpryan79 commented 8 years ago

I just sat down to play around with implementing this and realized that there's no good way to write a "give me a list of chromosomes and their lengths" method. That's a deal breaker given how the rest of deepTools works internally. At the moment this will be classified as "won't implement", though if I come up with an elegant way to incorporate it in the future I will.

igordot commented 8 years ago

I suppose you could use the BED file to get the list of chromosomes (uniques from col1) and lengths (max of col3), but that's a big approximation, so it makes sense not to do that.
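
The approximation described above could be sketched in Python roughly as follows. The function name and example lines are hypothetical. The result is only a lower bound on each chromosome's length, and chromosomes without any peaks are missing entirely, which is why it is a poor substitute for real chromosome sizes.

```python
def approximate_chrom_sizes(bed_lines):
    """Map each chromosome (column 1) to the largest end coordinate (column 3).

    This is a lower bound on the true length, not the real chromosome size.
    """
    sizes = {}
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank, short, and track/browser/comment header lines.
        if len(fields) < 3 or fields[0].startswith(("track", "browser", "#")):
            continue
        chrom, end = fields[0], int(fields[2])
        sizes[chrom] = max(sizes.get(chrom, 0), end)
    return sizes


# Hypothetical peaks.
example_bed = [
    "chr2\t500\t800\tpeak3\t12.5\t.\n",
    "chr1\t1000\t1200\tpeak1\t40\t.\n",
    "chr1\t100\t300\tpeak2\t7\t.\n",
]
print(approximate_chrom_sizes(example_bed))
```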