Document performance considerations?

deeptools / pyBigWig

A python extension for quick access to bigWig and bigBed files

MIT License


Document performance considerations? #125

Open alexlenail opened 2 years ago

alexlenail commented 2 years ago

I'd like to use pyBigWig to collect values at many intervals from many bigwigs, and I'd love to know what's performant.

Is there overhead to opening a bigWig with pyBigWig? That is, what is the runtime difference between:

with pyBigWig.open(bigwig_file) as bw:
    for chrom, start, stop in intervals:
        bw.values(chrom, start, stop)

and

for chrom, start, stop in intervals:
    with pyBigWig.open(bigwig_file) as bw:
        bw.values(chrom, start, stop)

If the former is optimal, is there any advantage to sorting the intervals?
Do you know the relative performance of pyBigWig entries() queries on bigBed files versus tabix queries on gzipped BED files?
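One way to answer the open-overhead and sorting questions for a particular file is to time the two patterns above directly. The following is only a minimal sketch: the helper names (time_open_once, time_open_per_query, compare) are hypothetical and not part of pyBigWig, and the opener is passed in as an argument so that the helpers do not import the library themselves.

```python
import time


def time_open_once(opener, path, intervals):
    """Open the file a single time, then run every query on that handle."""
    t0 = time.perf_counter()
    with opener(path) as bw:
        for chrom, start, stop in intervals:
            bw.values(chrom, start, stop)
    return time.perf_counter() - t0


def time_open_per_query(opener, path, intervals):
    """Reopen the file for every interval, as in the second snippet."""
    t0 = time.perf_counter()
    for chrom, start, stop in intervals:
        with opener(path) as bw:
            bw.values(chrom, start, stop)
    return time.perf_counter() - t0


def compare(opener, path, intervals):
    """Return elapsed seconds for each access pattern.

    The sorted run uses plain tuple ordering (chrom, start, stop).
    """
    intervals = list(intervals)
    return {
        "open_once": time_open_once(opener, path, intervals),
        "open_once_sorted": time_open_once(opener, path, sorted(intervals)),
        "open_per_query": time_open_per_query(opener, path, intervals),
    }
```

With pyBigWig installed, this could be called as compare(pyBigWig.open, "example.bw", intervals), where "example.bw" and intervals are placeholders for your own data. The runs execute one after another, so operating-system file caching may favour the later ones; repeating the measurement in a different order is a reasonable sanity check.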

gokceneraslan commented 2 years ago

I think a vectorized version of bw.values would be much better, e.g.:

bw.values(np.array([chrom]*3), np.array([79250, 86700, 87277]), np.array([80250, 87700, 88277]), numpy=True)

which returns a list of numpy arrays, without iterating over the intervals in a loop. But I guess this is not implemented yet.
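Until something like that exists, the same return shape can be approximated with a thin wrapper around the existing per-interval call. This is a sketch only: values_many is a hypothetical helper, not a pyBigWig function, and it still loops in Python, so it changes the interface rather than the speed. It converts each result with NumPy itself instead of relying on numpy=True, which is only available when pyBigWig was built with NumPy support.

```python
import numpy as np


def values_many(bw, chroms, starts, ends):
    """Return a list with one NumPy array of per-base values per interval."""
    out = []
    for chrom, start, end in zip(chroms, starts, ends):
        # str()/int() convert NumPy scalars to plain Python types first.
        vals = bw.values(str(chrom), int(start), int(end))
        out.append(np.asarray(vals, dtype=float))
    return out
```

Here bw is assumed to be an already-open pyBigWig handle, for example values_many(bw, ["chr1"] * 3, [79250, 86700, 87277], [80250, 87700, 88277]).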

alexlenail commented 2 years ago

@dpryan79 what is the fastest way to get arrays of values from a bigwig file for each of many genomic intervals (i.e. entries in a bed file)?

BradBalderson commented 5 months ago

For others, I found a better solution for the above-described task was to use the bigWigAverageOverBed tool from UCSC.

BradBalderson commented 5 months ago

$ ./bigWigAverageOverBed

bigWigAverageOverBed v2 - Compute average score of big wig over each bed, which may have introns.
usage:
   bigWigAverageOverBed in.bw in.bed out.tab
The output columns are:
   name - name field from bed, which should be unique
   size - size of bed (sum of exon sizes
   covered - # bases within exons covered by bigWig
   sum - sum of values over all bases covered
   mean0 - average over bases with non-covered bases counting as zeroes
   mean - average over just covered bases
Options:
   -stats=stats.ra - Output a collection of overall statistics to stat.ra file
   -bedOut=out.bed - Make output bed that is echo of input bed but with mean column appended
   -sampleAroundCenter=N - Take sample at region N bases wide centered around bed item, rather
                     than the usual sample in the bed item.
   -minMax - include two additional columns containing the min and max observed in the area.
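
For anyone who wants to drive the tool from Python, a minimal sketch follows. It assumes bigWigAverageOverBed is on the PATH and that out.tab is tab-separated with no header row, holding the six columns listed in the usage text above (so without -minMax). The function names are hypothetical.

```python
import csv
import subprocess

# Column order taken from the usage text; assumes -minMax is not used.
COLUMNS = ("name", "size", "covered", "sum", "mean0", "mean")


def run_average_over_bed(bigwig, bed, out_tab, exe="bigWigAverageOverBed"):
    """Run the UCSC tool; raises CalledProcessError on a non-zero exit."""
    subprocess.run([exe, bigwig, bed, out_tab], check=True)


def read_average_over_bed(out_tab):
    """Parse out.tab into a list of dicts keyed by COLUMNS."""
    rows = []
    with open(out_tab, newline="") as fh:
        for rec in csv.reader(fh, delimiter="\t"):
            if not rec:
                continue  # skip blank lines
            row = dict(zip(COLUMNS, rec))
            row["size"] = int(row["size"])
            row["covered"] = int(row["covered"])
            for key in ("sum", "mean0", "mean"):
                row[key] = float(row[key])
            rows.append(row)
    return rows
```

Note that the tool reports summary statistics per BED entry rather than per-base arrays, so it fits the case where a mean or sum per interval is enough.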