Open alexlenail opened 2 years ago
I think a vectorized version of bw.values would be much better, e.g.
bw.values(np.array([chrom]*3), np.array([79250, 86700, 87277]), np.array([80250, 87700, 88277]), numpy=True)
which would return a list of numpy arrays, without iterating over the intervals in a loop. But I guess this is not implemented yet.
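Until something like that exists, the call shape above can be approximated with a small wrapper. This is only a sketch: `values_many` is a hypothetical helper, not part of pyBigWig, and it still loops in Python, so it gives the convenience of the vectorized signature but not the speed. It assumes `bw` is an open pyBigWig handle, or any object with a compatible `values(chrom, start, end, ...)` method.

```python
def values_many(bw, chroms, starts, ends, **kwargs):
    """Return one sequence of per-base values per interval.

    Hypothetical helper, not a pyBigWig function. Extra keyword
    arguments (e.g. numpy=True) are passed through to bw.values().
    """
    if not (len(chroms) == len(starts) == len(ends)):
        raise ValueError("chroms, starts and ends must have the same length")
    return [
        # int() converts numpy integer types to plain Python ints
        bw.values(chrom, int(start), int(end), **kwargs)
        for chrom, start, end in zip(chroms, starts, ends)
    ]


# Usage sketch (file name is a placeholder):
#   import pyBigWig
#   bw = pyBigWig.open("signal.bw")
#   arrays = values_many(bw, [chrom] * 3,
#                        [79250, 86700, 87277],
#                        [80250, 87700, 88277], numpy=True)
```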
@dpryan79 what is the fastest way to get arrays of values from a bigwig file for each of many genomic intervals (i.e. entries in a bed file)?
For others, I found a better solution for the above-described task was to use the bigWigAverageOverBed tool from UCSC.
$ ./bigWigAverageOverBed
bigWigAverageOverBed v2 - Compute average score of big wig over each bed, which may have introns.
usage:
bigWigAverageOverBed in.bw in.bed out.tab
The output columns are:
name - name field from bed, which should be unique
size - size of bed (sum of exon sizes
covered - # bases within exons covered by bigWig
sum - sum of values over all bases covered
mean0 - average over bases with non-covered bases counting as zeroes
mean - average over just covered bases
Options:
-stats=stats.ra - Output a collection of overall statistics to stat.ra file
-bedOut=out.bed - Make output bed that is echo of input bed but with mean column appended
-sampleAroundCenter=N - Take sample at region N bases wide centered around bed item, rather
than the usual sample in the bed item.
-minMax - include two additional columns containing the min and max observed in the area.
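To get the results back into Python, the out.tab file can be parsed with the standard library. A minimal sketch, assuming the file is tab-separated with no header line and has the six columns listed above in that order; `read_average_over_bed` is a name made up for this example. Note that the tool reports per-interval summaries only, not per-base arrays.

```python
import csv


def read_average_over_bed(lines):
    """Parse bigWigAverageOverBed output into a dict keyed by bed name.

    `lines` is any iterable of text lines, e.g. an open file.
    Assumes columns: name, size, covered, sum, mean0, mean.
    Any further columns (such as those added by -minMax) are ignored.
    """
    rows = {}
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        name, size, covered, total, mean0, mean = fields[:6]
        rows[name] = {
            "size": int(size),
            "covered": int(covered),
            "sum": float(total),
            "mean0": float(mean0),
            "mean": float(mean),
        }
    return rows


# Usage sketch:
#   with open("out.tab") as handle:
#       stats = read_average_over_bed(handle)
#   stats["some_name"]["mean0"]
```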
I'd like to use pyBigWig to collect values at many intervals from many bigwigs, and I'd love to know what's performant.
and
If the former is optimal, is there any advantage to the intervals being sorted?
Do you know the relative performance of pyBigWig entries() queries of bigBed files versus tabix queries of gzipped bed files?
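Whether sorting helps is something I can only measure, not assert. A rough sketch for timing it: open each bigWig once, query every interval, and optionally visit the intervals in sorted order. `collect` and `timed` are hypothetical helpers; `open_bw` is expected to be `pyBigWig.open` (or anything returning an object with `values()` and `close()`).

```python
import time


def collect(paths, intervals, open_bw, sort_intervals=True):
    """Query every (chrom, start, end) interval in every file.

    Each file is opened once. Results are returned in the original
    interval order, whatever order the queries were issued in.
    """
    order = list(range(len(intervals)))
    if sort_intervals:
        order.sort(key=lambda i: intervals[i])
    result = {}
    for path in paths:
        bw = open_bw(path)
        try:
            per_interval = [None] * len(intervals)
            for i in order:
                chrom, start, end = intervals[i]
                per_interval[i] = bw.values(chrom, start, end)
            result[path] = per_interval
        finally:
            bw.close()
    return result


def timed(func, *args, **kwargs):
    """Return (result, elapsed seconds) for one call."""
    t0 = time.perf_counter()
    out = func(*args, **kwargs)
    return out, time.perf_counter() - t0


# Usage sketch (paths are placeholders):
#   import pyBigWig
#   _, t_sorted = timed(collect, paths, intervals, pyBigWig.open, True)
#   _, t_unsorted = timed(collect, paths, intervals, pyBigWig.open, False)
```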