IGV fails to read bigWigs generated by pyBigWig

pbenner commented 5 years ago

The following script generates a bigWig file filles with random numbers:

#! /usr/bin/env python3.7        

import pyBigWig
import numpy as np

genome = {
      'chr1':  248956422,
      'chr2':  242193529,
      'chr3':  198295559,
      'chr4':  190214555,
      'chr5':  181538259,
      'chr6':  170805979,
      'chr7':  159345973,
      'chr8':  145138636,
      'chr9':  138394717,
     'chr10':  133797422,
     'chr11':  135086622,
     'chr12':  133275309,
     'chr13':  114364328,
     'chr14':  107043718,
     'chr15':  101991189,
     'chr16':   90338345,
     'chr17':   83257441,
     'chr18':   80373285,
     'chr19':   58617616,
     'chr20':   64444167,
     'chr21':   46709983,
     'chr22':   50818468,
      'chrX':  156040895 }

filename = "test.bw"
binsize  = 200

bw = pyBigWig.open(filename, "w")
bw.addHeader(list(genome.items()), maxZooms=0)

for seqname, length in genome.items():
    bw.addEntries(seqname, 0, values=np.random.rand(length//binsize), span=binsize, step=binsize)
bw.close()

IGV fails to read this bigWig file with the following error message:

Error loading test.bw: Index 10000 out of bounds for length 10000

IGV is able to read the bigWig file if it is first converted to wig and then back to bigWig: $ bigWigToWig test.bw test.wig $ wigToBigWig test.wig genome.txt test2.bw It seems that pyBigWig/libBigWig is generating a slightly different bigWig file than the standard UCSC tools, which IGV cannot handle.

There are also other differences to the standard UCSC/Kent library (since there is no actual standard for this format, I would regard the UCSC implementation as the de facto standard):

pyBigWig requires the user to add entries exactly in the same order as specified in the chromosome list, although there is no technical requirement for this
libBigWig checks nBasesCovered before reading (bwRead.c#L374), which the UCSC library does not

dpryan79 commented 5 years ago

IGV and a few other browsers require zoom levels, which is why they're there by default. I should probably document that in the readme.

dpryan79 commented 5 years ago

BTW, I think UCSC requires entries in the same (in their case alphabetical) order as the header, which I've partially copied. The check of nBasesCovered is meant to by-pass further processing of the file, since, unlike UCSC, it's possible here to programmatically create empty files. This was requested by users who were processing large batches of single-cell data that sometimes had no data but wanted files to be created either way.

pbenner commented 5 years ago

IGV and a few other browsers require zoom levels, which is why they're there by default. I should probably document that in the readme.

Thanks, that's good to know!

BTW, I think UCSC requires entries in the same (in their case alphabetical) order as the header, which I've partially copied.

What I meant was that every chromosome gets an ID (and this must be sorted especially in the index RTree structure), but as far as I know it is possible to write the actual raw data in any order (there is no requirement that dataOffset_i > dataOffset_j if ID_i > ID_j). It would just make the library more user friendly if this restriction was dropped.

dpryan79 commented 5 years ago

I've updated the documentation about zoom levels a bit.

deeptools / pyBigWig

IGV fails to read bigWigs generated by pyBigWig #89