deeptools / pyBigWig

A python extension for quick access to bigWig and bigBed files
MIT License
218 stars 49 forks source link

IGV fails to read bigWigs generated by pyBigWig #89

Closed pbenner closed 5 years ago

pbenner commented 5 years ago

The following script generates a bigWig file filles with random numbers:

#! /usr/bin/env python3.7        

import pyBigWig
import numpy as np

genome = {
      'chr1':  248956422,
      'chr2':  242193529,
      'chr3':  198295559,
      'chr4':  190214555,
      'chr5':  181538259,
      'chr6':  170805979,
      'chr7':  159345973,
      'chr8':  145138636,
      'chr9':  138394717,
     'chr10':  133797422,
     'chr11':  135086622,
     'chr12':  133275309,
     'chr13':  114364328,
     'chr14':  107043718,
     'chr15':  101991189,
     'chr16':   90338345,
     'chr17':   83257441,
     'chr18':   80373285,
     'chr19':   58617616,
     'chr20':   64444167,
     'chr21':   46709983,
     'chr22':   50818468,
      'chrX':  156040895 }

filename = "test.bw"
binsize  = 200

bw = pyBigWig.open(filename, "w")
bw.addHeader(list(genome.items()), maxZooms=0)

for seqname, length in genome.items():
    bw.addEntries(seqname, 0, values=np.random.rand(length//binsize), span=binsize, step=binsize)
bw.close()

IGV fails to read this bigWig file with the following error message:

Error loading test.bw: Index 10000 out of bounds for length 10000

IGV is able to read the bigWig file if it is first converted to wig and then back to bigWig: $ bigWigToWig test.bw test.wig $ wigToBigWig test.wig genome.txt test2.bw It seems that pyBigWig/libBigWig is generating a slightly different bigWig file than the standard UCSC tools, which IGV cannot handle.

There are also other differences to the standard UCSC/Kent library (since there is no actual standard for this format, I would regard the UCSC implementation as the de facto standard):

dpryan79 commented 5 years ago

IGV and a few other browsers require zoom levels, which is why they're there by default. I should probably document that in the readme.

dpryan79 commented 5 years ago

BTW, I think UCSC requires entries in the same (in their case alphabetical) order as the header, which I've partially copied. The check of nBasesCovered is meant to by-pass further processing of the file, since, unlike UCSC, it's possible here to programmatically create empty files. This was requested by users who were processing large batches of single-cell data that sometimes had no data but wanted files to be created either way.

pbenner commented 5 years ago

IGV and a few other browsers require zoom levels, which is why they're there by default. I should probably document that in the readme.

Thanks, that's good to know!

BTW, I think UCSC requires entries in the same (in their case alphabetical) order as the header, which I've partially copied.

What I meant was that every chromosome gets an ID (and this must be sorted especially in the index RTree structure), but as far as I know it is possible to write the actual raw data in any order (there is no requirement that dataOffset_i > dataOffset_j if ID_i > ID_j). It would just make the library more user friendly if this restriction was dropped.

dpryan79 commented 5 years ago

I've updated the documentation about zoom levels a bit.