bxlab / bx-python

Tools for manipulating biological data, particularly multiple sequence alignments
MIT License
149 stars 54 forks source link

Missing data from BigWig files #13

Open Carldeboer opened 7 years ago

Carldeboer commented 7 years ago

I have a program that uses bx-python to extract data from a bigwig file for specific BED loci. For some reason certain chromosomes are sometimes skipped for having no data (BigWigFile.get_as_array( chrom, startPos, endPos ) returns None). Looking at the contents of the bigwig file manually or on IGV indicates that there is in fact data for the "missing" chromosomes: This is an example peak that is being skipped: chr6 157355 157505 chr6:157355-157505 Same region on IGV: image

Looking at this region on the command line:

$ bigWigToBedGraph BWs/B_30.q30.bw  temp.out -chrom=chr6 -start=157355 -end=157505
$ cat temp.out
chr6    157361  157362  2
chr6    157363  157364  2
chr6    157372  157373  1
chr6    157388  157389  1
chr6    157392  157393  1
chr6    157403  157404  1
chr6    157413  157414  1
chr6    157415  157416  1
chr6    157419  157420  1
chr6    157424  157425  1
chr6    157432  157435  1
chr6    157439  157440  1
chr6    157443  157444  1
chr6    157477  157478  1
chr6    157491  157492  1
chr6    157493  157494  2
chr6    157504  157505  1

The following code is enough to reproduce the issue:

import bx
print(bx.__file__)
from bx.bbi.bigwig_file import BigWigFile
inFile1 = BigWigFile(open("BWs/B_30.q30.bw"))
values = inFile1.get_as_array( "chr6", 157355, 157505 )
print(values)

and produces this output:

$HOME/.local/lib/python2.7/site-packages/bx_python-0.8.0-py2.7-linux-x86_64.egg/bx/__init__.pyc
None

Note that None appears to only be returned when a chromosome is not found; when the chromsome is found and there is no data for the region, it returns a list of nans.

I can provide the BW if that would be helpful. I need somewhere to put it (150MB).

krejciadam commented 5 years ago

Just encountered the same issue. I think the problem is with chromosome count: I have a hg38 bigwig file from UCSC. This file contains all the alternative chromosomes like chr4_KI270925v1_alt. bxpython's get_as_array() "sees" only chromosomes 1 to 5 and then 10 to 22. Investigating this a little more carefully, it shows up that if chromosomes are alphabetically ordered, bxpython only "sees" exactly the first 256 chromosomes - it finds everything up to "chr5_GL383532v1_alt" (256th chromosome) and does not see anything from "chr5_GL949742v1_alt" (257th chromosome)