daler / pybedtools

Python wrapper -- and more -- for BEDTools (bioinformatics tools for "genome arithmetic")
http://daler.github.io/pybedtools

`IndexError: list index out of range` with narrowPeak file #365

Closed csestili closed 2 years ago

csestili commented 2 years ago

Hi, I'm having trouble using pybedtools to read from a narrowPeak file.

# example.narrowPeak file is from https://genome.ucsc.edu/FAQ/FAQformat.html#format12

In [16]: ! cat example.narrowPeak
track type=narrowPeak visibility=3 db=hg19 name="nPk" description="ENCODE narrowPeak Example"
browser position chr1:9356000-9365000
chr1    9356548 9356648 .       0       .       182     5.0945  -1  50
chr1    9358722 9358822 .       0       .       91      4.6052  -1  40
chr1    9361082 9361182 .       0       .       182     9.2103  -1  75

In [17]: import pybedtools

In [18]: peaks = pybedtools.BedTool('example.narrowPeak')

In [19]: peaks[0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [19], in <cell line: 1>()
----> 1 peaks[0]

File ~/miniconda3/envs/keras2-tf27/lib/python3.9/site-packages/pybedtools/bedtool.py:1262, in BedTool.__getitem__(self, key)
   1260     return islice(self, key.start, key.stop, key.step)
   1261 elif isinstance(key, int):
-> 1262     return list(islice(self, key, key + 1))[0]
   1263 else:
   1264     raise ValueError(
   1265         "Only slices or integers allowed for indexing " "into a BedTool"
   1266     )

File ~/miniconda3/envs/keras2-tf27/lib/python3.9/site-packages/pybedtools/cbedtools.pyx:793, in pybedtools.cbedtools.IntervalIterator.__next__()

File ~/miniconda3/envs/keras2-tf27/lib/python3.9/site-packages/pybedtools/cbedtools.pyx:657, in pybedtools.cbedtools.create_interval_from_list()

IndexError: list index out of range

I do not get this error when I read the contents of the same file as a string:

In [24]: peaks = pybedtools.BedTool("""
    ...: track type=narrowPeak visibility=3 db=hg19 name="nPk" description="ENCODE narrowPeak Example"
    ...: browser position chr1:9356000-9365000
    ...: chr1    9356548 9356648 .       0       .       182     5.0945  -1  50
    ...: chr1    9358722 9358822 .       0       .       91      4.6052  -1  40
    ...: chr1    9361082 9361182 .       0       .       182     9.2103  -1  75
    ...: """, from_string=True)

In [25]: peaks[0]
Out[25]: Interval(chr1:9356548-9356648)

In [26]: peaks[0].fields
Out[26]: ['chr1', '9356548', '9356648', '.', '0', '.', '182', '5.0945', '-1', '50']

In [27]: peaks[1].fields
Out[27]: ['chr1', '9358722', '9358822', '.', '0', '.', '91', '4.6052', '-1', '40']

So I suppose I could load it from a string as a workaround, but it would be nicer to just pass in a filename. Maybe I'm not calling it correctly?

daler commented 2 years ago

I was only able to reproduce this when the narrowPeak file was space-separated rather than tab-separated. If you convert the spaces to tabs in your example file, it should work. If that does not work, though, please reopen!

(The reason it works as a string is that `from_string=True` is specifically designed for quickly testing small files written inline as strings; it does extra work that is not useful on larger files in practice.)
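
For anyone hitting the same error, the space-to-tab conversion suggested above can be sketched in plain Python. This is only a minimal sketch, not part of pybedtools: the helper names (`is_passthrough`, `to_tab_delimited`, `convert_file`) and the output filename are hypothetical, and it assumes no field in a data line contains embedded whitespace. `track` and `browser` lines, comments, and blank lines are left untouched.

```python
# Hypothetical helpers (not part of pybedtools) for rewriting a
# whitespace-separated narrowPeak/BED file as tab-separated.

HEADER_KEYWORDS = ("track", "browser")


def is_passthrough(line):
    """Return True for lines that should be copied unchanged:
    blank lines, '#' comments, and track/browser header lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return True
    return stripped.split(None, 1)[0] in HEADER_KEYWORDS


def to_tab_delimited(line):
    """Collapse any run of whitespace between fields into a single tab.

    Assumes no field contains embedded whitespace.
    """
    if is_passthrough(line):
        return line
    return "\t".join(line.split()) + "\n"


def convert_file(src_path, dst_path):
    """Write a tab-delimited copy of src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(to_tab_delimited(line))


# Example usage (filenames are placeholders):
# convert_file("example.narrowPeak", "example.tabs.narrowPeak")
```

After converting, passing the new filename to `pybedtools.BedTool(...)` should behave like the `from_string=True` example earlier in the thread, assuming the only problem with the original file was the delimiter.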