daler / pybedtools

Python wrapper -- and more -- for BEDTools (bioinformatics tools for "genome arithmetic")
http://daler.github.io/pybedtools

Len modifying the Bedtools after a filter #400

Closed andreforesight closed 4 months ago

andreforesight commented 7 months ago

I am filtering a BedTool by interval length, but when I call len() on the filtered result more than once, the second call reports it as empty. A small reproducible example:

#!/usr/bin/python3

import pybedtools
print(pybedtools.__file__)

with open("a.bed", "w") as fh:
    fh.write("chr1\t1\t100\nchr2\t1\t1000\n")

a = pybedtools.BedTool("a.bed")
print(len(a))
print(len(a))
a = a.filter(lambda x: x.length > 500)
print(len(a))
print(len(a))

Yields

2
2
1
0

So the first len() call after the filter works, but the second one returns 0. If I add a saveas() call, it works as expected:

#!/usr/bin/python3

import pybedtools
print(pybedtools.__file__)

with open("a.bed", "w") as fh:
    fh.write("chr1\t1\t100\nchr2\t1\t1000\n")

a = pybedtools.BedTool("a.bed")
print(len(a))
print(len(a))
a = a.filter(lambda x: x.length > 500)
a = a.saveas()
print(len(a))
print(len(a))

Yields the expected

2
2
1
1

zoomlion commented 5 months ago

I've run into the same issue. The filter function can give misleading results, which may be because it returns a generator-based BedTool. See the FAQ: https://daler.github.io/pybedtools/FAQs.html#i-m-getting-an-empty-bedtool

daler commented 4 months ago

Right, this is expected behavior, and it can save a lot of I/O and memory. Checking the length consumes the generator, so if you want to keep a reusable copy you need to use .saveas(), as you've done in your example.
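
For readers who have not hit this before, the behavior described above can be illustrated without pybedtools at all. The following is a minimal pure-Python analogy, not pybedtools' actual implementation: the `intervals` list and the `count` helper are hypothetical names invented for this sketch, and a plain list stands in for what .saveas() does by writing results to a file.

```python
# Pure-Python analogy (not pybedtools internals): a lazily filtered
# stream can be counted only once unless it is materialized first.

# Same two intervals as a.bed in the report: (chrom, start, end).
intervals = [("chr1", 1, 100), ("chr2", 1, 1000)]


def count(iterable):
    """Count items by iterating, which exhausts a generator."""
    return sum(1 for _ in iterable)


# Lazy filter: nothing is stored, items are produced on demand.
lazy = (iv for iv in intervals if iv[2] - iv[1] > 500)
first_count = count(lazy)    # iterates over and exhausts the generator
second_count = count(lazy)   # nothing is left to iterate over

# Materialized filter: results are stored, so they can be re-read.
# This plays the role that .saveas() plays in the report above.
saved = [iv for iv in intervals if iv[2] - iv[1] > 500]
saved_first = count(saved)
saved_second = count(saved)

print(first_count, second_count)  # 1 0
print(saved_first, saved_second)  # 1 1
```

The lazy version reproduces the 1 then 0 pattern from the report, and the materialized version gives 1 both times, at the cost of holding (or, with .saveas(), writing out) the filtered results.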