daler / pybedtools

Python wrapper -- and more -- for BEDTools (bioinformatics tools for "genome arithmetic")
http://daler.github.io/pybedtools

pipe blocking #49

Open daler opened 12 years ago

daler commented 12 years ago

For streaming intersections of moderate-sized files (say, >5000 features), the following blocks::

    z = a.intersect(b, stream=True).intersect(c, stream=True)
    len(z)

The schematic below shows what's happening with stdin/stdout and pipes. The above command hangs when trying to write to the stdin of the second process, marked below as ^^^^^^.


    FILE -> stdin-|------------------|-stdout  -> PIPE ->  stdin-|------------------|-stdout -> PIPE -> IntervalIterator
                  | intersectBed (1) |                           | intersectBed (2) |
                  |------------------|-stderr         ^^^^^^     |------------------|-stderr

Despite forcing a flush of the stdout of command (1) and the stdin of command (2) in helpers.call_bedtools, as well as a flush of the stdout of command (2) in the IntervalIterator, this still blocks.

In the Popen call, setting bufsize=1 or bufsize=0 doesn't help. The docs for Popen.communicate() warn against using it when the data size is large, since the data read is buffered in memory.

Various Stack Overflow answers to similar problems suggest running each call in a separate thread. However, in initial tests this made interactive work in IPython a little crazy.
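For reference, the thread-per-call idea looks roughly like the sketch below. It is not pybedtools code: the `ECHO` child process is a stand-in for intersectBed, and `feed` and `stream_through` are hypothetical helper names. Each call gets a background thread that writes to the child's stdin, so the caller is free to keep reading stdout and the pipe never fills up.

```python
import subprocess
import sys
import threading

# Stand-in for a BEDTools program (assumption): a child that copies stdin to stdout.
ECHO = [sys.executable, "-c", "import sys; sys.stdout.writelines(sys.stdin)"]


def feed(lines, pipe):
    """Write every line to the child's stdin, then close it to signal EOF."""
    with pipe:
        for line in lines:
            pipe.write(line)


def stream_through(lines, cmd=ECHO):
    """Yield cmd's output lines while a background thread feeds its stdin."""
    proc = subprocess.Popen(
        cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    writer = threading.Thread(target=feed, args=(lines, proc.stdin), daemon=True)
    writer.start()
    for line in proc.stdout:
        yield line
    proc.stdout.close()
    writer.join()
    proc.wait()


# 20000 BED-like features: far more than fits in a single OS pipe buffer.
features = ("chr1\t%d\t%d\n" % (i, i + 100) for i in range(20000))

# Two chained streaming calls, analogous to a.intersect(...).intersect(...).
chained = stream_through(stream_through(features))
n_features = sum(1 for _ in chained)
```

The cost is one extra thread per chained call, which is presumably where the odd behavior in an interactive session comes from.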

My guess is that workarounds like "rendering" a streaming BedTool to disk will be needed for the near-to-mid-future, since fixes to this will be difficult.
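A generic sketch of the "render to disk" workaround, using only the standard library rather than pybedtools itself: the intermediate result is written to a temporary file, and the next process reads its stdin from that file instead of from a pipe. `ECHO` is a stand-in for intersectBed, and `render` and `run_from_file` are hypothetical names.

```python
import subprocess
import sys
import tempfile

# Stand-in for a BEDTools program (assumption): a child that copies stdin to stdout.
ECHO = [sys.executable, "-c", "import sys; sys.stdout.writelines(sys.stdin)"]


def render(lines):
    """Consume a stream of lines into a temporary file and rewind it."""
    tmp = tempfile.TemporaryFile(mode="w+")
    tmp.writelines(lines)
    tmp.flush()
    tmp.seek(0)
    return tmp


def run_from_file(infile, cmd=ECHO):
    """Run cmd with stdin attached to a real file and yield its output lines."""
    proc = subprocess.Popen(cmd, stdin=infile, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        yield line
    proc.stdout.close()
    proc.wait()
    infile.close()


features = ("chr1\t%d\t%d\n" % (i, i + 100) for i in range(20000))

first = run_from_file(render(features))   # FILE -> process (1) -> PIPE
second = run_from_file(render(first))     # rendered FILE -> process (2) -> PIPE
n_rendered = sum(1 for _ in second)
```

Because process (2) reads from a file, nothing ever has to write into its stdin pipe, so the blocking write marked ^^^^^^ in the schematic cannot occur. The price is disk I/O for every intermediate result.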

daler commented 8 years ago

Try the select module for non-blocking IO, as suggested by John in this Biostars question.
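A minimal sketch of what a select-based approach could look like, assuming a POSIX system (select on pipes does not work on Windows). This is not taken from the Biostars answer or from pybedtools; `ECHO` is a stand-in for intersectBed and `pump` is a hypothetical helper. A single loop asks select which pipe end is ready and only then reads or writes, so neither side can block the other.

```python
import os
import select
import subprocess
import sys

# Stand-in for a BEDTools program (assumption): a child that copies stdin to stdout.
ECHO = [sys.executable, "-c", "import sys; sys.stdout.writelines(sys.stdin)"]

# Writes of at most PIPE_BUF bytes do not block once select reports the pipe writable.
WRITE_CHUNK = getattr(select, "PIPE_BUF", 512)


def pump(data, cmd=ECHO):
    """Send bytes to cmd's stdin and collect its stdout in one select loop."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    in_fd, out_fd = proc.stdin.fileno(), proc.stdout.fileno()
    view, sent, received = memoryview(data), 0, []
    reading = True
    while reading:
        writers = [in_fd] if in_fd is not None else []
        readable, writable, _ = select.select([out_fd], writers, [])
        if writable:
            sent += os.write(in_fd, view[sent:sent + WRITE_CHUNK])
            if sent >= len(data):
                proc.stdin.close()  # all input delivered: signal EOF to the child
                in_fd = None
        if readable:
            block = os.read(out_fd, 65536)
            if block:
                received.append(block)
            else:
                reading = False  # empty read means the child closed its stdout
    proc.stdout.close()
    proc.wait()
    return b"".join(received)


data = b"".join(b"chr1\t%d\t%d\n" % (i, i + 100) for i in range(20000))
echoed = pump(data)
```

This keeps everything in one thread, which may behave better in IPython than the threaded variant, but chaining several processes would need all of their file descriptors in the same select loop.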