Closed: radaniba closed this issue 10 years ago
I guess one can specify the output and flush it from within the program itself when the program ends, but in case they don't, it would be good if pybedtools 'remembered' all the files generated in a given session and cleared them out when the program exits.
Yep, any tempfiles created are automatically cleaned up when the Python interpreter exits. Specifically, the last line in helpers.py registers helpers.cleanup() to be called upon exit.
Within a single session, you can always call pybedtools.cleanup() to get rid of any files created so far in that session.
By default, cleanup() only gets rid of the files in BedTool.TEMPFILES so that if other users on the same filesystem are using pybedtools, their files won't get deleted inadvertently. But if you don't care about that, you can use pybedtools.cleanup(remove_all=True) to get rid of anything matching $TEMPDIR/pybedtools.*.tmp. This could be slow if you have hundreds of thousands of files; see below for a solution to this.
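To illustrate the mechanism described above, here is a minimal standard-library sketch of the "register a cleanup function at exit" pattern. It is not the pybedtools source: the `TEMPFILES` list, `new_tempfile()`, and `cleanup()` below are hypothetical stand-ins that only mirror the roles of BedTool.TEMPFILES and helpers.cleanup().

```python
import atexit
import os
import tempfile

# Hypothetical registry, standing in for the role BedTool.TEMPFILES plays.
TEMPFILES = []


def new_tempfile(prefix="example.", suffix=".tmp"):
    """Create an empty temp file and remember its path for later cleanup."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix)
    os.close(fd)
    TEMPFILES.append(path)
    return path


def cleanup():
    """Delete only the files this session registered; return how many were removed."""
    removed = 0
    while TEMPFILES:
        path = TEMPFILES.pop()
        try:
            os.unlink(path)
            removed += 1
        except FileNotFoundError:
            pass  # already gone, nothing to do
    return removed


# Run cleanup() on a normal interpreter exit. This does not fire if the
# process is killed by a signal it does not handle (e.g. SIGKILL).
atexit.register(cleanup)
```

Because only registered paths are removed, files created by other users or other processes in the same temp directory are left alone, which matches the default behaviour described above.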
If you kill a running Python process that created a lot of tempfiles, cleanup() will never run, and that can cause temp files to accumulate. For example, in the past I've killed a running Python process that was doing a lot of randomizations using multiple processors. This resulted in a LOT (hundreds of thousands) of tempfiles that never got cleaned up from a normal exit. From a terminal, rm /tmp/pybedtools.*.tmp gave an "argument list too long" error. The solution was to use find and xargs, as described here.
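The "argument list too long" error comes from the shell expanding the glob into one enormous command line. As a rough Python alternative to the find and xargs approach (an assumption on my part, not what the linked solution shows), the sketch below walks the directory one entry at a time, so no long argument list is ever built. The function name `remove_matching` is hypothetical.

```python
import fnmatch
import os


def remove_matching(directory, pattern="pybedtools.*.tmp"):
    """Delete regular files in `directory` whose names match `pattern`.

    Entries are streamed with os.scandir(), so the number of files does
    not matter. Returns the number of files removed.
    """
    removed = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue  # skip directories and symlinks
            if fnmatch.fnmatch(entry.name, pattern):
                try:
                    os.unlink(entry.path)
                    removed += 1
                except FileNotFoundError:
                    pass  # removed by someone else in the meantime
    return removed
```

Calling `remove_matching("/tmp")` would delete every matching file in /tmp, including those belonging to other running pybedtools sessions, so it carries the same caveat as cleanup(remove_all=True).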
Also note that if you're creating new tempfiles across multiple processes, the list of tempfiles is not shared across process boundaries. That's why the functions in stats.py are careful about deleting files as they go.
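Since a parent process cannot see tempfiles registered in its workers, one way to follow the "delete as you go" advice is for each worker function to remove its own scratch file before returning. This is a generic sketch of that idea, not the code in stats.py; `count_lines_via_tempfile` is a hypothetical example function.

```python
import os
import tempfile


def count_lines_via_tempfile(lines):
    """Write `lines` to a scratch file, count them, and always delete the file.

    Returns (line_count, path) so a caller can confirm the file is gone.
    """
    fd, path = tempfile.mkstemp(prefix="example.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.writelines(line + "\n" for line in lines)
        with open(path) as handle:
            n = sum(1 for _ in handle)
    finally:
        # Runs even if the work above raises, so nothing is left behind
        # for a parent process that has no record of this file.
        os.unlink(path)
    return n, path
```

A function written this way can be handed to something like multiprocessing.Pool.map without relying on any shared tempfile list.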
hmm good to know, thanks @daler for explaining this
The reason I am asking is that I thought it might be the cause of a MalformedBedLineError:
coverage_result = alignment.genome_coverage(genome="hg19")
coverage_result.head(100)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "../python2.7/site-packages/pybedtools/bedtool.py", line 1036, in head
for i, line in enumerate(iter(self)):
File "cbedtools.pyx", line 680, in pybedtools.cbedtools.IntervalFile.__next__ (pybedtools/cbedtools.cpp:8685)
pybedtools.cbedtools.MalformedBedLineError: malformed line: ['1', '30', '22', '249250621', '8.82646e-08']
This is the section from the temp file
1 28 52 249250621 2.08625e-07
1 29 83 249250621 3.32998e-07
1 30 22 249250621 8.82646e-08
1 31 59 249250621 2.3671e-07
1 32 29 249250621 1.16349e-07
1 33 23 249250621 9.22766e-08
This looks good to me though, and it used to work before. I don't really understand the reason for this message.
Any idea?
PS: I cleaned all temp files before running.
Ah, that's because your start coord is greater than your stop coord for the third line in your example.
See this recent BEDtools mailing list post for details.
But those aren't supposed to be coordinates. When BedTool.genome_coverage is called with no bg or bga option, it returns chromosome, depth, number of bases at that depth, chromosome size, and fraction, no?
Sorry, I missed that. In that case, this is similar to issue #110, where it's not actually a valid BED/GTF/GFF/VCF/BAM format file.
The problem here is that sometimes BedTool.genome_coverage (i.e. bedtools genomecov) returns a valid bedGraph file (if you use -d, -bg, or -bga) and sometimes not (as in the default).
I suppose I could manually check for which parameters were passed, and detect whether a file will be formatted to work nicely with a BedTool object. If so, return a BedTool object. But if the default settings are used, what should be returned if not a BedTool object?
So far, I've chosen to not pay attention to kwargs passed and just always return a BedTool object, relying on the user to decide if their file is a valid format or not. But I'm certainly open to suggestions for how to improve this.
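The "check which parameters were passed" idea could look roughly like the sketch below. It is purely illustrative and not part of pybedtools: `yields_interval_output` and `BEDGRAPH_OPTIONS` are hypothetical names, and the option list is taken only from the three flags mentioned in this thread.

```python
# Options this thread names as producing bedGraph-style output.
BEDGRAPH_OPTIONS = ("d", "bg", "bga")


def yields_interval_output(**kwargs):
    """Return True if any bedGraph-producing option is switched on.

    A False result means the default histogram output is expected, which
    does not have start/stop coordinates in columns 2 and 3.
    """
    return any(bool(kwargs.get(opt)) for opt in BEDGRAPH_OPTIONS)


print(yields_interval_output(genome="hg19"))           # default histogram
print(yields_interval_output(genome="hg19", bg=True))  # bedGraph
```

A wrapper could use a check like this to decide whether the result is safe to iterate over as intervals, leaving open the question raised above of what to return otherwise.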
Hmm, I see. Well, I guess this could be solved with another function similar to #110, but instead of returning a BedTool + dataframe, it would return two dataframes.
In general, I think it would be better to add a watcher kind of function: something that checks whether the relevant kwargs are provided, in which case the returned object is a BedTool; otherwise, the returned object is some other parsable kind of data, ideally a DataFrame.
I guess for now I can play with the solution provided in #110, but that's good to know. Thanks for clarifying this, @daler.
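In the meantime, the default five-column output can be read directly from the temp file without going through interval parsing. The sketch below uses only the standard library and assumes tab-separated columns. The field names are my own labels, and the third column is, as far as I know from the bedtools genomecov documentation, the number of bases at that depth. The sample rows are copied from the excerpt earlier in this thread.

```python
import csv
import io
from typing import NamedTuple


class CoverageRow(NamedTuple):
    chrom: str
    depth: int
    n_bases: int      # assumed meaning: bases covered at exactly this depth
    chrom_size: int
    fraction: float


def parse_genomecov_histogram(handle):
    """Parse tab-separated default genomecov output into CoverageRow records."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        chrom, depth, n_bases, chrom_size, fraction = fields
        rows.append(
            CoverageRow(chrom, int(depth), int(n_bases), int(chrom_size), float(fraction))
        )
    return rows


# Two rows from the temp file shown earlier, including the one that
# triggered MalformedBedLineError when read as an interval.
sample = io.StringIO(
    "1\t30\t22\t249250621\t8.82646e-08\n"
    "1\t31\t59\t249250621\t2.3671e-07\n"
)
rows = parse_genomecov_histogram(sample)
```

With pandas installed, `pandas.read_csv(path, sep="\t", names=[...])` with the same five column names would give the DataFrame form suggested above.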
btw, is there another utility similar to pybedtools.create_interval_from_list? Something like pybedtools.create_csv?
What are you aiming to do? If you'd like a CSV version of a BedTool, you could use the new to_dataframe method:
import pybedtools
a = pybedtools.example_bedtool('a.bed')
a.to_dataframe().to_csv('output.csv')
Is that merged into the master branch? Should I pull the repo again?
Yep -- I committed it yesterday after you said the method I proposed would work for your purposes. It's in the master branch now.
Awesome, thanks a lot @daler, I will update. I am putting together a couple of examples of pybedtools usage and will be publishing some runnable examples at CodersCrowd soon.
OK. Closing this for now, but feel free to re-open if needed. Also, I opened #113 for detecting valid BedTool output as you mentioned.
:+1:
When using pybedtools, temporary files are created for later access whenever one of the pybedtools functions is called (the file referenced by x.fn).
My question: these files can sometimes be big, and they will keep piling up in the user's /tmp or on servers.
Is there a way of flushing these temp files when the program exits? I don't think it is reasonable to flush while the program is still running, but it is definitely useful to clean up a little after doing stuff.
Any thoughts?
Thanks