daler / pybedtools

Python wrapper -- and more -- for BEDTools (bioinformatics tools for "genome arithmetic")
http://daler.github.io/pybedtools

Is there a flushing process for temp files #112

Closed radaniba closed 10 years ago

radaniba commented 10 years ago

When using pybedtools, a temporary file is created whenever one of its functions is called (the file referenced by x.fn) so the result can be accessed later on.

My concern is that these files can sometimes be big, and they will keep piling up in the user's /tmp or on servers.

Is there a way of flushing these temp files when the program exits? I don't think it is reasonable to flush while the program is still running, but it would definitely be useful to clean up a little after doing stuff.

Any thoughts?

Thanks

radaniba commented 10 years ago

I guess one can specify the output file and delete it from within the program itself when the program ends. But in case the user doesn't, it would be good if pybedtools 'remembered' all the files generated in a given session and cleared them out when the program exits.

daler commented 10 years ago

Yep, any tempfiles created are automatically cleaned up when the Python interpreter exits. Specifically, the last line in helpers.py registers helpers.cleanup() to be called upon exit.

Within a single session, you can always call pybedtools.cleanup() to get rid of any files created so far in that session.

By default, cleanup() only gets rid of the files in BedTool.TEMPFILES, so that if other users on the same filesystem are using pybedtools, their files won't get deleted inadvertently. But if you don't care about that, you can use pybedtools.cleanup(remove_all=True) to get rid of anything matching $TEMPDIR/pybedtools.*.tmp. This could be slow if you have hundreds of thousands of files; see below for a solution to this.
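For illustration, here is a minimal standard-library sketch of the general pattern described above: keep a per-process list of temp files and register a cleanup function with atexit. This is not the code in helpers.py; the names `_TRACKED_TEMPFILES`, `make_tempfile`, and this `cleanup` are hypothetical stand-ins.

```python
import atexit
import os
import tempfile

# Hypothetical registry, standing in for the list pybedtools keeps internally.
_TRACKED_TEMPFILES = []


def make_tempfile(prefix="pybedtools.", suffix=".tmp"):
    """Create an empty temp file, remember its path, and return the path."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix)
    os.close(fd)
    _TRACKED_TEMPFILES.append(path)
    return path


def cleanup():
    """Delete only the files this process created; return how many were removed."""
    removed = 0
    while _TRACKED_TEMPFILES:
        path = _TRACKED_TEMPFILES.pop()
        try:
            os.remove(path)
            removed += 1
        except FileNotFoundError:
            pass  # already deleted by hand
    return removed


# Run cleanup() automatically on normal interpreter exit.
atexit.register(cleanup)
```

Because only tracked paths are removed, files created by other users or other processes are left alone. Functions registered with atexit are not run if the process is killed by a signal that Python does not handle.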

If you kill a running Python process that created a lot of tempfiles, cleanup() will never run, and that can cause temp files to accumulate. For example, in the past I've killed a running Python process that was doing a lot of randomizations using multiple processors. This resulted in a LOT (hundreds of thousands) of tempfiles that never got cleaned up from a normal exit. From a terminal, rm /tmp/pybedtools.*.tmp gave an "argument list too long" error. The solution was to use find and xargs, as described here.
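As an alternative to find and xargs, the same cleanup can be sketched in Python, which never builds a shell argument list and so does not hit the "argument list too long" limit. The function name `remove_leftover_tempfiles` is hypothetical; the default pattern is the pybedtools.*.tmp pattern mentioned above. Like remove_all=True, this deletes every matching file in the directory, including files belonging to other running sessions.

```python
import fnmatch
import os
import tempfile


def remove_leftover_tempfiles(directory=None, pattern="pybedtools.*.tmp"):
    """Delete files in `directory` whose names match `pattern`; return the count."""
    directory = directory or tempfile.gettempdir()

    # Collect matching paths first, then delete, so the directory is not
    # modified while it is being scanned.
    with os.scandir(directory) as entries:
        matches = [
            entry.path
            for entry in entries
            if entry.is_file(follow_symlinks=False)
            and fnmatch.fnmatch(entry.name, pattern)
        ]

    removed = 0
    for path in matches:
        try:
            os.remove(path)
            removed += 1
        except OSError:
            pass  # e.g. owned by another user, or already gone
    return removed
```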

Also note that if you're creating new tempfiles across multiple processes, the list of tempfiles is not shared across process boundaries. That's why the functions in stats.py are careful about deleting files as they go.
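A rough sketch of the "delete as you go" idea for worker processes, using only the standard library. It is not the code in stats.py; `count_lines_via_tempfile` and its `directory` parameter are hypothetical. The point is that the worker removes its own temp file in a finally block instead of relying on a list held by the parent process.

```python
import os
import tempfile


def count_lines_via_tempfile(lines, directory=None):
    """Hypothetical worker: write `lines` to a temp file, count them, delete the file."""
    fd, path = tempfile.mkstemp(prefix="pybedtools.", suffix=".tmp", dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.writelines(line + "\n" for line in lines)
        with open(path) as handle:
            return sum(1 for _ in handle)
    finally:
        # Runs inside the worker itself, so no parent-side registry is needed.
        os.remove(path)
```

A function shaped like this could be handed to multiprocessing.Pool.map; each worker then leaves nothing behind on a normal return or an exception.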

radaniba commented 10 years ago

hmm good to know, thanks @daler for explaining this

The reason I am asking is that I thought a MalformedBedLineError might be caused by that:


coverage_result = alignment.genome_coverage(genome="hg19")

coverage_result.head(100)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "../python2.7/site-packages/pybedtools/bedtool.py", line 1036, in head
    for i, line in enumerate(iter(self)):
  File "cbedtools.pyx", line 680, in pybedtools.cbedtools.IntervalFile.__next__ (pybedtools/cbedtools.cpp:8685)
pybedtools.cbedtools.MalformedBedLineError: malformed line: ['1', '30', '22', '249250621', '8.82646e-08']

This is the section from the temp file

1   28  52  249250621   2.08625e-07
1   29  83  249250621   3.32998e-07
1   30  22  249250621   8.82646e-08
1   31  59  249250621   2.3671e-07
1   32  29  249250621   1.16349e-07
1   33  23  249250621   9.22766e-08

This looks fine to me, though, and it used to work before. I don't really understand the reason for this message.

Any idea?

PS: I cleaned all temp files before running.

daler commented 10 years ago

Ah, that's because your start coord is greater than your stop coord for the third line in your example.

See this recent BEDtools mailing list post for details.
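To make the check concrete, here is a simplified sketch that reads columns 2 and 3 of the lines posted above as start and stop and flags any line where start is greater than stop. It is only an approximation of the check, not the actual parser in cbedtools; `looks_like_valid_interval` is a hypothetical name.

```python
# The six lines quoted from the temp file above.
sample = """\
1   28  52  249250621   2.08625e-07
1   29  83  249250621   3.32998e-07
1   30  22  249250621   8.82646e-08
1   31  59  249250621   2.3671e-07
1   32  29  249250621   1.16349e-07
1   33  23  249250621   9.22766e-08
"""


def looks_like_valid_interval(fields):
    """True if fields 2 and 3 parse as integers with start <= stop."""
    try:
        start, stop = int(fields[1]), int(fields[2])
    except (IndexError, ValueError):
        return False
    return start <= stop


checks = [looks_like_valid_interval(line.split()) for line in sample.splitlines()]
for line, ok in zip(sample.splitlines(), checks):
    print("ok       " if ok else "malformed", line)
```

The first two lines pass, which is why iteration only fails when it reaches the third line (30 > 22).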

radaniba commented 10 years ago

But those aren't supposed to be coordinates. That's the output of genome_coverage called with no bg or bga option; it returns chromosome, depth, number of reads, size, fraction,

no?

daler commented 10 years ago

Sorry, I missed that. In that case, this is similar to issue #110, where it's not actually a valid BED/GTF/GFF/VCF/BAM format file.

The problem here is that sometimes BedTool.genome_coverage (i.e. bedtools genomecov) returns a valid bedGraph file (if you use -d, -bg, -bga) and sometimes not (as in the default).
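Until there is a better answer, one workaround is to read the default (histogram) output from the file directly rather than iterating over it as intervals. Below is a minimal standard-library sketch; `HistogramRow` and `parse_genomecov_histogram` are hypothetical names, and the field names are my own. Judging from the sample above, the fifth column equals column 3 divided by column 4, so I treat column 3 as a count at the given depth.

```python
from collections import namedtuple

# Hypothetical record type; column order follows the five-column sample above.
HistogramRow = namedtuple("HistogramRow", "chrom depth count chrom_size fraction")


def parse_genomecov_histogram(lines):
    """Parse whitespace-separated five-column histogram lines into HistogramRow records."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chrom, depth, count, size, fraction = line.split()
        rows.append(
            HistogramRow(chrom, int(depth), int(count), int(size), float(fraction))
        )
    return rows
```

With a BedTool in hand, something like `parse_genomecov_histogram(open(coverage_result.fn))` would read the temp file directly and avoid the interval parser.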

I suppose I could manually check for which parameters were passed, and detect whether a file will be formatted to work nicely with a BedTool object. If so, return a BedTool object. But if the default settings are used, what should be returned if not a BedTool object?

So far, I've chosen to not pay attention to kwargs passed and just always return a BedTool object, relying on the user to decide if their file is a valid format or not. But I'm certainly open to suggestions for how to improve this.
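A sketch of what such a parameter check might look like. This is hypothetical and not something pybedtools does; `produces_interval_output` and `INTERVAL_OUTPUT_FLAGS` are made-up names, and the flag list is simply the one given above (-d, -bg, -bga).

```python
# Flags that, per the comment above, make genomecov emit output a BedTool can iterate.
INTERVAL_OUTPUT_FLAGS = ("d", "bg", "bga")


def produces_interval_output(**kwargs):
    """Return True if any of the interval-producing flags is set to a truthy value."""
    return any(kwargs.get(flag) for flag in INTERVAL_OUTPUT_FLAGS)
```

A wrapper could call this with the same kwargs passed to genome_coverage and decide whether to hand back a BedTool or some other parsed structure.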

radaniba commented 10 years ago

Hmm, I see. Well, I guess this could be solved with another function similar to #110, but one that returns 2 dataframes rather than a BedTool + dataframe.

In general, I think it would be better to add a watcher kind of function: something that checks whether those kwargs are provided. If they are, the returned object is a BedTool; otherwise it is some other exploitable / parsable kind of data. A DataFrame would be ideal.

I guess for now I can play with the solution provided in #110, but that's good to know. Thanks for clarifying this, @daler.

radaniba commented 10 years ago

By the way, is there another utility similar to pybedtools.create_interval_from_list? Something like pybedtools.create_csv?

daler commented 10 years ago

What are you aiming to do? If you'd like a CSV version of a BedTool, you could use the new to_dataframe method:

import pybedtools
a = pybedtools.example_bedtool('a.bed')
a.to_dataframe().to_csv('output.csv')

radaniba commented 10 years ago

Is that merged into the master branch? Should I pull the repo again?

daler commented 10 years ago

Yep -- I committed it yesterday after you said the method I proposed would work for your purposes. It's in the master branch now.

radaniba commented 10 years ago

Awesome, thanks a lot @daler, I will update. I am putting together a couple of runnable examples of pybedtools usage and will be publishing them at CodersCrowd soon.

daler commented 10 years ago

OK. Closing this for now, but feel free to re-open if needed. Also, I opened #113 for detecting valid BedTool output as you mentioned.

radaniba commented 10 years ago

:+1: