daler / pybedtools

Python wrapper -- and more -- for BEDTools (bioinformatics tools for "genome arithmetic")
http://daler.github.io/pybedtools
Other
297 stars 103 forks source link

Handling of wide pseudo-bed files (pseudo-bed files interpreted as .sam)? #363

Closed semenko closed 2 years ago

semenko commented 2 years ago

I ran into an issue that took me a bit to debug --

I'm parsing a ~pseudo .bed file that looks like this:

chr1 1 10 0 1 2 3 4 5 6 7 8 9 ...

When there are 11 or more total columns, pybedtools considers this a sam and some commands error, e.g. if column 4 is zero.

>>> print(BedTool("chr1 1 10 0 1 2 3 4 5 6 7", from_string=True).remove_invalid())

>>> print(BedTool("chr1 1 10 0 1 2 3 4 5 6", from_string=True).remove_invalid())
chr1    1   10  0   1   2   3   4   5   6

>>> print(BedTool("chr1 1 10 1 1 2 3 4 5 6 7", from_string=True).remove_invalid())
chr1    1   10  1   1   2   3   4   5   6   7

I ran into this using the .cut() helper, which threw a nebulous error:

>>> BedTool("chr1 1 10 0 1 2 3 4 5 6 7", from_string=True).cut([0,1,2])
…
OverflowError: can't convert negative value to CHRPOS

This took me a while to figure out -- my file is named .bed -- and only rare lines where column 4 == 0 are considered invalid in my data. (Plus, bedtools itself handles these files just fine -- I'm running a map on some of the colums much higher than position 7.)

Any thoughts on this, or perhaps warning users (e.g. if the file extension is .bed but file_type is interpreted as sam)?

Is there a way to override BedTool interpreting this as .sam?

Thanks for developing & maintaining this amazing package!

daler commented 2 years ago

The logic for detecting SAM format is here, that is, it has nothing to do with the file name but rather looks at the contents:

    if (
        (len(fields) >= 11)
        and isdigit(fields[1])
        and isdigit(fields[3])
        and isdigit(fields[4])
        and (fields[5] not in ['.', '+', '-'])
    ):

So it seems like your particular files are looking SAM-like enough to be called as such. Since the filetype is determined and set on the cython side, setting the filetype manually is not curently straightforward. If you are able to modify your input files (say, by changing the 6th position to be in ., +, or -), that will be easiest. For example, the following is considered a BED:

BedTool("chr1 1 10 0 1 . 2 3 4 5 6 7", from_string=True).cut([0,1,2])

By the way, that error is because SAM format is 1-based, but Intervalobjects are zero-based. So it was trying to subtract 1 from the 0 (which would be the position if it were a SAM file), causing an overflow from the resulting negative value.

semenko commented 2 years ago

Thanks!

Yeah, this is definitely an edge case — I'm refactoring some old bash scripts and this was the first situation I've seen where bedtools & pybedtools don't behave the same (i.e. bedtools map handles this just fine).

Feel free to close this if there's no easy solution and/or don't think a warning to the user is appropriate.