daler / pybedtools

Python wrapper -- and more -- for BEDTools (bioinformatics tools for "genome arithmetic")
http://daler.github.io/pybedtools
Other
297 stars 103 forks source link

Bed identified as SAM #370

Closed dprat closed 1 year ago

dprat commented 1 year ago

Hi, have been strugling with an issue of the same type as #363 At some point I need to merge a bedfile and I end up with up a bedfile containing 11 column. I then need to separate some line of the BedTools object and then cat() them. For some reason the line are considered as "sam" instead of bed format. After digging in this issue I tried to check over this :

if (
        (len(fields) >= 11)
        and isdigit(fields[1])
        and isdigit(fields[3])
        and isdigit(fields[4])
        and (fields[5] not in ['.', '+', '-'])
    ):

But I haven't been able to make it work even by commenting the whole 'SAM' part and after recompiling it, the line that were identified as 'sam' are still considered as 'sam'

But at 1112 in bedtool.py:

self._file_type = six.advance_iterator(iter(self)).file_type

If I comment this line, then the attribute _file_type is not "sam" but stay None and this final solve my problem. But is there a good way to fix this ? I haven't been able to solve it in a proper way.

I'm sorry I actually haven't been able to reproduce this error in an easy way.

Here is the result I obtain:

chr1    710330  710331
*   0   0
chr1    3935209 3935210

When this is the result I want:

chr1    3935209 3935210 3935209,3935209,3935209 3935210,3935210,3935210 3935209,3935209,3935209 3935210,3935210,3935210 +   hg38    5   170,171,172

edit :

After digging a bit I finally managed to isolate the problem:

import pybedtools
import os, difflib, sys
import six

testdir = os.path.dirname(__file__)
tempdir = os.path.join(os.path.abspath(testdir), "tmp")
unwriteable = "unwriteable"

def test_malformed():
    """
    Malformed BED lines should raise MalformedBedLineError
    """

    a = pybedtools.BedTool(
        """
        chr1    3787399 3787400 3787399 3787400 3787399 3787400 +   hg38    5   169
        chr1    3935209 3935210 3935209,3935209,3935209 3935210,3935210,3935210 3935209,3935209,3935209 3935210,3935210,3935210 +   hg38    5   170,171,172

        """,
        from_string=True,
    )
    a_i = iter(a)

    print(six.advance_iterator(a_i).file_type)
    assert False

In this case i end up with a _file_type = "sam"


F                                                                                                                                                                                  [100%]
======================================================================================== FAILURES ========================================================================================
_____________________________________________________________________________________ test_malformed _____________________________________________________________________________________

    def test_malformed():
        """
        Malformed BED lines should raise MalformedBedLineError
        """

        a = pybedtools.BedTool(
            """
            chr1    3787399 3787400 3787399 3787400 3787399 3787400 +   hg38    5   169
            chr1    3935209 3935210 3935209,3935209,3935209 3935210,3935210,3935210 3935209,3935209,3935209 3935210,3935210,3935210 +   hg38    5   170,171,172

            """,
            from_string=True,
        )
        a_i = iter(a)

        print(six.advance_iterator(a_i).file_type)
>       assert False
E       assert False

test/test_malformed_bed.py:25: AssertionError
---------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------
sam
================================================================================ short test summary info =================================================================================
FAILED test/test_malformed_bed.py::test_malformed - assert False
1 failed in 0.19s

But if i invert the two line of my bed then i obtain a _file_type = "bed"

daler commented 1 year ago

So when the first line's fourth field is 3935209,3935209,3935209, according to the SAM heuristic:

    if (
        (len(fields) >= 11)
        and isdigit(fields[1])
        and isdigit(fields[3])
        and isdigit(fields[4])
        and (fields[5] not in ['.', '+', '-'])
    ):

isdigit(fields[3]) is false. So it's detected as a BED.

But when the first line's fourth field is 3787399,isdigit(fields[3]) is True so it's detected as SAM, and all of the other criteria for SAM also match.

So this is a corner case where you happen to have an 11-field BED file where the name field is an integer. If you have a better way of detecting these kinds of corner cases directly that doesn't have a performance impact, that would be great to have as a PR. As a workaround, you could try adding a field to your bed so you have more than 11 fields, which should make it fail the SAM criteria.