arq5x / bedtools2

bedtools - the swiss army knife for genome arithmetic
MIT License
941 stars 287 forks source link

The `bedtools multiinter -header` option outputs broken BED #1085

Open dbolser opened 7 months ago

dbolser commented 7 months ago

Hi, Thanks for creating bedtools @arq5x !

When I use: bedtools multiinter -header -i a.bed b.bed

the resulting output looks like this:

chrom   start   end     num     list    a.bed        b.bed
chr1    16366   16374   1       1       1       0
chr1    18235   18249   1       2       0       1
chr1    18461   18475   1       2       0       1
chr1    25597   25605   1       1       1       0
chr1    34568   34576   1       1       1       0
chr1    36631   36645   1       2       0       1
chr1    42427   42435   1       1       1       0
chr1    44984   44992   1       1       1       0
...

Note that this is not bed format, which should be:

#chrom   start   end     num     list    a.bed        b.bed
chr1    16366   16374   1       1       1       0
chr1    18235   18249   1       2       0       1
chr1    18461   18475   1       2       0       1
chr1    25597   25605   1       1       1       0
chr1    34568   34576   1       1       1       0
chr1    36631   36645   1       2       0       1
chr1    42427   42435   1       1       1       0
chr1    44984   44992   1       1       1       0
...

At least I think that's what it should be ... I'm actually not sure now I check the BED specification...

HOWEVER, the lack of the # DOES break your own tools, e.g. multiinter itself expects the BED file header to start with #, and so does genomecov (and so does LiftOver from UCSC...)

So perhaps this bug is better described as "multiinter and genomecov fall over when given a valid bed header...", with a separate bug for LiftOver?

THx

dbolser commented 7 months ago

I don't want to do this:

time bedtools genomecov \
    -i <(echo -n '#' && bgzcat Data3/PWM/multiinter.bed.gz)
    -g ../../ENCFF792NJK.tsv \
    > genomecov-multiinter.tsv # hashtag sad face
dbolser commented 7 months ago

Also tabix doesn't like the header without a preceding #, e.g. although tabix -S 1 -p bed multiinter.bed.gz works, tabix -H multiinter.bed.gz doesn't (it does when you have the #) and as an extension, Python's pysam.TabixFile("multiinter.bed.gz").header also 'fails'.

So although nothing in BED formally requires a # for the header, it seems a lot of tooling has adopted that convention.