arq5x / bedtools

A powerful toolset for genome arithmetic.
http://code.google.com/p/bedtools/
GNU General Public License v2.0

bedSort fails for 0 length features #123

Open bernt-matthias opened 6 years ago

bernt-matthias commented 6 years ago

bedSort outputs the following for the SNPs dataset from UCSC (note that the second and third records are out of order by start position):

...
chr22   17586594    17586595    rs34484815  0   +
chr22   17586605    17586605    rs536619616 0   +
chr22   17586604    17586605    rs560126106 0   +
...

I guess the problem is the 0-length features, which do not make sense. But bedtools should still output sorted data.
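For reference, here is a minimal sketch of the ordering I would expect, assuming a plain sort by chromosome, then start, then end. This is not how bedtools implements sorting; `sort_bed` is a hypothetical helper, and the records are the three lines quoted above.

```python
# Records copied from the output above: (chrom, start, end, name, score, strand).
records = [
    ("chr22", 17586594, 17586595, "rs34484815", 0, "+"),
    ("chr22", 17586605, 17586605, "rs536619616", 0, "+"),
    ("chr22", 17586604, 17586605, "rs560126106", 0, "+"),
]


def sort_bed(rows):
    """Hypothetical helper: order rows by chromosome name, then start, then end.

    Chromosome names are compared as plain strings, which is an assumption
    made for this sketch only.
    """
    return sorted(rows, key=lambda row: (row[0], row[1], row[2]))


sorted_records = sort_bed(records)
for row in sorted_records:
    # Print tab-separated, as in a BED file.
    print("\t".join(str(field) for field in row))
```

With this ordering the zero-length record (start 17586605) comes after the record starting at 17586604, so a zero-length feature needs no special handling to be placed correctly.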

bernt-matthias commented 6 years ago

The note from UCSC on the validity of 0 length SNPs:

We consider point insertions into the genome to be zero length features. You can see the SNP in question in the following Genome Browser view: http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=chmalee&hgS_otherUserSessionName=hg19_chr22PointInsertion

where the highlighted SNP indicates a G or GG insertion between bases 17586605 and 17586606 on chromosome 22. Because we internally store our coordinates as zero-based half open coordinates, these point insertions end up as zero length coordinates. For more information on our coordinate system please see the following blog post: http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/
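To make the coordinate arithmetic concrete, here is a small sketch of how a zero-based, half-open interval yields a zero length for a point insertion. The function names `bed_length` and `insertion_flanks` are hypothetical and only illustrate the convention described in the UCSC note.

```python
def bed_length(start, end):
    """Length of a zero-based, half-open interval [start, end)."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start


def insertion_flanks(start, end):
    """For a zero-length interval, return the two 1-based bases it sits between.

    In zero-based, half-open coordinates, the boundary at position p lies
    between 1-based base p and 1-based base p + 1.
    """
    if bed_length(start, end) != 0:
        raise ValueError("not a zero-length (point insertion) feature")
    return (start, start + 1)


# The SNP from the report: start == end, so the feature has length 0.
print(bed_length(17586605, 17586605))
# An ordinary single-base SNP spans exactly one base.
print(bed_length(17586604, 17586605))
# The insertion lies between these two 1-based bases.
print(insertion_flanks(17586605, 17586605))
```

Under this convention the record `chr22 17586605 17586605` is a valid feature, matching the insertion between bases 17586605 and 17586606 mentioned in the note.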