bedops / bedops

:microscope: BEDOPS: high-performance genomic feature operations
https://bedops.readthedocs.io/
Other
298 stars 59 forks source link

rmsk2bed Creates Non-Compliant BED File #246

Closed DarioS closed 3 years ago

DarioS commented 3 years ago

As shown in the documentation example, the seventh column is some non-integer number. When I try to import a similar BED file into R, I get an error:

library(rtracklayer)
> retroReferenceLocations <- import("repeatsMini.bed")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'an integer', got '1.3'

Indeed, the UCSC Genome Browser documentation seems to imply that the seventh and eighth columns should be integers.

kskuchin commented 3 years ago

I'm having similar issues (with gtf2bed) - is there any resolution yet?

alexpreynolds commented 3 years ago

With respect to rmsk2bed, I'm still unsure if this is a bug or not, as BED allows many forms that are not particularly standard (including bedgraph, for instance, which puts the score in the fourth column, instead of the fifth).

BED12 is only one format, and following this format would require RepeatMasker output to contain information that it does not presently contain, and the BED output would be "lossy".

I'll keep an eye on this as I work on the next revision, but my inclination is to call the output from rmsk2bed BED6+, where the output follows the BED specification up the sixth column.

If you run into problems importing this into other tools that expect BED12 or other variants of BED, you could use cut to get the first six columns, e.g.: cut -f1-6 in.bed > out.bed, using the output from that to bring into R. Or some R libraries will allow reading in files into tables with named columns, and a subset can be derived within R, itself.

For gtf2bed, that output is also best used as BED6+, as well, for the same reason. Some biotypes may not translate directly to BED12 elements. You might look at gtfToGenePred in the Kent Utilities toolkit as one option for handling gene biotypes, specifically.