Closed by lbeltrame 6 years ago
Thanks for letting me know. What does the integer in the fifth column represent here?
On Wednesday, 7 February 2018 at 23:36:25 CET, Eric Talevich wrote:
Thanks for letting me know. What does the integer in the fifth column represent here?
It represents the number of probes mapping to that region. This is a so-called "backbone" probe set, which is used specifically to detect structural variants, including those in non-coding regions.
I think it's just as plausible that the "gene" name or region label in the first line of an interval list file could be all digits, so it's not possible to distinguish BED and interval list strictly by a regex of the first line.
Rather than add an option to specify the input regions file format on the command line for every CNVkit command, I'll have skgenome.tabio.read_auto first look at the input filename extension and match that to a known format. Only if that format's regex doesn't match the first line (or the filename can't be determined, e.g. piped input or an unusual filename) will it fall back to inferring the input format from a regex on the first line.
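The detection order described above could be sketched roughly as follows. This is an illustration only, not the actual skgenome.tabio code: the function name `guess_format`, the extension table, and both regexes are assumptions made up for this example.

```python
import os
import re

# Hypothetical extension-to-format table (assumed for this sketch).
EXT_FORMATS = {".bed": "bed", ".interval_list": "interval"}

# Simplified first-line patterns, NOT the regexes CNVkit actually uses.
# "interval" is listed first so the content-only fallback tries it first,
# which reproduces the misdetection discussed in this thread.
FORMAT_REGEXES = {
    "interval": re.compile(r"^\S+\t\d+\t\d+\t[+-]\t\S+$"),
    "bed": re.compile(r"^\S+\t\d+\t\d+(\t\S+)*$"),
}


def guess_format(first_line, filename=None):
    """Guess a regions-file format: filename extension first, then content."""
    line = first_line.rstrip("\r\n")
    if filename:
        ext = os.path.splitext(filename)[1].lower()
        fmt = EXT_FORMATS.get(ext)
        # Trust the extension only if that format's regex also matches.
        if fmt and FORMAT_REGEXES[fmt].match(line):
            return fmt
    # No usable filename or extension: fall back to the first-line regexes.
    for fmt, regex in FORMAT_REGEXES.items():
        if regex.match(line):
            return fmt
    raise ValueError("Could not determine the regions file format")


# Made-up BED row with "-" as the name and a probe count in column 5.
row = "chr1\t11868\t12227\t-\t3"
print(guess_format(row, "probes.bed"))  # extension settles the ambiguity
print(guess_format(row))                # content alone, e.g. piped input
```

With the `.bed` filename the row is treated as BED. Without a filename, the same row also matches the interval-list pattern, which is the ambiguity that motivates checking the extension first.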
As a workaround for this case, you could sidestep the problem with CNVkit's regex-based inference by deleting the fifth column from the BED file.
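One way to apply that workaround is a small script like the one below. It is a minimal sketch that assumes a tab-separated BED file; the function name `drop_fifth_column` is made up for this example and is not part of CNVkit.

```python
def drop_fifth_column(in_path, out_path):
    """Copy a tab-separated BED file, omitting column 5 from each data row."""
    with open(in_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        for line in src:
            # Pass header and comment lines through unchanged.
            if line.startswith(("track", "browser", "#")):
                dst.write(line)
                continue
            fields = line.rstrip("\r\n").split("\t")
            # Slice deletion is a no-op for rows with fewer than 5 columns.
            del fields[4:5]
            dst.write("\t".join(fields) + "\n")
```

Note that if the file has more than five columns, the later columns shift left by one, so this is only appropriate when nothing downstream depends on them.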
With the last commit, tabio.read_auto should handle your Agilent BED file properly as long as the filename extension is ".bed".
cc @LMannarino
BED files generated by SureDesign may have a "-" in their gene name, which breaks detection by tabio (mis-identifies them as "interval") and causes issues down the road. Example from the Agilent OneSeq BED file: