etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

Certain BED files (Agilent OneSeq) are mis-identified by tabio #315

Closed by lbeltrame 6 years ago

lbeltrame commented 6 years ago

cc @LMannarino

BED files generated by SureDesign may have a "-" in the gene name column, which breaks format detection by tabio (it mis-identifies them as "interval") and causes problems downstream.

Example from the Agilent OneSeq BED file:

chr1    11601   11725   -       6
chr1    11736   11860   -       8

etal commented 6 years ago

Thanks for letting me know. What does the integer in the fifth column represent here?

lbeltrame commented 6 years ago

On Wednesday, 7 February 2018, 23:36:25 CET, Eric Talevich wrote:

> Thanks for letting me know. What does the integer in the fifth column represent here?

It represents the number of probes mapping to that region. This is a so-called "backbone" probe set, which is used specifically to detect structural variants, including in non-coding regions.

etal commented 6 years ago

I think it's just as plausible that the "gene" name or region label in the first line of an interval list file could be all digits, so it's not possible to distinguish BED and interval list strictly by a regex of the first line.

Rather than add an option to specify the input regions file format on the command line for every CNVkit command, I'll have skgenome.tabio.read_auto first look at the input filename extension and match it to a known format. Only if no format's filename pattern matches (or the filename can't be determined, e.g. piped input or an unusual filename) will it fall back to a regex on the first line to determine the input format.
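To illustrate the idea, here is a minimal sketch of extension-first detection with a first-line fallback. It is not the actual skgenome.tabio code: the function names, the extension table, and the simplified regexes are all assumptions made up for this example. The interval pattern is checked first only to reproduce the ambiguity reported in this issue.

```python
import re
from pathlib import Path

# Hypothetical extension table; not taken from skgenome.tabio.
EXTENSION_FORMATS = {
    ".bed": "bed",
    ".interval_list": "interval",
}

# Simplified, assumed first-line patterns.
# Interval list rows: chrom, start, end, strand (+/-), name.
INTERVAL_LINE = re.compile(r"^\S+\t\d+\t\d+\t[+-]\t\S+")
# BED rows: chrom, start, end, then optional extra columns.
BED_LINE = re.compile(r"^\S+\t\d+\t\d+(\t|$)")


def sniff_first_line(line):
    """Guess the format from the first data line only (the ambiguous path)."""
    if INTERVAL_LINE.match(line):
        return "interval"
    if BED_LINE.match(line):
        return "bed"
    return "unknown"


def guess_format(filename, first_line):
    """Prefer the filename extension; sniff the content only as a fallback."""
    if filename:
        fmt = EXTENSION_FORMATS.get(Path(filename).suffix.lower())
        if fmt is not None:
            return fmt
    # No usable filename (e.g. piped input) or unknown extension.
    return sniff_first_line(first_line)
```

With the OneSeq line from this issue, sniffing alone picks "interval" because "-" looks like a strand and "6" looks like a name, while a ".bed" filename settles the question before the content is inspected.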

etal commented 6 years ago

As a workaround for this case, you could delete the fifth column from the BED file so that CNVkit's regex-based inference handles it.
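A small sketch of that workaround, assuming a tab-separated file with five columns as in the example above. The helper name drop_fifth_column is made up for illustration, and it keeps only the first four columns, which for a five-column file is the same as deleting the fifth.

```python
def drop_fifth_column(lines):
    """Yield BED lines truncated to their first four columns.

    Header lines ("track", "browser", "#") are passed through unchanged.
    """
    for line in lines:
        if line.startswith(("track", "browser", "#")):
            yield line
            continue
        fields = line.rstrip("\r\n").split("\t")
        # Keep chrom, start, end, and the name column only.
        yield "\t".join(fields[:4]) + "\n"
```

To apply it to a file, open the input and output paths of your choice and pass the input handle through the generator, for example `dst.writelines(drop_fifth_column(src))`.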

etal commented 6 years ago

With the last commit, tabio.read_auto should handle your Agilent BED file properly as long as the filename extension is ".bed".