ga4gh / ga4gh-bed

The browser extensible data (BED) file format describes genomic intervals on chromosomes or scaffolds
Apache License 2.0

Possible bug in "chrom" field #3

Open EricR86 opened 2 months ago

EricR86 commented 2 months ago

Hello,

Recently there was some work with BED files and RefSeq/GenBank chromosome IDs, which typically contain a period for versioning purposes (e.g. "NC_000001.11"). The spec currently does not allow this as-is: only alphanumeric characters are permitted in the "chrom" field.

I e-mailed Jim Kent regarding this issue and this is what he had to say: "Yes, I would consider this an error. All of our parsers are good with anything but white space there. Most of our utilities will handle spaces if you throw in a -tab option, but I wouldn't want to encourage that."
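
To make the difference concrete, here is a minimal sketch comparing the two rules. The strict pattern is an assumption approximating the current spec wording (alphanumeric characters, with underscore also allowed here as a guess); the relaxed pattern follows Jim Kent's "anything but white space". The function name `chrom_ok` and both pattern names are hypothetical, not taken from any existing parser.

```python
import re

# Assumption: approximates the current spec rule (alphanumerics; the
# underscore is included here as a guess, check the spec for the exact rule).
STRICT_CHROM = re.compile(r"[A-Za-z0-9_]+")

# Relaxed rule following Jim Kent's comment: anything but white space.
RELAXED_CHROM = re.compile(r"\S+")


def chrom_ok(chrom: str, relaxed: bool = False) -> bool:
    """Return True if the whole chrom string matches the chosen rule."""
    pattern = RELAXED_CHROM if relaxed else STRICT_CHROM
    return pattern.fullmatch(chrom) is not None


if __name__ == "__main__":
    for name in ("chr1", "NC_000001.11", "chr 1"):
        print(name, chrom_ok(name), chrom_ok(name, relaxed=True))
```

Under these assumptions a versioned accession such as "NC_000001.11" fails the strict check but passes the relaxed one, while a name containing a space fails both.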

EricR86 commented 2 months ago

There was another response from UCSC. Matthew Speir had this to say:

"In short, we think periods should be allowed in an update to the BED specification... bigBed, bigWig, and other big* formats similarly don't have restrictions on using periods in the chrom field."

The details and the original reasoning come from an engineer there, Angie Hinrichs:

When we exclusively used MySQL for storage (before bigBed, etc.), we split some of our largest tracks into a table per chromosome. For example, instead of a single table "xenoMrna" there would be separate tables chr1_xenoMrna, chr2_xenoMrna and so on. This meant only characters that could be used in MySQL table names without special quoting could be used for the chrom field, because they might end up as prefixes in MySQL table names. As I'm sure you know, '.' has special meaning in SQL as a separator between database, table, and field.

However, we had to stop using "split tables" when we added new organisms whose assemblies consisted of tens of thousands or even hundreds of thousands of scaffold sequences -- that would just be way too many MySQL tables. That restriction still applied to old databases with split tables, but not to new databases after a certain point.
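
The split-table naming described above can be sketched roughly as follows. This is only an illustration: the helper names `split_table_name` and `needs_quoting` are hypothetical, and the identifier check is a simplified assumption (ASCII letters, digits, `$` and `_` only), not UCSC's actual code or the full MySQL identifier rules.

```python
import re

# Simplified assumption: characters usable in an unquoted MySQL identifier,
# restricted here to ASCII letters, digits, '$' and '_'.
UNQUOTED_IDENT = re.compile(r"[0-9A-Za-z$_]+")


def split_table_name(chrom: str, track: str) -> str:
    """Build a per-chromosome table name such as chr1_xenoMrna."""
    return f"{chrom}_{track}"


def needs_quoting(name: str) -> bool:
    """Return True if the name contains characters outside the unquoted set."""
    return UNQUOTED_IDENT.fullmatch(name) is None


if __name__ == "__main__":
    for chrom in ("chr1", "NC_000001.11"):
        table = split_table_name(chrom, "xenoMrna")
        print(table, "needs quoting:", needs_quoting(table))
```

With a chrom like "NC_000001.11" the resulting table name contains a '.', which is why such names were ruled out while split tables were in use.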