genomeannotation / GAG

Generates an NCBI .tbl file of annotations on a genome.
MIT License

Validate indices #100

Closed bruab closed 10 years ago

bruab commented 10 years ago

argh.

take a look at mRNA "BDOR_000249-RB" ... its exons have a weird, out-of-order stretch that's ~7kbp away from the end of the segment. Downstream, this gets us a SeqLocOrder error from tbl2asn.

Not sure how we managed to handle this before, but since we're now supporting reading a gff in random order, we need to do some kind of sorting before storing the indices for good.

Contradictory standards (or lack of standards) for storing indices are the issue here. GFF seems to follow a convention where negative strand exons look something like

    200 250
    120 180
    50 90

--that is, the lower value always in column 4, but negative-strandedness represented by decreasing values from row to row. But this is not a requirement, and we've got exceptions in our inputs.

TBL looks like this (same indices):

    250 200
    180 120
    90 50

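--that is, straight up reversed. So I guess we sort(sorted()) indices when we store them (sort each pair, and sort the list of pairs), and reverse(reversed()) them (flip each pair, and flip the list) when we write negative strand features to tbl.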
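A rough sketch of that idea, in case it helps. The function names `normalize_indices` and `tbl_order` are made up for illustration and are not GAG's actual code; it also assumes exon indices are held as a list of `[start, end]` integer pairs and that strand is the GFF `"+"` / `"-"` character.

```python
def normalize_indices(pairs):
    """Sort each [start, end] pair, then sort the pairs themselves.

    Hypothetical helper: accepts pairs in any order and orientation
    and returns them ascending, lower coordinate first.
    """
    return sorted([min(a, b), max(a, b)] for a, b in pairs)


def tbl_order(stored, strand):
    """Return pairs in the order and orientation a tbl row would use.

    Assumes `stored` is already normalized (ascending). For the
    negative strand, flip the list of pairs and flip each pair.
    """
    if strand == "-":
        return [[end, start] for start, end in reversed(stored)]
    return [list(pair) for pair in stored]


# The example exons from above, read from a gff in arbitrary row order.
gff_exons = [[120, 180], [200, 250], [50, 90]]
stored = normalize_indices(gff_exons)
minus_rows = tbl_order(stored, "-")

for start, end in minus_rows:
    print(start, end)
```

Storing everything ascending means the reader never has to care what order the gff rows arrived in; strand only matters at write time.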

bruab commented 10 years ago

they're now stored sorted. probably should check that to_tbl() handles this correctly.
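One possible way to check the written order, sketched under assumptions: `tbl_rows_in_order` is a hypothetical helper, not part of GAG, and it takes the rows as `[start, end]` integer pairs already pulled out of whatever to_tbl() produces. Whether this matches exactly what tbl2asn tests for SeqLocOrder is also an assumption.

```python
def tbl_rows_in_order(rows, strand):
    """True if the rows run in one direction with no overlap or backtracking.

    Hypothetical check: flatten the coordinates, flip them for the
    negative strand, and require a non-decreasing sequence.
    """
    flat = [coord for row in rows for coord in row]
    if strand == "-":
        flat = flat[::-1]
    return all(a <= b for a, b in zip(flat, flat[1:]))


good_minus = [[250, 200], [180, 120], [90, 50]]
bad_minus = [[250, 200], [90, 50], [180, 120]]   # out-of-order stretch

print(tbl_rows_in_order(good_minus, "-"))
print(tbl_rows_in_order(bad_minus, "-"))
```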