ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Use unsigned integer for positions #737

Open david4096 opened 7 years ago

david4096 commented 7 years ago

As suggested here https://github.com/ga4gh/schemas/pull/706#discussion-diff-85609058R178

Positions can be specified using unsigned rather than signed integers, since genomic coordinates are positive. @reece @diekhans ?

jmarshall commented 7 years ago

int64 start = 5; uint64? (Have we had this discussion already? […])

In SAM/BAM/etc, the range is a consideration, as 2^31 is distressingly close to the needs of genomes like wheat and 2^32 would buy a bit more headroom.

Here hopefully the difference in range between 2^63 and 2^64 really is immaterial! :smile:

Signed arithmetic has useful properties over unsigned arithmetic, so it is useful to keep positions using signed data types. Given two positions 0 ≤ p, q < 2^63, we can take their difference p - q safe in the knowledge that it too can be represented as an int64 without overflow.

If p and q were unsigned and ranging up to 2^64, we would either need an int65 to represent p - q (and in languages like C would need to do a lot of casting) or would use uint64 to represent |p - q| and would need a lot of painful code that had separate cases for p < q and p > q.
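
A minimal sketch of the point above, in Python, using ctypes fixed-width integers to mimic how int64 and uint64 values behave in a language like C. The function names are hypothetical and chosen only for this illustration; nothing here comes from the GA4GH schemas.

```python
import ctypes


def signed_diff(p: int, q: int) -> int:
    """p - q for positions held as int64, assuming 0 <= p, q < 2**63."""
    # For inputs in that range the true difference lies strictly between
    # -2**63 and 2**63, so it always fits in an int64 and nothing wraps.
    return ctypes.c_int64(p - q).value


def unsigned_diff(p: int, q: int) -> int:
    """p - q for positions held as uint64; wraps modulo 2**64 when p < q."""
    # ctypes does no overflow checking, so a negative result wraps around
    # the way unsigned subtraction does in C.
    return ctypes.c_uint64(p - q).value


def unsigned_distance(p: int, q: int) -> int:
    """|p - q| for uint64 positions; needs a separate case per ordering."""
    if p >= q:
        return unsigned_diff(p, q)
    return unsigned_diff(q, p)


if __name__ == "__main__":
    print(signed_diff(40, 100))        # a small negative number, as expected
    print(unsigned_diff(40, 100))      # a huge value close to 2**64
    print(unsigned_distance(40, 100))  # correct, but only via the branch
```

With signed positions the plain subtraction is already meaningful; with unsigned positions the caller has to either branch on the ordering or cast to a wider or signed type first.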

david4096 commented 7 years ago

I believe this is more about the benefits of having your type system represent the domain. It can become easier to reason about, construct queries against, etc.

@jmarshall your point is well taken. I don't expect intermediate values like differences to be stored in the protobuf, but having to cast to other types to perform arithmetic with good guarantees is clearly undesirable. @reece care to weigh in?