ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
214 stars 114 forks source link

What's the GA4GH position on 0 vs 1 based positions? #121

Closed calbach closed 9 years ago

calbach commented 9 years ago

All coordinates of the GA4GH schema are currently 0-based. Some concern came up recently when I was discussing this with a colleague who is very familiar with the genetic research field. My impression and his, is that nearly all biologists (and many existing databases/tools/repositories) communicate in terms of 1-based genomic coordinates. For example, scientific papers typically refer to a specific locus by a position which is implicitly 1-based. I didn’t see a convincing amount of discussion of this topic in past GitHub issues (#5, #93), and I’m unsure as to what message the choice of a 0-based coordinate is sending. I can imagine the GA4GH taking one of three positions on the matter:

  1. Our API is 1-based. Biologists and existing databases will use a 1-based coordinate system.
  2. Our API is 0-based. Clients of our API will add 1 downstream, thereby preserving the existing 1-based coordinate system used by biologists.
  3. Our API is 0-based. Biologists should shift towards using a 0-based coordinate system.

Computer scientists including myself dislike (1) for reasons of consistency. (2) seems very error-prone to me, as all clients must remember to add one before presenting results to the user. Off by one errors will be rampant; for instance, it might be difficult to pipe one GA4GH tool into another, as both would assume 0-based input and 1-based output. (3) would require a large amount of migration and community buy-in; I’m unqualified to assess the impact.

Currently, the GA4GH is either in camp (2) or (3). I’m trying to understand which.

lh3 commented 9 years ago

A lot of bio formats and databases use 0-based coordinates, including most UCSC tables/formats (e.g. BED and BLAT) and most binary formats (e.g. BAM, BCF and CRAM).

pgrosu commented 9 years ago

@calbach, stick with 0-based. You'll begin to love it as the API becomes more complex, and if you'll have to build customized pipelines in the future :)

calbach commented 9 years ago

@lh3 I think you are most closely advocating for (2); correct me if I'm wrong.

From my point of view a file format is just an implementation detail. A GA4GH implementation can of course store the position however they wish internally, I think the interface/library used to access these files makes for a better comparison when discussing an API. I guess for BED your point is valid, as the file format appears also to be the API. However, most sane code I've seen which interacts with BAM parses it as Picard SAMRecords which are 1-indexed. I don't know how CRAM records are typically consumed. Please correct me if I'm wrong, you are definitely better qualified to provide an accurate portrayal of the typical BAM usage patterns. Personally I've always thought of the SAM spec as the "API" with BAM and to a lesser extent CRAM, which I'm less knowledgable about, as more efficient implementations of that API. I assume a similar pattern holds true for VCF and BCF.

Regardless, I think an API must be held to a higher standard for being broadly consumable than does a file format.

calbach commented 9 years ago

But also, point taken that there is significant precedent for 0-based bio data formats.

lh3 commented 9 years ago

Thanks for reminding me that the schema is for APIs, which I frequently forget. That being said, 0-based and 1-based coordinate systems have coexisted for over 10 years. Most UCSC users know the coordinates they get from the table browser are 0-based. Note that I am neutral on 0-based or 1-based coordinates in APIs. Just point out that 0-based coordinates are also frequently seen in web interface, though not as often as 1-based.

PS: in the C BAM/SAM/VCF/BCF APIs, coordinates directly accessed by programmers are all 0-based.

pgrosu commented 9 years ago

@calbach, I know what you mean in that you want to make this perfect, and in this case you can have your cake and eat it too :) So internally if you manipulate sequences directly it would probably be more 0-based, but if you shift to methods/classes (i.e. getOverlaps(ListOfSequences)) then 1-based will become just as easy to work with. In the end, to help out the end-users you can have a number-line (or 0/1-indicator) in the browser/exported file to help them interpret the data properly. If necessary users can just perform a quick update to the data, as shown below. So I generated a very simplistic pipeline to demonstrate this - the little mouse symbol indicates a right-click on the pipeline module, with the associated context menu:

functional analysis pipeline

jmarshall commented 9 years ago

IMHO the correct position for GA4GH to take is to recognise that there are two forms of communication going on here: there is mechanical communication between programs via the API, and there is communication when a program displays output to a human. These aren't the same: the former might be binary while the latter will certainly be textual or graphical; so they needn't be the same in the 0-based v. 1-based detail either.

There is a boatload of previous discussion in #49 too.

(2) Our API is 0-based. Clients of our API will add 1 downstream, thereby preserving the existing 1-based coordinate system used by biologists.

The correct representation for doing arithmetic on intervals is 0-based half-open (see #49 discussion). So the correct representation for mechanical communication for intervals is 0-based half-open and by extension for positions it is 0-based. So IMHO the entire API should be 0-based, because the API is for mechanical communication between programs.

When it comes to presenting results to biological users, many programs will probably choose to do so by presenting them in the familiar (to this human audience) 1-based form in human-readable output. For example, by looking at the source code and table dumps (as @lh3 noted) it is clear that the UCSC browser is internally 0-based but presents web pages to users in a 1-based view.

So I suppose that the position I'm advocating is your (2), except that your characterisation doesn't distinguish between clients producing machine-readable output (cf BED files) that will likely want 0-based coordinates as is, and clients when they are producing human-readable output that will likely want to add 1 to positions and interval start positions just once, in their model-to-view layer.

(2) seems very error-prone to me, as all clients must remember to add one before presenting results to the user.

I suppose there is a trade-off in error-proneness here. Representing indels, computing interval lengths, reverse-complementing coordinates, and interval arithmetic in general are all very error-prone in anything other than 0-based half-open coordinates. IMHO this source of bugs greatly outweighs the smaller source for bugs from forgetting to add 1 in a well-defined model->view translation layer.

Off by one errors will be rampant; for instance, it might be difficult to pipe one GA4GH tool into another, as both would assume 0-based input and 1-based output

I don't think this problem will eventuate. Output suitable for piping into another GA4GH tool will naturally be machine-readable output, and will thus be machine-orientated 0-based. For flexibility, where practical, tools would probably be able to read human-orientated output from other tools too, but such output would identify itself as 1-based and as needing conversion on input.

We have this today: BAM files have 0-based representation, while human-readable SAM files have an externally 1-based representation but are typically converted to their BAM equivalent on input. Programs like samtools read BAM or SAM and convert SAM on I/O as appropriate.

calbach commented 9 years ago

Thanks @jmarshall for the response.

because the API is for mechanical communication between programs.

First and foremost the API exists for human (programmer) comprehension. Defining the API in a language like Avro IDL means we can generate language-specific constructs and can convert to a wire protocol for communication between programs. A machine will not care if the contents of the field are 0 or 1 based, as it will make little impact on performance, unless someone can produce a specific convincing example of why it matters for ga4gh specifically (it generally doesn't). The contents of this field should be optimized for consumption by a programmer; perhaps this is what you were saying.

If you accept this, the question can now be framed as follows: Which of the following bears the lowest cognitive load on the programmer: (1) deviating from the norms of computer science, where arrays/intervals are typically 0-based, or (2) deviating from the 1-based norm of Genomics, and the way in which that value is expected to be presented to some subset of end users?

Note that with (2), the programmer must make a subjective decision to emit the coordinate as 0-based or 1-based given the context. As a programmer I'm happier, more productive, and introduce fewer bugs when I don't have to make such decisions.

There is a boatload of previous discussion in #49 too.

Thanks, I hadn't seen #49. I agree that all other things equal, 0-based half-open intervals are ideal. In this case I just happen to think that all other things are not equal.

So I suppose that the position I'm advocating is your (2), except that your characterisation doesn't distinguish between clients producing machine-readable output (cf BED files) that will likely want 0-based coordinates as is, and clients when they are producing human-readable output that will likely want to add 1 to positions and interval start positions just once, in their model-to-view layer.

All other things equal, I would say that an environment where you have to make this distinction is more complicated than one where you don't.

it is clear that the UCSC browser is internally 0-based but presents web pages to users in a 1-based view.

Okay, I buy that they implemented it this way - why is this superior to using 1 throughout? Also, as I've noted before an internal implementation should be free to choose whatever 0 or 1 or -1 based positioning it desires so long as the API is uniform.

Representing indels, computing interval lengths, reverse-complementing coordinates, and interval arithmetic in general are all very error-prone in anything other than 0-based half-open coordinates.

I don't see a reason why 1-based half-open intervals would be more prone to these errors than 0-based half-open intervals.

I'm pretty sure there's a cost to (2) and I think you agree from your comments. I'm just still failing to see major benefits to the 0-based approach, aside from the fact that I would have made it 0-based if I were designing it from scratch, and it will slightly bother me when I have to use 1-based coordinates.

jmarshall commented 9 years ago

I don't see a reason why 1-based half-open intervals would be more prone to these errors than 0-based half-open intervals.

Then you should try implementing some interval arithmetic!

Suppose you have some feature on an exon, represented as [fs, fe) relative to the start of the exon. In turn, the exon is at coordinates [es, ee) on a chromosome. What are the coordinates of the feature in chromosome space? First write a function to compute the feature's chromosomal coordinates assuming that fs,fe,es,ee are all 1-based; then do it again assuming they are 0-based.

(This is a trivial example. It gets a lot less fun when you have to negate and reverse things due to strandedness.)

calbach commented 9 years ago

Thanks, I see it now - I was previously considering only intervals in chromosome space. This is a good tangible downside to (1). Certainly nested relative 1-based intervals are cumbersome to work with.

malachig commented 9 years ago

Here are some helpful visuals and discussion from Biostars that seem very relevant to this discussion: https://www.biostars.org/p/84686/

JimKent commented 9 years ago

The UCSC coordinates system represents the first 100 bases of a chromosome as 0,100 and the next 100 as 100,200, and so forth. The advantages of this are many. Perhaps most importantly you can represent something (such as a cut or an insertion) that happens between bases by using the same number for start and end. Less important the calculation to get the size of the region is simple. In general it resorts to a lot less + and - 1's in the processing code, though admittedly you have to do bedStart = gffStart - 1 or gffStart = bedStart + 1 conversions when going between formats that use one and ones that use another.

calbach commented 9 years ago

Thanks @malachig. I think this certainly drives home the point that choosing any standard would improve the field.

This is a telling quote:

If it doesn't have off-by-one errors, it isn't bioinformatics. -@dasmoth

calbach commented 9 years ago

Thanks @JimKent. These are properties of closed vs half-open intervals; I believe both should hold true of 1 or 0 based coordinates, right?

lh3 commented 9 years ago

By convention, the 0-based coordinate system uses half-close-half-open intervals; the 1-based uses closed intervals. The two systems have caused enough confusions. Certainly we wouldn't want to see 0-based closed intervals or other combinations (there are 8 in total).

calbach commented 9 years ago

Yes, you're right. 1-based exclusive doesn't really make sense and indeed there are additional downsides to an inclusive range.

I've yet to hear from a proponent of 1-based intervals on this thread which makes me a bit uneasy as I'm confident such people exist, but the arguments around interval arithmetic are fairly convincing so I'll stop playing devil's advocate.

I think the next step here is to document a disclaimer and given example scenarios where a client might want to 'add one' before presenting the position to a user. One example scenario I have in my head is where a user wants to build a quick weekend project which visualizes some data in a ga4gh repository (say, reads). If the user doesn't know any better, positions in this visualization will wind up being off-by-one.

To be concrete, should the ga4gh documentation recommend adding 1 for commandline I/O and/or GUIs? Or should it be silent on the issue?

I will send a pull request with the documentation change on GAPosition, which minimally makes note of the fact that in the genomic field of research a genomic coordinate is typically 1-based.

lh3 commented 9 years ago

To be concrete, should the ga4gh documentation recommend adding 1 for commandline I/O and/or GUIs? Or should it be silent on the issue?

minimally makes note of the fact that in the genomic field of research a genomic coordinate is typically 1-based.

I would just keep silent. Most users understand the implication of 0-based coordinates and know how to deal with them. 1-based coordinates are more popular, but they are not really "typical".

delagoya commented 9 years ago

I agree with @lh3. Best to fully explain what is there, and stay mute on other systems.

fnothaft commented 9 years ago

+1 @lh3 @delagoya