ga4gh-beacon / specification

GA4GH Beacon specification.
Apache License 2.0

Start should be 1-based, not 0-based #251

Closed teemukataja closed 5 years ago

teemukataja commented 5 years ago

The Beacon specification describes the start key as 0-based, while the VCF specification describes the position as 1-based: "POS - position: The reference position, with the 1st base having position 1."

I verified this using the IGV genome browser. Upon further research, other genomic file types also report using the 1-based system.

mbaudis commented 5 years ago

The decision early on was to follow GA4GH standards, which are 0-based half-open.

(The lack of clear documentation of "GA4GH standards" strikes again ...).

So, 0 based it should be.

teemukataja commented 5 years ago

I would like to understand this use case and couldn't find anything in the past issues. Can you point me to where I can find information on this decision, if no such standards document exists?

mbaudis commented 5 years ago

@teemukataja

  1. in VMC, which constitutes the main active GKS project https://docs.google.com/document/d/12E8WbQlvfZWk5NrxwLytmympPby6vsv60RxCeD5wc1E/edit (see page 16)
  2. in the frozen GA4GH schema https://github.com/ga4gh/ga4gh-schemas/blob/master/src/main/proto/ga4gh/variants.proto#L168

Hope this helps ...

teemukataja commented 5 years ago

Thank you.

teemukataja commented 5 years ago

@mbaudis

The VMC data model on page 16 suggests that nucleotides follow a 1-based counting convention.

Upon reading more about bases and interbases, I strongly feel that Beacon should follow the standards of genomic files such as VCF, which uses the 1-based system. Because the 1-based system tells you the position of the base of interest, I think it fits the role of Beacon better. Interbase coordinates might be better for applying data science to datasets, but the role of Beacon is to find those datasets first.

mbaudis commented 5 years ago

Interbase coordinates

I'm not married to any concept, but there have been endless discussions already. Also, this is a clear case where Beacon just has to pick up whatever format is selected as a "GA4GH standard". File formats and browsers all use different coordinate systems.

Quote:

Moving from UCSC browser/tools to Ensembl browser/tools or back
* Ensembl uses 1-based coordinate system
* UCSC uses 0-based coordinate system
* Some file formats are 1-based (GFF, SAM, VCF) and others are 0-based (BED, BAM)
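As an illustration of what moving between these conventions involves, here is a minimal Python sketch. The function names are hypothetical and not part of any specification mentioned in this thread; it assumes "1-based" means 1-based inclusive (GFF/SAM/VCF style) and "0-based" means 0-based half-open (BED style).

```python
from typing import Tuple


def to_zero_based_half_open(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive interval to 0-based half-open."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the inclusive end equals the exclusive end.
    return start - 1, end


def to_one_based_inclusive(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open interval to 1-based inclusive."""
    if start < 0 or end <= start:
        # An empty interval (start == end) has no 1-based inclusive form.
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


# A single base at VCF POS 100 becomes start=99, end=100.
print(to_zero_based_half_open(100, 100))  # (99, 100)
```

Under these assumptions only the start coordinate changes, which is why a start value that is off by one is the typical symptom when the two systems are mixed up.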

Pinging @andrewyatz @reece ...

andrewyatz commented 5 years ago

You pinged?

With reference to GA4GH, there is additional context available from @jmarshall's comment on a PR of mine for refget. The use of 0-based, inclusive coordinates is now a convention of GA4GH specifications. It certainly isn't a standard. If it were, this has been left behind in the pre-Infinity War-like snap of GA4GH, but the vague notion that we prefer 0-based, inclusive pervades.

mbaudis commented 5 years ago

That could have been me, repeatedly, again & again ;-)

"GA4GH really needs to have this kind of decision record in an easy to reference place"

So: Maybe this would be a tangible GKS product before Christmas - just confirm coordinate system & document the choice?

I offer schemablocks.org :-)


jmarshall commented 5 years ago

I would like to understand this use case

It is clear that 0-based half-inclusive intervals are the appropriate representation to use for arithmetic (and hence machine communications). If this doesn't seem clear to you, reread the epic threads linked to in the comment (https://github.com/samtools/hts-specs/pull/327#issuecomment-411458808) that @andrewyatz pointed to.
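To make the arithmetic argument concrete, here is a small Python sketch comparing the two conventions. The helper names are hypothetical; the point is only that half-open intervals need no +1/-1 corrections for length or splitting, while inclusive intervals do.

```python
from typing import Tuple

Interval = Tuple[int, int]


def length_half_open(start: int, end: int) -> int:
    """Length of a 0-based half-open interval [start, end)."""
    return end - start


def length_inclusive(start: int, end: int) -> int:
    """Length of a 1-based inclusive interval [start, end]."""
    return end - start + 1


def split_half_open(start: int, end: int, at: int) -> Tuple[Interval, Interval]:
    """Split [start, end) at 'at'; the two halves share the boundary value."""
    return (start, at), (at, end)


def split_inclusive(start: int, end: int, at: int) -> Tuple[Interval, Interval]:
    """Split [start, end] so that 'at' is the first base of the right half."""
    return (start, at - 1), (at, end)


# The same five bases in both systems.
print(length_half_open(10, 15))     # 5
print(length_inclusive(11, 15))     # 5
print(split_half_open(10, 15, 12))  # ((10, 12), (12, 15))
print(split_inclusive(11, 15, 13))  # ((11, 12), (13, 15))
```

In the half-open form, adjacent intervals meet where one interval's end equals the next interval's start, so lengths add up without adjustment.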

[So it's obviously the right representation for APIs; whether it's GA4GH's policy to use this representation is a separate question.]

As Beacon is a web service API, its purpose is machine communications, and therefore 0-based half-inclusive is the natural representation. This statement about purpose is a bit more quibble-able, so GA4GH codified this choice as a policy, or a “standard” if you will. (Those of us who were there at the time remember this use of the word “standard” with relief, as it was the end of endless discussions!) This is reflected in secondary sources such as the htsget spec:

We use the following pan-GA4GH standards:

  • 0 start, half open coordinates

Tragically the primary sources (some GA4GH press release or minutes of some meeting), if any, have been obfuscated by subsequent web site reorganisations…

reece commented 5 years ago

Humans use 1-based inclusive. That shouldn't and won't change.

Interbase coordinates are conceptually cleaner than inclusive coordinates (regardless of base), especially when distinguishing insertions and deletions, and for edits at the termini. APIs should use interbase.

I can't think of any technical benefit for 0-based inclusive coordinates.

jmarshall commented 5 years ago

(For the avoidance of doubt,) “interbase” and “zero-based half-inclusive” are two names for the same representation (and the latter name is more formally “zero-based half-open” I guess). I think in @andrewyatz's comment he was meaning the latter but inadvertently elided the “half-”.

reece commented 5 years ago

@jmarshall: Funny, I removed a point clarifying this because I thought it was a distraction. I guess I should have left it in.

Although interbase and 0-based, right-open are numerically equivalent, they're semantically distinct. Interbase provides important conceptual clarity.

0-based, right-open refers to residues, which makes it awkward to refer to insertion points at the termini because you have to refer to imaginary residues. Also, with residue-based coordinates, insertions use exclusive coordinates but deletions and substitutions use inclusive coordinates. That is, 5_6 refers to the space between 5 and 6 for an insertion, but refers to 5 and 6 inclusively for a deletion or MNV.
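A minimal Python sketch of the interbase view, for illustration only: the function `apply_edit` is hypothetical, and it assumes interbase positions number the gaps between residues from 0 to the sequence length. Under that assumption a single (start, end) pair covers insertions, deletions, and substitutions uniformly, including at the termini.

```python
def apply_edit(seq: str, start: int, end: int, alt: str) -> str:
    """Replace the residues between interbase positions start and end with alt.

    start == end denotes a zero-width insertion point; alt == "" is a deletion.
    """
    if not 0 <= start <= end <= len(seq):
        raise ValueError("expected 0 <= start <= end <= len(seq)")
    return seq[:start] + alt + seq[end:]


seq = "ACGT"
print(apply_edit(seq, 0, 0, "TT"))  # TTACGT  (insertion before the first residue)
print(apply_edit(seq, 4, 4, "TT"))  # ACGTTT  (insertion after the last residue)
print(apply_edit(seq, 1, 3, ""))    # AT      (deletion of CG)
print(apply_edit(seq, 1, 3, "TA"))  # ATAT    (substitution of CG with TA)
```

No imaginary residues are needed for the terminal insertions, and the same coordinate pair means the same span whether the edit is an insertion or a deletion.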