ga4gh-schemablocks / ga4gh-schemablocks.github.io

Website of the GA4GH SchemaBlocks Project
The Unlicense

document genome coordinate use #3

Closed mbaudis closed 1 year ago

mbaudis commented 5 years ago

@andrewyatz @reece @jmarshall As the previous Beacon discussion did show, we need a place to "document standards & conventions used in GA4GH space".

We have just launched the SchemaBlocks initiative, as a community effort to make it easier to document, find & re-use such information. Genome coordinate use conventions represent one of the most obvious, most important aspects here. I would be very happy if one or all of you could provide documentation/recommendations here!

The placeholder page is here: https://github.com/ga4gh-schemablocks/ga4gh-schemablocks.github.io/blob/master/pages/_formats/genome-coordinates.md - but feel free to add more pages for different aspects/topics.

The web page will be rendered here https://schemablocks.org/formats/genome-coordinates/ (with some delay after merging etc.).

Let me know if you need access rights...

jmarshall commented 5 years ago

I think we are all agreed that “0-based half-open” or “interbase” (whichever name you choose to use for the same representation and mindset) is the correct representation. Where we disagree is on whether that was ever promulgated as a policy or standard by GA4GH.
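To make the convention under discussion concrete, here is a minimal sketch of how a familiar 1-based closed interval maps to the 0-based half-open form. The function names are hypothetical and chosen only for illustration; they are not taken from any GA4GH schema or library.

```python
def one_based_closed_to_zero_based_half_open(start, end):
    """Convert 1-based inclusive [start, end] to 0-based half-open [start, end).

    Hypothetical helper for illustration only.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the end keeps the same integer value.
    return start - 1, end


def zero_based_half_open_to_one_based_closed(start, end):
    """Convert 0-based half-open [start, end) back to 1-based inclusive."""
    if start < 0 or end <= start:
        # An empty interval (start == end) has no 1-based closed equivalent.
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def interval_length(start, end):
    """Length of a 0-based half-open interval is a plain subtraction."""
    return end - start


seq = "ACGTACGT"
# Bases 3..5 in 1-based closed coordinates are G, T, A.
s, e = one_based_closed_to_zero_based_half_open(3, 5)
sub = seq[s:e]  # Python slices are themselves 0-based half-open
```

In this form the length is `end - start` with no `+ 1`, and two intervals are adjacent exactly when one's end equals the other's start.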

SchemaBlocks could document the convention, but does SchemaBlocks have the authority to decree that it is a GA4GH standard? If not, who does? I suppose that is one of the “governance structures” being developed…

(Does anyone have a cache of historical GA4GH presentations and slides?)

mbaudis commented 5 years ago

The GA4GH schema has inline documentation (e.g. our archived version here https://github.com/ga4gh-metadata/ga4gh-schemas-legacy/blob/metadata-integration/src/main/proto/ga4gh/variants.proto#L168 ...). The documentation https://ga4gh-schemas.readthedocs.io/en/latest/schemas/variants.proto.html (well, rendered from the schema) doesn't have more information:

So, based on this & the previous discussions, it would be great if someone could write this up in a nice way, with some "user friendly" syntax.

(IMO integrating or at least pointing to Obi's reference explanation would be nice https://www.biostars.org/p/84686/ ...).

Note: I try to muster engagement here...

jmarshall commented 5 years ago

I am happy to collate the good stuff from the previous discussions and other sources (biostar postings, the interbase-style description in the Exonerate man pages) and write it up in markdown, especially if it means I never have to discuss coordinate schemes again.

But what is the procedure to bless such documentation such that @andrewyatz will feel able to stop saying “it may be a good idea, but it is not a GA4GH standard”? 😛

mbaudis commented 5 years ago

Well, oligarchic decision making, I guess :-)

But seriously: This is for us also now a test-bed of how to elevate such recommendations towards "standard level". My current concepts (which have to be formalised on the go) are:

jmarshall commented 5 years ago

Apologies for the delay. As a result of this issue discussion, I had been drafting such a document.

This is now PR #6. It would be good to combine the best parts of that and PR #4…

jmarshall commented 5 years ago

Both PRs #4 and #6 have now been merged as separate web pages.

As @mbaudis noted on #4, that page could be condensed a bit — and there is quite some duplication of information between the two pages. IMHO the best way to deal with this is to merge the useful contents back into a single page.

This issue (needing to document genome coordinate use) arose after ga4gh-beacon/specification#251, which was somebody expecting Beacon to use the familiar human-readable 1-based coordinate system. Similarly in the GA4GH schema days, there was ga4gh/ga4gh-schemas#121, which was somebody wondering why the Schema APIs did not use the familiar human-readable 1-based coordinate system.

IMHO what is needed in a genome-coordinates SchemaBlock is a description of the recommended 0‑based half-open / interbase representation (in both those guises equally) and a convincing explanation of why this is a better choice for APIs than the familiar 1-based coordinate system.

Anything beyond that is a distraction.

The pages as currently merged to master somewhat overconstrain the representation (being distracted by integer widths and non-circular sequences) and have crept towards expressing a preference for the interbase interpretation of this coordinate system. IMHO this is unnecessary and detracts from the primary mission of this SchemaBlock.

For those used to the familiar 1-based representation, both the interbase mindset and the up-to-but-not-including half-open mindset require non-trivial leaps of imagination to conceptualise. Once adopted, both mindsets model indels and other between-base events well. It is a matter of individual opinion which leap is the easier to make, and GA4GH should not be in the business of promoting one or the other, certainly not on a page that is trying to make a recommendation to use something other than the familiar 1-based representation.
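As an illustration of how both mindsets handle between-base events with the same integers, here is a minimal sketch. The `apply_edit` helper is hypothetical, not part of any GA4GH specification. Read as interbase, `start == end` names a single gap between bases; read as half-open, it names an empty interval. The numbers are identical either way.

```python
def apply_edit(seq, start, end, replacement):
    """Replace seq[start:end) with replacement.

    Coordinates are 0-based half-open (equivalently, interbase).
    Hypothetical helper for illustration only.
    """
    if not 0 <= start <= end <= len(seq):
        raise ValueError("expected 0 <= start <= end <= len(seq)")
    return seq[:start] + replacement + seq[end:]


ref = "ACGT"

# Insertion between C and G: a zero-length interval, start == end == 2.
inserted = apply_edit(ref, 2, 2, "TT")

# Deletion of "CG": interval [1, 3) replaced by the empty string.
deleted = apply_edit(ref, 1, 3, "")

# Single-base substitution of the first base: interval [0, 1).
substituted = apply_edit(ref, 0, 1, "G")
```

A 1-based closed interval cannot express the insertion case directly, since it has no way to denote an empty span; that is one practical argument for the recommended representation.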

In PR #7 I've tried to merge the two pages, carrying over the additional useful information from the shorter one, and discussing the two mindsets equally. IMHO this serves the intended audience best.

mbaudis commented 5 years ago

@jmarshall @andrewyatz So following the new version by @jmarshall, I have used his last version as the "recommendation" page, while moving the "current use" part to the "Genome Coordinate Use in GA4GH" page; which has now only a general in-text recommendation and points to the "Recommendation - Genome Coordinates" document. I've kept the "(DRAFT)" there, as long as we don't have a documented approval process for a SchemaBlocks "Recommended" category.

andrewyatz commented 5 years ago

I've been reading through the two documents and going back to the original Google document I wrote. I can see both of your points here and how you're both trying to push this.

Overall I agree with John's point that splitting the document in two, together with the current focus of the main document, dilutes the recommendation we're trying to reassert. Stopping me from saying anything along the lines of “it may be a good idea, but it is not a GA4GH standard” is also an excellent goal, which I support. I'd like to take what we currently have and attempt a new iteration of the unified document that puts the decision to use interbase first, explains the convention, and retains the background to provide historical context and respect historical decisions.

I'll try to retain current commits & authorship to the best of my ability.

Hope this is okay with you both.

mbaudis commented 5 years ago

@andrewyatz @jmarshall O.k.; so then 1 document, and apologies if I messed something up => please go ahead ...

andrewyatz commented 5 years ago

Don't apologise, @mbaudis; we're going to be in a stronger position because of all of our efforts here.

jmarshall commented 5 years ago

My bottom line is that “0-based half-open” and “interbase” must be treated equally in this recommendation. The interbase advocates can have VMC and advocate for it there, and that is fine.

But this recommendation is about why not to use the familiar 1-based representation. I would not support this recommendation if it got distracted into advocating for the interbase interpretation (for the reasons I tried to describe in https://github.com/ga4gh-schemablocks/ga4gh-schemablocks.github.io/issues/3#issuecomment-465218396 about leaps of imagination).

andrewyatz commented 5 years ago

Sorry for not coming back on this for so long. I will get onto this soon

andrewyatz commented 5 years ago

See #11

jmarshall commented 5 years ago

I see #11 has now been merged, without public discussion. Perhaps it was discussed at the Hinxton meeting (which unfortunately I was unable to attend), but if so that is not reflected in the minutes.

In PR #7 I admittedly removed most of the verbatim text that @andrewyatz proposed, incorporating salient parts of it into the text that I had written and anticipating future PRs that would augment that by reinstating the table and other sections that provided interesting additional information (https://github.com/ga4gh-schemablocks/ga4gh-schemablocks.github.io/pull/7#issuecomment-470063807).

Instead PR #11 reinstates wholesale all the text that IMHO is a distraction (https://github.com/ga4gh-schemablocks/ga4gh-schemablocks.github.io/issues/3#issuecomment-465218396), including some of the errors that I have previously noted (https://github.com/ga4gh-schemablocks/ga4gh-schemablocks.github.io/pull/4#issuecomment-465218193), and removes most of the text that I wrote.

Fair enough; in 68972e8a8bfcdb65a52af5171b52dc3c8127a127 I have completed that removal. To be clear: I am withdrawing my permission for SchemaBlocks to use any text that I have written or images that I have drawn.

mbaudis commented 5 years ago

@jmarshall Apologies - that was confusion on my part, then; I was acting on the belief that the 2 attempts were aligned, and then did not see any further changes/comments/discussions after #11 on March 26.

I am acting here solely as arbiter - so please can you align this @andrewyatz & @jmarshall ?