ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
214 stars 114 forks source link

Moving over to per-base graphs? #444

Open adamnovak opened 8 years ago

adamnovak commented 8 years ago

As per discussion between @ekg, @richarddurbin, @haussler, myself, and I think a few others, I'm opening an issue to propose moving the graph schemas from side graphs (where bases occur on sequences, and edges connect to the sides of bases in sequences) to a more per-base conceptual model.

In a "base graph" or whatever we want to call it, each individual base is conceptually its own node, with its own ID, and joins can connect either side of one base to either side of any other base.

For efficiency, we (probably have to) group bases together into sequences, where the bases in a sequence have sequential IDs and an implied join from the end of each base in the sequence to the start of the next base (the one with the next highest ID). In this view, sequences themselves are not really first-class objects, but merely collections of bases, defined by a start base ID and a length. Grouping bases into sequences is merely a way of compressing the underlying base graph.

I like this model a lot better, because it makes subgraphs much more natural (see the long discussion in #275). It also lends itself quite well to a succinct data structure representation, letting us avoid some of the size issues we might be liable to run into with the SQLite database we currently have. Finally, it lets us give every base a stable numerical identifier, while also allowing bases to be efficiently removed from the graph without recourse to some kind of ignore list (since the sequence with the bad base can be cut without changing any base IDs).

Are there any reasons (besides backwards compatibility with the existing graph API, which I don't think we need to worry about too much at this point) not to do it this way?

jeromekelleher commented 8 years ago

Interesting. To me it seems more intuitive, and lines up nicely with our PRG model. I guess we'd need to write down what the advantages of the side graph were over other alternatives, and see how this representation stacks up.

One obvious question: how do we manage the notion of the 'golden path' corresponding to the current linear reference? This was easy in the side graph.

adamnovak commented 8 years ago

One way to manage that notion is to specify your graph so that the GRCh38 chromosomes are all representable by single sequences, coming one after the other in ID space, and then treat those ID ranges and their implicit joins as the primary path, sort of like what we do now.

You can also send down the subgraph (as sequence ID ranges and necessary explicit joins) corresponding to the GRCh38 primary path, no matter how it is embedded in the graph. However, the order in which it was to be traversed would not really be specified.

In general, we might want to have a way to specify a named path or path collection in the graph, much as VG does.

richarddurbin commented 8 years ago

I am keen about this idea.

The route forward is to prototype it on the full human genome scale. We backed out the side graphs from the main GA4GH branch, so, although we thought a lot about them and want to keep what we learned from that, we have not locked them in and are free to find the right long term solution.

The current regional pilots are important, but we need to be comfortable with a full genome practical implementation before committing to a standard. VG is aimed at providing one option for this.

Another important factor is that we can separate the internal representation of the software/storage from the exchange API so long as they are equally powerful and efficiently interconvertible. The API requires reference stability, which is the advantage of side graphs which build progressively on canonical components. Using single identified bases as the primitives seems potentially even better in that direction.

Richard

Sent from my iPhone

On 23 Oct 2015, at 4:03 am, Jerome Kelleher notifications@github.com wrote:

Interesting. To me it seems more intuitive, and lines up nicely with our PRG model. I guess we'd need to write down what the advantages of the side graph were over other alternatives, and see how this representation stacks up.

One obvious question: how do we manage the notion of the 'golden path' corresponding to the current linear reference? This was easy in the side graph.

— Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

lh3 commented 8 years ago

So, a side graph labels a base by a tuple (seqName,seqPos) while a base graph labels a base by an integer ID and keeps the ID<=>seqName relationship in a separate table. Is this the major literal difference? I understand the major conceptual difference is to demote the role of linear sequences. I am not sure yet what this will lead to.

ekg commented 8 years ago

the major conceptual difference is to demote the role of linear sequences

We retain the importance of linear sequences so long as they can described in the graph. I think this is essential for any practical graph reference model, and it opens up some nice possibilities. For instance, you could retain several reference sequences in the same graph. Similarly, we might be able to shift to using the ancestral reference for analyses that benefit from that, while still using annotations derived against the current GRC reference.

vg graphs are very heavyweight when expressed using one-base nodes. I'm not even sure I can efficiently construct one for the whole genome without the memory savings of only specifying an ID every 11bp. That said, as things grow we will tend to shorter and shorter sequences, so this needs to be addressed.

I'm envisioning something like the xg model as the natural way to think about the reference. In that case we have a single vector containing all of the sequences in the nodes of the graph. On top of this we build out edges (joins). Provided we only append to the sequence vector, our sequence space maintains a stable ID as a position in the DNA label vector. Here are two slides describing the rough details of the model:

vg probgen15

This could be implemented differently, depending on taste. If we used the 1bp model the node id / internal rank translation vector could go away, and instead we'd just keep ids based on the rank in the DNA vector ("label" vector in the previous slide). The edge to/from indexes wouldn't be stable at all under the addition of new joins/edges to the system.

vg probgen15 1

Paths would look very similar, but maybe would be more complex if every base was uniquely identified. If global identification were implicit inside of xg then maybe this wouldn't be a problem. For instance, I could have an internal ID set kept for convenience, but report things to the rest of the world relative to offsets in the base DNA vector.

It's not clear to me how to proceed. We want global coordinates but maybe things are not so easy. @haussler's ideas about graph homomorphism (which I still need to grok) might be helpful here. As long as you maintain the same paths in different graphs you have a good chance of describing a morphism that takes you from one to the other, so global coordinates fall out of the maintenance of sequences on the graph.

denizkural commented 8 years ago

Awesome effort! We're in support of "base graphs". We were just beginning to love side graphs.

As @lh3 also touched upon, it would be great to separate out a conceptual definition from implementation issues. For instance as @ekg pointed out there is implementation benefit to process linear parts in some batched fashion. I agree.

One thing we do is use edge-graph, which has implementation advantages. These edge-graphs are a strict subset of vertex-graphs, thus the main disadvantage is that it's a bit less expressive (introducing a conceptual difference), but there are easy ways to get around those limitations in the practical case of representing genomic variation.

Thus we'd appreciate if the base-graph definitions/specs could allow for both vertex-graphs and edge-graphs.

At a methodological level, it would be great to agree on a more appropriate set of formal languages for this type of spec, just so it is clear to more people at which layer of abstraction we'd like these definitions to work at. We could pick multiple languages - but at least for those who reject the more "close-to-machine" ones, there is an opportunity to debate at a more conceptual level. Likewise, for those who agree on the abstract stuff they can duke it out on implementation literals.

ekg commented 8 years ago

@denizkural Do you have any documentation on what an edge graph is and how it's not compatible with the graphs that put sequence labels on their nodes?

benmurraysbg commented 8 years ago

Does this initiative still have steam? It is a while since we heard anything from it. We've been fans of the side-graph representation since it was announced but we can still reasonably understand the benefits of a per-base representation.