ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

URIs in the GA4GH domain for Variation Graph RDF concepts #572

Open JervenBolleman opened 8 years ago

JervenBolleman commented 8 years ago

Some of us would like to work with Variation Graphs using the standardized W3C technologies centered around RDF. We can do this outside of the GA4GH, but we would prefer that it be made part of the GA4GH deliverables.

What is RDF

For those not familiar with RDF (Resource Description Framework, see http://rubenverborgh.github.io/WebFundamentals/ and https://en.wikipedia.org/wiki/Resource_Description_Framework), this is a graph data model where all nodes and edges are identified by URIs. Nodes can also be linked via an edge to literal values. RDF is already used in the life science community, e.g. at UniProt (http://sparql.uniprot.org) and the EBI RDF platform (http://www.ebi.ac.uk/rdf/). RDF data can be queried using a standardized graph query language called SPARQL. There is also a large variety of RDF/SPARQL databases, a number of which are tuned for clinical environments. RDF is an information model with a large number of compatible serializations, e.g. JSON-LD (http://json-ld.org/), RDF-Thrift (http://afs.github.io/rdf-thrift/), HDT (http://www.rdfhdt.org/), and Turtle (https://www.w3.org/TR/turtle/).
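
As a tiny illustration, here is a hypothetical Turtle snippet (the gene URI and the locatedOn property are invented for this example) stating that a resource has a type, a human-readable label, and a link to another resource:

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/> .

    ex:BRCA2 a ex:Gene ;                  # an edge to a class (rdf:type)
        rdfs:label "BRCA2" ;              # an edge to a literal value
        ex:locatedOn ex:chromosome13 .    # an edge to another node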

Proposed concepts

We propose the following concepts, from VG RDF:

  • Node
  • Path
  • Step

Nodes are nodes in a variation graph. They represent a contiguous sequence of DNA (or RNA or amino acids, for an RNA or protein variation graph). A node's sequence can be linked to other nodes' sequences in four different ways: forward to forward, forward to reverse, reverse to forward, and reverse to reverse. Nodes can be one nucleotide or amino acid long, or more, as required.

A path in a variation graph represents a linear sequence: for example a reference genome, but it could also be an Ensembl gene or the assembled genome of a patient.

Steps are a required artifact to link paths to nodes. They make explicit the list/vector/array of nodes that a path visits.
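
To make the three concepts concrete, here is a minimal Turtle sketch of a two-node graph with one path. The class and property names follow the proposed vg schema as rendered at the HTML page linked below; treat the exact spellings here as illustrative rather than authoritative.

    @prefix vg:  <http://biohackathon.org/resource/vg#> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ex:  <http://example.org/graph/> .

    ex:node1 a vg:Node ;
        rdf:value "GATT" ;                   # the sequence this node carries
        vg:linksForwardToForward ex:node2 .  # one of the four link orientations

    ex:node2 a vg:Node ;
        rdf:value "ACA" .

    ex:step1 a vg:Step ;                     # steps make the node order of a path explicit
        vg:path ex:GRCh38chr6 ;
        vg:node ex:node1 ;
        vg:rank 1 .

    ex:step2 a vg:Step ;
        vg:path ex:GRCh38chr6 ;
        vg:node ex:node2 ;
        vg:rank 2 .

    ex:GRCh38chr6 a vg:Path .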

See an HTML rendering of the proposed RDF schema/ontology at http://biohackathon.org/resource/vg.html.

Also see the RDF section of the VG team wiki (https://github.com/vgteam/vg/wiki) for further details and examples of what can be done with variation graphs using off-the-shelf RDF technology.

Desired result

URIs of the type http://biohackathon.org/resource/vg#Node would instead become something like http://purl.ga4gh.org/schema/ontology/variation_graph#Node. These URIs should resolve correctly and support content negotiation.

haussler commented 8 years ago

very cool! I hope we can prototype this out.

JervenBolleman commented 8 years ago

You can try it out here; this is just a tiny sample showing the GRCh38 alternative HLA B-3106. We can do the full graph; it is just very easy in that case to ask for more data than your browser can deal with. It is running on Dydra.com's SPARQL-as-a-service offering, but any standard SPARQL 1.1 endpoint will work.
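
To give a flavour of the queries such an endpoint can answer, here is a sketch that lists the nodes of every path in step order (property names per the proposed vg schema; illustrative):

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX vg:  <http://biohackathon.org/resource/vg#>

    SELECT ?path ?rank ?node ?sequence
    WHERE {
      ?step a vg:Step ;
            vg:path ?path ;
            vg:node ?node ;
            vg:rank ?rank .
      ?node rdf:value ?sequence .
    }
    ORDER BY ?path ?rank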

How to determine node identity (generating URIs for each node in the graph) should be treated in a different issue. There are some ideas on how to do this but no code yet.

ekg commented 8 years ago

@haussler I think all we need is for the GA4GH to agree to host stable URLs with metadata and we are ready to go.

vg can now read and write RDF. Any graph we have made can be converted into triples, and any graph described in triples can be read into vg.

Paths through the graph can represent existing genomic annotations and alignments. By working on the RDF version of the graph, we can query these paths using SPARQL to return relevant parts of the graph, or to get the annotations in a given subgraph.

This means we can embed any annotations into the graph and link to them using semantic web approaches. As a result, we can shift the burden of annotation and annotation-based queries onto generic semantic web technologies rather than attempting to craft our own solution. We will be able to use off-the-shelf components for the linked data describing the RDF version of the GA4GH data model.

vg-RDF makes it easier for us to hook up with existing resources in RDF, like UniProt and Ensembl. I think that represents a huge win.
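
One way this hookup could look is SPARQL 1.1 federation: a single query that walks the local graph and joins against UniProt's public endpoint. This is only a sketch; the ex:encodesProtein property linking a path to a UniProt entry is hypothetical.

    PREFIX vg: <http://biohackathon.org/resource/vg#>
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX ex: <http://example.org/graph/>

    SELECT ?node ?proteinName
    WHERE {
      # local variation graph: nodes visited by an annotated path
      ?step vg:path ?genePath ;
            vg:node ?node .
      ?genePath ex:encodesProtein ?protein .   # hypothetical annotation link

      # remote join against UniProt's public SPARQL endpoint
      SERVICE <https://sparql.uniprot.org/sparql> {
        ?protein up:recommendedName/up:fullName ?proteinName .
      }
    }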

This will still leave us talking about standards for representing things on the graph. That doesn't get easier. However, I think RDF can help us leverage existing work on semantic models. For instance, we might use an off-the-shelf ontology to describe a patient or a medical procedure rather than crafting our own model for this in the API.

benedictpaten commented 8 years ago

+1 this approach, but I'd love to see some data on how many triples a reference genome graph will contain. Surely we're talking about a very sizable data structure just for the graph itself?

PS, @ekg - hope you can join the ref-var call to discuss the gPBWT implementation in xg tomorrow.

JervenBolleman commented 8 years ago

@benedictpaten the VG team wiki contains some information on that. The 1000 Genomes UK + GRCh37 graph is about 2 billion triples (only 10% of the current UniProt, http://sparql.uniprot.org/). That sounds like a lot, but once loaded into a SPARQL database such as Virtuoso it takes only 51 GB of disk space, including indexes. If required, that could be reduced further; different SPARQL databases will give different exact numbers.

Variation-graph-specific software can achieve higher compression ratios. However, as it becomes more and more general, e.g. by including arbitrary annotation, the variation tool will become a more general database and the compression rate will drop.

For VG it is quite possible to run SPARQL directly against the VG data structure, given some coding time. This is not terribly difficult code to write, given an existing open-source SPARQL engine, as xg is a data structure that is extremely similar to straightforward SPARQL disk structures. This would be an interesting project for a hackathon. The main problem is that there is no good open-source SPARQL engine in C++, but FFI from Ruby, Python or Perl, or Java JNI, hooking up to VG is a workable approach.
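
As a sketch of that FFI approach: Python's rdflib lets you plug a custom Store underneath its SPARQL engine by answering triple patterns. Everything xg-related below (FakeXG and its two methods) is a hypothetical stand-in for real vg/xg bindings, just to show the shape of the code:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import RDF
    from rdflib.store import Store

    class FakeXG:
        """Stand-in for real FFI bindings to an xg index, so the sketch runs."""
        def node_ids(self):
            return [1, 2]
        def node_sequence(self, node_id):
            return {1: "GATT", 2: "ACA"}[node_id]

    class XGStore(Store):
        """Read-only rdflib Store answering triple patterns from an xg index."""
        def __init__(self, xg_index):
            super().__init__()
            self.xg = xg_index

        def triples(self, pattern, context=None):
            s, p, o = pattern
            # Only the node-sequence pattern is handled here; a real store
            # would also answer the link and step predicates.
            if p is None or p == RDF.value:
                for node_id in self.xg.node_ids():
                    subj = URIRef(f"http://example.org/graph/node/{node_id}")
                    if s is not None and s != subj:
                        continue
                    seq = Literal(self.xg.node_sequence(node_id))
                    if o is None or o == seq:
                        yield (subj, RDF.value, seq), iter(())

    g = Graph(store=XGStore(FakeXG()))
    for n, seq in g.query("SELECT ?n ?seq WHERE { ?n rdf:value ?seq }"):
        print(n, seq)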

Also, this is a minimal conceptual schema; one could compress further by leaving more triples implicit, as well as by subpath sharing. But I wouldn't advise making the schema more complicated before there is a demonstrated need for it.

benedictpaten commented 8 years ago

That's actually not too bad! Two things:

-- I believe that individual genomes, stored as sets of paths through the graph, will need to be stored in a less general, more efficient data structure, because we will have hundreds and ultimately hundreds of thousands of them. Adam has been developing the gPBWT, which is a succinct store for paths on the graph that extends Richard Durbin's PBWT. It would be great to think through how that would interface with the linked data, or whether it is simply stored and accessed through a different interface.

-- It would be fun to present use-case queries that a genome SPARQL graph could answer. There is a need to get everyone on the same page about this. Would you and @ekg be interested in presenting on the ref-var call (the next one will be Wednesday 30th at 9am Pacific/6pm Paris)? I am imagining a world where we have agreed on a reference genome graph in which all the common variants are named, and we are using linked data to associate variants with biomedical information in a very flexible, powerful way, with the full power of a relational (albeit slow) interface built upon it! I know others (you, Toshiaki, @ekg, etc.) have thought about this for a while; others of us are just catching up ;)

haussler commented 8 years ago

This would all be hugely awesome! Gamechanger! -D

JervenBolleman commented 8 years ago

I believe that individual genomes, stored as sets of paths through the graph, will need to be stored in a less general, more efficient data structure, because we will have hundreds and ultimately hundreds of thousands of them. Adam has been developing the gPBWT, which is a succinct store for paths on the graph that extends Richard Durbin's PBWT. It would be great to think through how that would interface with the linked data, or whether it is simply stored and accessed through a different interface.

Let's do some stupid napkin math.

3 billion base pairs; nodes are on average 10 base pairs in VG -> 300 million steps per genome. 3 triples per step; 25 bytes per triple (as observed for the VG 1000G RDF loaded in a database).

300 million * 3 * 25 bytes => 22.5 GB per path/patient/genome

Which for 100,000 genomes would be 2.25 petabytes. Which is way too large.

Take the subpath optimisation and assume 1% divergence from a reference genome. Then you have 3 million steps per genome and end up with 22.5 terabytes for 100,000 genomes.

Assume an R2RML (https://www.w3.org/TR/r2rml/) or CS (http://homepages.cwi.nl/%7Educ/papers/emergentschema_www15.pdf) approach. You end up with something like 8 TB.

1% divergence is way too high for an average between humans; that is 100x more than the 1000 Genomes results suggest.

Now point out the stupid mistake in my math that shows I am off by an order of magnitude or five ;)
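
For anyone checking, a few lines reproducing the arithmetic above (same assumptions, nothing new):

    GB, TB, PB = 1e9, 1e12, 1e15              # decimal units

    steps_full = 3_000_000_000 / 10           # 3 Gbp at ~10 bp per node -> 300 M steps
    bytes_per_step = 3 * 25                   # 3 triples per step, 25 bytes per triple

    per_genome = steps_full * bytes_per_step
    print(per_genome / GB)                    # 22.5 GB per path/genome
    print(per_genome * 100_000 / PB)          # 2.25 PB for 100,000 genomes

    steps_diverged = steps_full * 0.01        # subpath optimisation, 1% divergence
    print(steps_diverged * bytes_per_step * 100_000 / TB)  # 22.5 TB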

It would be fun to present use-case queries that a genome SPARQL graph could answer. There is a need to get everyone on the same page about this. Would you and @ekg be interested in presenting on the ref-var call (the next one will be Wednesday 30th at 9am Pacific/6pm Paris)? I am imagining a world where we have agreed on a reference genome graph in which all the common variants are named, and we are using linked data to associate variants with biomedical information in a very flexible, powerful way, with the full power of a relational (albeit slow) interface built upon it! I know others (you, @ktym, @ekg, etc.) have thought about this for a while; others of us are just catching up ;)

SPARQL is a generic query language, so any query you want to run on the VG is a possibility: any question you wanted to ask of variation graphs should be answerable with SPARQL. My personal interest is to have UniProt annotation from the protein space available at the genome level, i.e. protein variation graphs; especially the disease-related data that we have curated.

My second interest is that we are often bandwidth-limited when dealing with large remote datasets, so I much prefer uploading 5 MB of queries over downloading 5 TB of data to run a query locally.

I will try to be available on Wednesday the 30th at 6pm Paris time. That is a rather inconvenient time for me, but I will try to make myself available. @benedictpaten, could you send the details to my work e-mail?

benedictpaten commented 8 years ago

Sounds good, but what is the subpath optimisation?

Done. Shall I pencil you in for a brief presentation? I appreciate the time sucks if you're in Switzerland, sorry! I think the group as a whole could do with a very short RDF/linked data primer and a brief overview of the preliminary work you've done creating an RDF vg graph.

JervenBolleman commented 8 years ago

@benedictpaten There is a small primer on RDF in relation to VG in the vgteam wiki (https://github.com/vgteam/vg/wiki/RDF:-for-VG).

Subpath optimisation is, in CRAM terms, just straightforward reference-based compression. For now the model above is very simple: all paths are written out fully and completely. Subpaths would mean we no longer write out the full path, but instead just say where it differs from the reference.

benedictpaten commented 8 years ago

Awesome!

I agree that compression is possible, but how do you specify "don't know" if you assume the reference when not specified? Furthermore, an individual is a pair of haplotypes, so you minimally need to double your estimate ;)

JervenBolleman commented 8 years ago

@benedictpaten Sorry for not being clear. The compression comes from, instead of listing each step individually, explicitly stating "now follow X steps on the reference, starting at reference step Y". So there are no assumptions about null or missing values having any meaning; that would be very un-RDF-like ;)
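
A hypothetical Turtle sketch of that idea; none of these properties are in the proposed schema, they just illustrate what an explicit "follow X steps from reference step Y" statement could look like:

    @prefix ex: <http://example.org/graph/> .

    # Instead of thousands of individual vg:Step resources, one explicit
    # range statement (every property here is hypothetical, for illustration):
    ex:patientPath ex:sharesSubpath [
        ex:onPath         ex:GRCh38chr6 ;     # the reference path followed
        ex:startingAtStep ex:refStep_104200 ; # first shared reference step
        ex:stepCount      5000                # number of steps to follow
    ] .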

benedictpaten commented 8 years ago

Got it. Thanks!
