ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Formal definition of the data model needed #380

Open jeromekelleher opened 8 years ago

jeromekelleher commented 8 years ago

Currently, the data model for container types (Datasets, VariantSets and ReadGroupSets) in the GA4GH API is vague and ill defined. Where definitions are given, they refer back to the data file format on which they are based. This is fine for structural purposes, but does little to clarify the semantic meaning of the collection. There are good arguments for allowing flexibility in the interpretation of data types, but there are also some major downsides:

  1. Interoperability. If we leave the interpretation of data types to server implementors, their methods of partitioning equivalent data among the various types will differ. Therefore, as a client-side developer, if I write an analysis of (say) 1000G phase 1 data on server implementation A, I may have to substantially rewrite it to run on implementation B.
  2. Content addressability. Unless the semantics of what a particular container holds are rigorously defined, it will not be possible to define a method for calculating a content hash (something that is recognised as being highly desirable for many reasons). It's clear how to do this for References, but it is an open problem for all the remaining types.

To start off this discussion I'd like to propose a straw-man model. I'm sure this is overly simplistic, but hopefully it'll be enough to get the discussion started, and we can iterate to something that is (a) flexible enough to be useful in a wide variety of scenarios; and (b) rigorous enough to give us a model we can reason with.

  1. A ReferenceSet is a collection of References, and is defined as it currently is.
  2. A Sample is an identifier associated with some biological observation. This might be an individual human tissue sample, or a bucket of sea water in a metagenomics study.
  3. A DataSet is a collection of primary data for a defined set of Samples, and the inferences made using these data.
  4. A DataSet contains a collection of ReadGroupSets. There may be many ReadGroupSets associated with a given sample (unmapped data, mapped to different references, different mappers/parameters).
  5. A DataSet contains a collection of VariantSets. A VariantSet is a collection of Variants inferred from the data by some methodology.
  6. A Variant contains Calls, which refer to a particular Sample.

(I have omitted the CallSet here, as I've never managed to understand what it was, other than some metadata associated with a sample. )

The general idea is to try to encode the provenance of the inferences made on the primary data explicitly as part of the data model, and to explicitly link up the original samples to the variant calls. As I say, I'm sure this straw man is far too simplistic, but I hope it'll start some discussion from which we'll get a well-defined data model.
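To make the straw man concrete, here is a minimal proto3 sketch of the model. The message and field names are hypothetical and purely illustrative; none of them are current schema definitions.

syntax = "proto3";

message Sample {
  // An identifier for a biological observation (a tissue sample,
  // a bucket of sea water in a metagenomics study, ...).
  string id = 1;
  string name = 2;
}

message DataSet {
  string id = 1;
  // The defined set of Samples this DataSet covers.
  repeated string sample_ids = 2;
  // Primary data: ReadGroupSets derived from those Samples (unmapped,
  // or mapped with different references/mappers/parameters).
  repeated string read_group_set_ids = 3;
  // Inferences: VariantSets called from the primary data.
  repeated string variant_set_ids = 4;
}

message VariantSet {
  string id = 1;
  string dataset_id = 2;
  // Provenance: which ReadGroupSets and which methodology produced the calls.
  repeated string source_read_group_set_ids = 3;
  string methodology = 4;
}

message Call {
  // Each Call refers back to a particular Sample.
  string sample_id = 1;
  string variant_id = 2;
}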

diekhans commented 8 years ago

+1 Without a rigorous definition of the data model, we will end up with highly divergent data, even with the same schemas.

The definition of call set relates to: https://github.com/ga4gh/schemas/issues/379

A `CallSet` is a collection of variant calls for a particular sample. It belongs to a `VariantSet`. This is equivalent to one column in a VCF.

One could also have multiple call sets for a given sample, perhaps predicted with different variant calling algorithms, so a CallSet is also related to the provenance of the calls.
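A minimal proto3 sketch of such a provenance-bearing CallSet (the fields here are hypothetical, not the current schema):

syntax = "proto3";

message CallSet {
  string id = 1;
  // The sample the calls were made for.
  string sample_id = 2;
  // The VariantSet this CallSet belongs to (one column in a VCF).
  string variant_set_id = 3;
  // Provenance: the ReadGroupSets/alignments the calls were derived from.
  repeated string read_group_set_ids = 4;
  // Provenance: the program and version used to produce the calls.
  string calling_program = 5;
  string calling_program_version = 6;
}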

lh3 commented 8 years ago

At present, a submitter submits sequencing data to one of SRA/ENA (or GenBank/EMBL/DDBJ for some other types of data). The data are then synced periodically. For the same type of data, the submitter doesn't submit twice to different repositories. ReadGroup/ReadGroupSet/VariantSet/DataSet are thus consistent across the world. In contrast, right now it is the GA4GH developers who are defining ReadGroup/etc., and interoperability may often be broken. On this problem, we should stick with the old way, letting submitters decide the scope of these concepts so that there is a single source of truth.

CallSet has not been well defined from the very beginning. The initial intention was to incrementally add samples to an existing VariantSet. This is not how things work now and won't work in the future unless we switch to a binary/unary variant model. Nonetheless, CallSet gets one bit correct: (datasetId,sampleName) uniquely identifies a sample. I have long advocated renaming CallSet (when we say "call set" in my circle, we almost never mean the "CallSet" defined by GA4GH) and promoting it to common.avdl, so that reads and variants refer to the same sample in the same way.

Realistically, a VariantSet does not belong to a DataSet. Think about HRC/ExAC: they make one giant VariantSet across many DataSets. It may be tempting to define a new super-DataSet, but we don't have a mechanism to keep track of their relationships. In addition, as @sguthrie said, it is not rare for two VariantSets/VCFs to share some samples, and defining a super-DataSet won't tell you which two VariantSets have the same samples. In both cases, the (datasetId,sampleName) pair still works; that is an advantage of this approach. Another advantage is that the pair abstracts away the biological complication of defining a sample.
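A minimal proto3 sketch of promoting that pair to a shared sample key (a hypothetical message, assuming the (datasetId,sampleName) convention described above):

syntax = "proto3";

// A shared sample key in the common definitions, so reads and variants
// refer to the same sample in the same way.
message SampleId {
  // The dataset the sample was submitted under.
  string dataset_id = 1;
  // The submitter-assigned sample name; (dataset_id, sample_name) is unique.
  string sample_name = 2;
}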

jeromekelleher commented 8 years ago

Thanks for this @lh3, it's very helpful to get your input. I'm sure the model I've outlined above is naive and wouldn't work in the real world. I have no attachment to it at all. What you've outlined in #376 sounds like a good start in tying the data we have together. I don't think we should be afraid to lose the CallSet concept at this point --- I think in practice it would have very little impact on client code if it were replaced.

diekhans commented 8 years ago

A CallSet or its replacement is needed for provenance tracking. What program was used to create these calls? A CallSet that maps to multiple VariantSets, can have a null sampleId, and has no way to indicate which read groups and alignments the calls were made against does not seem to represent the data.

This needs to be thought through based on use cases where N is > 1:

  1. Multiple mapping algorithms, multiple calling algorithms. Can they be represented?


mlin commented 8 years ago

CallSet is not well defined from the very beginning. The initial intention was to incrementally add samples to an existing VariantSet. This is not how things are working now and won't work in future unless we switch to a binary/unary variant model.

This is interesting, I didn't know this original intention. If I can extrapolate from your last comment, the reason fulfilling this intention is problematic is that incrementally adding newly generated call sets to a variant set is a similar problem to merging VCF files, which is a difficult operation to perform non-fallaciously due to issues such as ref/no-call and overlapping alleles/indels, whereas merging unary allele data is more straightforward.

Perhaps one other way to imagine fulfilling that intention would be if all samples were genotyped against a predefined panel of sites & alleles. That actually seems to me like a plausible mode of operation for certain kinds of studies, but it wouldn't be good to overcommit the data model to those kinds.
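A minimal proto3 sketch of such a predefined panel (hypothetical messages, not part of the schemas):

syntax = "proto3";

// A fixed panel of sites and alleles that every sample is genotyped against,
// so new call sets can be appended without re-merging variant records.
message SitePanel {
  string id = 1;
  repeated Site sites = 2;
}

message Site {
  string reference_name = 1;
  int64 position = 2;
  string reference_bases = 3;
  repeated string alternate_bases = 4;
}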

AAMargolin commented 8 years ago

I think it's a mistake to conflate the specific representation of variants/variant sets with the general definition of a DataSet.

If we start with @jeromekelleher's suggestion that a DataSet is defined by a common set of samples, the next thing needed is to define a common FeatureSpace (or multiple FeatureSpaces, provided each is implemented by every Object in the DataSet).

Then, for a given FeatureSpace (say Gene IDs), the dataset would be a hypercube with rows corresponding to features, columns corresponding to samples, and K Objects for each (feature, sample), corresponding to the K data types in the DataSet. This would be:

            Sample_1  Sample_2  ...  Sample_N
Feature_1
Feature_2
...
Feature_M

Let's say we have K=3 data types {DNA mutation, RNA expression, DNA copy number}

DataSet would define a method like getDataObject(featureId, sampleId, objectId), which would return an object type corresponding to the type of data.

So: getDataObject(featureId, sampleId, 1) might return a CallSet (or whatever is the representation of mutation calls for a given feature, sample -- I get confused)

Note I'm defining the elements within the DataSet as Objects (rather than numbers) to retain flexibility for more complex representations and operations that can be performed on them. The key here would be that each Object within the dataset needs to define a method such as getValue, which returns a float.

Then, DataSet would allow for methods such as

getValue(featureId, sampleId, objectId, params) which would call getDataObject(featureId, sampleId, objectId).getValue(params).

Similarly, you could get the matrix of values for a given data type by calling: getValues(objectId, params), which calls getValue for every {featureId, sampleId} and aggregates them into a matrix

Or you could get the hypercube of values for all data types by calling getValues(null, {params}) which calls getValue for every {featureId, sampleId, objectId} and aggregates them into a tensor.

If params is null there would be a default getValue method, or params could specify alternate ways of returning values (say different thresholds for significance of variant calls).
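A rough proto3 sketch of this interface (the service and message names are hypothetical, purely to make the shape concrete):

syntax = "proto3";

service DataSetService {
  // Returns the Object stored at (feature, sample, data type).
  rpc GetDataObject(GetDataObjectRequest) returns (DataObject);
  // Returns the scalar summary of that Object, e.g. an expression level
  // or a thresholded variant call.
  rpc GetValue(GetValueRequest) returns (GetValueResponse);
}

message GetDataObjectRequest {
  string feature_id = 1;
  string sample_id = 2;
  string object_id = 3;  // selects the data type (mutation, expression, copy number)
}

message DataObject {
  string id = 1;
  // Type-specific payload, e.g. a serialized CallSet for mutation data.
  bytes payload = 2;
}

message GetValueRequest {
  string feature_id = 1;
  string sample_id = 2;
  string object_id = 3;
  // Optional parameters, e.g. significance thresholds; empty means defaults.
  map<string, string> params = 4;
}

message GetValueResponse {
  double value = 1;
}

Calling GetValue for every {feature_id, sample_id} with a fixed object_id would then assemble the matrix described above, and iterating over object_id as well would assemble the hypercube.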

Defining FeatureSpaces is a tricky issue when we aggregate different data types (say variant level and gene level), though @kellrott has some good thoughts on this.

I can try to describe more clearly what I'm thinking if this doesn't make sense.

Adam

diekhans commented 8 years ago

The lack of a clear, documented data model continues to cause problems with the GA4GH APIs. The initial implementations of Variant Annotations, RNA, and G2P all had issues that had to be corrected because they didn't follow the undocumented design paradigms.

The dataset object continues to confuse developers. Its purpose, use, and place in the data model are not well documented.

pgrosu commented 8 years ago

Hi Mark,

I understand, though I think we are progressing towards that, as we are still sub-1.0 in our GA4GH version. It might be a good idea if we all sit down and review everything completely, specifying all the features and goals before defining the acceptance criteria via Behaviour-Driven Development (BDD) of an API (schemas). Only after that can we attempt to realize the integrated Data Model(s). You might remember my recommendations in two previous posts - linked below - on how we should go about designing our (GA4GH) API/schemas, and the natural progression towards Data Model(s):

https://github.com/ga4gh/schemas/issues/342

https://github.com/ga4gh/schemas/issues/323#issuecomment-110227872

This is the same process @dglazer followed at the beginning in the following post regarding ReadGroups, which allowed for downstream agreement on the implemented schema after several discussions:

https://github.com/ga4gh/schemas/issues/24#issuecomment-41000910

As a neutral comparison, the discussion here started logically with the features and goals, and then jumped directly to data models, which skipped many critical steps and led to our current state of some confusion and frustration. I would recommend that we start the process properly to ensure we are covering all the bases and avoid surprises in the future. Once we have a clear description across the board of the features and goals we want, then we can define the behavior of all the schemas, after which the information flow and data models naturally follow to provide consistency in the API.

Hope these are all points of common agreement.

Let me know what you think.

Thanks, Paul

diekhans commented 8 years ago

The following Unresolved Issues were added to apidesign_intro.rst during the docathon. Adding these to this ticket as issues that need to be documented.

pgrosu commented 8 years ago

Hi Mark,

When did this docathon happen? This is the first time I've heard of it. In any case, below are answers to all the questions:

1) The GA4GH object design is the Protocol Buffer message type used for transmitting genomic data. Yes, it would be the recommended way of transmitting/streaming data, as we are using gRPC as the framework to transmit it. gRPC is a server-client framework for transmitting messages, which I will describe below. GA4GH messages are formatted as Protocol Buffer messages, which are basically sets of key-value pairs whose message structure gets encoded.

Protocol Buffers

Protocol Buffers are a way of serializing a message for transmission over the wire with good compression, as well as decoding a received message. Let's take a small example:

message ReadAlignment {
 ...
 string id = 1;
 ...
}

In this example, the key would be encoded as a combination of the field number (denoted as 1 above) and the wire type (2, representing a length-delimited type such as a string, as shown below). Below is the list of wire types for Protocol Buffer field types:

[Image: table of Protocol Buffer wire types -- 0 = varint, 1 = 64-bit, 2 = length-delimited, 3/4 = start/end group (deprecated), 5 = 32-bit]

The value would be the assigned value to this field, which can itself be encoded.

Fully encoding and decoding a message would get too technical to post here; if you are interested, please visit Protocol Buffer Encoding/Decoding. Please feel free to ask if anything requires more clarification.
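As a small worked illustration of the key encoding described above (the byte values in the comments follow the standard proto3 wire format and are mine, not taken from any GA4GH document):

syntax = "proto3";

message ReadAlignmentIdOnly {
  // Field number 1, wire type 2 (length-delimited, used for strings):
  //   tag byte = (field_number << 3) | wire_type = (1 << 3) | 2 = 0x0A
  // So id = "abc" serializes as: 0A 03 61 62 63
  // (the tag, then the length 3, then the UTF-8 bytes of "abc").
  string id = 1;
}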

gRPC

Now that we have a way to encode and decode a message, we need a way to have a server that such messages can be requested from, and a client to query such a server. You might have seen service messages defined as:

service ReadService {
  ...
  rpc SearchReads(SearchReadsRequest) returns (SearchReadsResponse);
}

The service and rpc portions will create a gRPC service interface. Below is the gRPC concept diagram; I will describe the process afterwards:

[Image: gRPC concept diagram -- clients using stubs to communicate with a gRPC server over channels]

The basic idea of an RPC (Remote Procedure Call) is that it allows a program to request a service from another program located on a network without dealing with the network details. A Client creates a Channel to a Server; the Channel encodes/decodes the messages, which in our case are Protocol Buffer messages. To work with the Channel, the Client creates a Stub instance that communicates over it; it is easiest to think of the Stub as the client-side object that talks to the Server via the Channel.

The aspect that attracts us to gRPC is that it uses HTTP/2 - which was originally called HTTP/2.0 - and allows for bidirectional concurrent streaming of messages.

2) Yes, we can use an interceptor on the Client for intercepting outgoing calls to a Server via a field mask, in order to retrieve a subset of fields in a request that would go over a Channel. Again, it depends on how we want to implement it, as we have a lot of flexibility in how we design the distributed global backend. ReadAlignments and References, for instance, would not change. Variants that are not generated on-the-fly and are loaded into the global multi-center repositories would also not be changed. Again, everything has an id to tie things together or nest them by association.
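A small proto3 sketch of how a field mask could be carried in a request (the request message here is hypothetical; google.protobuf.FieldMask itself is the standard well-known type):

syntax = "proto3";

import "google/protobuf/field_mask.proto";

message GetReadAlignmentRequest {
  string read_alignment_id = 1;
  // e.g. paths: ["id", "aligned_sequence"] to return only those fields.
  google.protobuf.FieldMask field_mask = 2;
}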

3) Objects are immutable in how they are defined, as for example with References:

// `Reference`s are designed to be immutable.

message Reference {

  // The reference ID. Unique within the repository.
  string id = 1;
...

Data that gets loaded either has an id or is grouped under something that has an id, or it can be connected by a relationship table mapping one set of objects to others via their respective ids. Ids support both immutability and mutability: changes are made by adding new versions with new ids.

Hope this helps answer any questions anyone might have had, and please feel free to ask if anything requires more clarification.

Thank you, Paul

diekhans commented 8 years ago

The docathon happened last November. Much more documentation work remains to be done.

Given the current slow rate of adoption, using gRPC is not desirable right now. It requires patches to both web servers and client libraries.

An approach that addresses the current state of software and is forward-looking would be to use protocol negotiation: with HTTP 1.1, use chunked transfer encoding; with HTTP/2, use gRPC.

gRPC would not work for people who need a pure HTTP/JSON client.


pgrosu commented 7 years ago

So 1000 Genomes and similar datasets are becoming minuscule and commonplace, as there are 7.4 billion human genomes out there, and many more different organisms are being sequenced every day.

Remember that last year I posted a link to an article here (https://github.com/ga4gh/schemas/issues/308) about the first patients diagnosed through the 100,000 Genomes Project. Just to give you an idea, the total read count in 1000 Genomes, based on the index below, is about 2.47123*(10^14) reads:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.sequence.index

Do we really expect to stay with chunked transfers? We have to adapt to newer technologies, and there are many ways we can tweak and improve the internals of Protocol Buffers and gRPC, but we definitely cannot go backwards. I mean just look at Cap'n Proto as an example.

People cannot wait months for data-transfers to finish processing, or for days or weeks for analysis to complete. How quickly will this scale and be practical when dealing with 1 billion genomes?

Protocol Buffers can work with JSON-encoded data as noted here:

https://developers.google.com/protocol-buffers/docs/proto3#json

Since gRPC is currently based on Protocol Buffers - and is designed to be extensible to support multiple content types - working with at least JSON via gRPC is currently possible.
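To illustrate the proto3 JSON mapping mentioned above (the fields below are a simplified fragment; the JSON shown follows the standard proto3 JSON rules, which turn field names into lowerCamelCase keys):

syntax = "proto3";

message ReadAlignment {
  string id = 1;
  string read_group_id = 2;
}

// A ReadAlignment with id = "r1" and read_group_id = "rg1" maps to the JSON:
//   { "id": "r1", "readGroupId": "rg1" }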