ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Dataset role not clearly defined #248

Closed jeromekelleher closed 9 years ago

jeromekelleher commented 9 years ago

Currently, the Dataset type is poorly specified, and included in reads.avdl. We have some comments like

TODO: Reads and variants both want to have datasets. Are they the same object?

This needs to be clarified.

I propose:

  1. We change the name to DataSet, so that we are consistent with VariantSet, ReferenceSet, ReadGroupSet and others.
  2. We make the role of the DataSet much more explicit, and make a formal definition of the data model as a hierarchy. (This relates to #247 also, since my proposal there assumes a hierarchy.) In this hierarchy, the DataSet is the root, and it has child VariantSets and ReadGroupSets. Each VariantSet has child Variants; each ReadGroupSet has child ReadGroups, and ReadGroups have child Reads. We don't need to make any changes to the API to do this, we just need to be more explicit about what the model means.

This formalisation should make reasoning about authorisation much more straightforward.
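A minimal sketch of how such a hierarchy could simplify authorisation, assuming every object carries a link to its parent. The `Node` class and the `root_dataset` and `can_read` helpers are hypothetical illustrations, not part of the GA4GH schema or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One object in the proposed hierarchy (hypothetical helper, not in the schema)."""
    kind: str                        # e.g. "DataSet", "ReadGroupSet", "ReadGroup", "Read"
    id: str
    parent: Optional["Node"] = None  # None only for the root DataSet


def root_dataset(node: Node) -> Node:
    """Walk parent links until the DataSet at the root is reached."""
    while node.parent is not None:
        node = node.parent
    if node.kind != "DataSet":
        raise ValueError(f"{node.kind} {node.id} is not rooted in a DataSet")
    return node


def can_read(user_datasets: set, node: Node) -> bool:
    """Authorise access to any object by checking only its root DataSet."""
    return root_dataset(node).id in user_datasets


# Example hierarchy: DataSet -> ReadGroupSet -> ReadGroup -> Read
dataset = Node("DataSet", "ds1")
rgs = Node("ReadGroupSet", "rgs1", dataset)
rg = Node("ReadGroup", "rg1", rgs)
read = Node("Read", "r1", rg)
```

With a strict tree, a permission check on a Read reduces to a lookup on a single DataSet id, which is the simplification the proposal is after.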

Thoughts?

jeromekelleher commented 9 years ago

This hierarchy idea is broken somewhat by the Call and CallSet classes:

  1. CallSet has a field variantSetIds, so it can belong to more than one VariantSet. Do we really want CallSets to belong to more than one VariantSet? It would be a lot simpler if we had one call belonging to exactly one Variant and one CallSet.
  2. SearchCallsRequest is badly overloaded: we can search for calls contained in lists of callSetIds, variantSetIds and variantIds. We should replace all of these with a single callSetId, since (a) a CallSet belongs to a VariantSet, so specifying the variantSetId is redundant; and (b) obtaining the Calls associated with a given Variant should be done directly using a SearchVariantsRequest. We should ask what the point of a SearchCallsRequest is at all in this case, as the idea of getting all the Calls in a CallSet without reference to their parent Variants seems somewhat obscure...
  3. SearchAlleleCalls suffers from similar problems, having four different lists of IDs that we can provide as parameters.
dcolligan commented 9 years ago

I would also like to know what function the Dataset was supposed to serve.

richarddurbin commented 9 years ago

It encapsulates a set of data with shared access control (most important) and typically provenance. For example from a single deposited VCF. There may also be namespace implications, e.g. sample names are unique in a dataset, but not necessarily between. Something like this is necessary in practice when data from more than one source is stored in a server.

Richard


The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

calbach commented 9 years ago

I agree with what @richarddurbin said. To add to it, the role is much clearer for reads where there would otherwise be no hierarchical collection beyond a ReadGroupSet. The value of a Dataset for variants was somewhat mitigated by the introduction of VariantSets, which serve a similar purpose. Also the intent was that they are the same object across VariantSets and ReadGroupSets. It is natural to have reads and related variants for a particular study in the same dataset, for ease of sharing and management.

CallSet has a field variantSetIds, so it can belong to more than one VariantSet. Do we really want CallSets to belong to more than one VariantSet? It would be a lot simpler if we had one call belonging to exactly one Variant and one CallSet.

I didn't realize this was the case. I don't think CallSets should be able to span VariantSets; maybe I'm missing some background information as to why this would be desirable.

pgrosu commented 9 years ago

@calbach Will the same sample be part of different VariantSets? Can two VariantSets be technical replicates in a larger study?

diekhans commented 9 years ago

If it's specific to reads, then it should have a less generic name.

If it's intended to be more general, it should be moved out of reads.

What ever is intended, this intention needs to be documented.


pgrosu commented 9 years ago

I'll reference myself from what I posted previously under server: https://github.com/ga4gh/server/issues/376#issuecomment-101865761

So think of a Dataset as a study that was performed, which groups all the relevant data associated with it into one logical unit. It is a collection of one or more of the following:

It also encompasses what @richarddurbin is emphasizing. Remember this diagram I posted almost a year ago here:

[Figure: sample, read and variant connected workflow structure]

Notice how the permissions propagate down to the Variant level, starting from the permission of the user who generated the Project/Study. At any time these can be made public so everyone can view them. By using a controlled vocabulary through which one can reference other samples, one can link multiple datasets together. Thus yes, the ability to query multiple datasets is important, since one dataset cannot contain all the relevant data that one will ever need. As previously suggested in #253, you will need the ability to query multiple datasets, especially for metadata among other things. Also, having one call belonging to exactly one Variant and one CallSet might make indexing costly, though there are other approaches that would cut down those costs. We are still below version 1.0, and adding too many restrictions now would limit what we can do with the API.

Hope it helps, Paul

dglazer commented 9 years ago

Agreeing with and elaborating on @richarddurbin and @calbach 's responses to @dcolligan 's question:

I would also like to know what function the Dataset was supposed to serve.

TL;DR: they're a convenient way for data providers to lump together related data of multiple types.

mbaudis commented 9 years ago

@pgrosu's comment reflects my take on discussions we had in metadata (e.g. see comment thread in https://docs.google.com/document/d/1Sl6FYwBHjWYYo2Ex29fqvmEiOQjeokgAUfXNGsVQwZ0/edit#heading=h.p2r9dh51rf8 ). I was proposing a however called generic group object, to represent a static or dynamically generated group of data objects (e.g. all samples or analyses matching a given set of criteria). Currently, there is just a "IndividualGroup" record type, which is rather conservative in mainly referring to "group of individuals e.g. a trio". I would love to see some traction to define a consistent mechanism/object type for this.

diekhans commented 9 years ago

The problem is that the purpose is not documented in the schema. There should either be a pull request with documentation or one for its removal.

If it's to apply to more than reads, it needs to be moved out of reads.

It also seems inadequate for any of these goals, so there really needs to be some use cases investigated:

I have now convinced myself that dataset should be removed.

Also, having objects reference their containers (datasetId) is inflexible and cumbersome. We would be far better off with a functional representation of the data model where objects are not bound to their containers.

Mark

David Glazer notifications@github.com writes:

Agreeing with and elaborating on @richarddurbin and @calbach 's responses to @dcolligan 's question:

I would also like to know what function the Dataset was supposed to serve.

TL;DR: they're a convenient way for data providers to lump together related data of multiple types.

  • for access control: if a server wants to host data with heterogeneous access (e.g. a public copy of 1000genomes and a private copy of one researcher's study), there needs to be a place for data providers to specify who can see what. Datasets are a convenient level of granularity for doing so.
  • for billing: if a server wants to charge multiple data providers for the resources needed to store their data, it needs to be clear who gets charged for which data. Datasets are a convenient level of granularity for doing so.
  • for provenance: the server doesn't care about where data comes from (at least not with today's methods), but users do. Datasets are a convenient way to group all data from one logical source (e.g. a study).
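As a rough illustration of access control at dataset granularity, here is a minimal sketch. The `acl` mapping, the `PUBLIC` marker and the user names are assumptions made up for this example; nothing in the schema defines them.

```python
# Hypothetical per-dataset access control list: dataset id -> users allowed to read.
# PUBLIC is an assumed marker for open data, not something defined by the schema.
PUBLIC = "*"

acl = {
    "1000genomes": {PUBLIC},      # public copy, readable by anyone
    "private-study": {"alice"},   # one researcher's private study
}


def visible_datasets(user: str) -> list:
    """Return the ids of datasets that `user` may read, in sorted order."""
    return sorted(
        ds_id
        for ds_id, readers in acl.items()
        if PUBLIC in readers or user in readers
    )
```

The same dataset id could key a billing or provenance table, which is the sense in which a Dataset is a convenient unit for all three purposes.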


pgrosu commented 9 years ago

So maybe we might want to start discussing how do we really want to use this API and its ideal purpose? Would it be to enable better elucidation of diseases for research purposes, or what might be some of the billing and dataset-as-permission incentives? The reason I pose this question is that, usually it is better to look at all the data than just part of it. If different individuals/teams have different access to it, that would make it difficult to properly identify the causes of some diseases in order to properly perform the followup experiments or recommend treatment. Imagine if BLAST - when it was first introduced in the 90's - provided varied permission levels to different users with different billing structures. Ideally in our case, the individuals would be anonymized so that access to all the data can still be used for research purposes. This would be the ideal way to effectively enable the system for both research and personalized medicine.

Paul

richarddurbin commented 9 years ago

I am concerned about removing Dataset without replacing it with something that has already been proved to work better.

I also found DataSet vague when I originally came across it, and wanted to change or remove it. But the reason it is there is that Google, who have a working read and variant repository across multiple studies/projects, needed this wrapper concept to support their real world system. Then I realised that we have exactly the same thing within Sanger - we call it a Study. I conclude that it is essential to have a wrapper concept for practical genetic variation data repositories, and that despite all sorts of theorising about what might be ideal, a single layer wrapper as provided by DataSet is practically sufficient in quite complex settings.

I vote strongly for experience, and not removing something known to work without replacing it and convincing people who actually manage diverse repositories that the replacement is better.

Richard

On 20 May 2015, at 05:18, Mark Diekhans notifications@github.com wrote:

The problem is that the purpose is not documented in the schema. There should either be a pull request with documentation or one for its removal.

If it's to apply to more than reads, it needs to be moved out of reads.

It also seems inadequate for any of these goals, so there really needs to be some use cases investigated:

  • access control - one would have a very hard time implementing TCGA's access control policy with dataset as key for access control. It would also mean that data would move between datasets during the life of the project.
  • billing - vendor-specific tasks, such as billing, should be outside of the scope of GA4GH. Vendors should have the flexibility to design policies of their choosing and the API should not be cluttered with things that might prove useful to someone.
  • provenance - an extremely important problem and something that needs a mechanism that has finer granularity than dataset. Having this in here because it might prove useful to provenance isn't a good way to implement this critical feature.

I have now convinced myself that dataset should be removed.

Also, having objects reference their containers (datasetId) is inflexible and cumbersome. We would be far better off with a functional representation of the data model where objects are not bound to their containers.

Mark


mbaudis commented 9 years ago

A flexible way to define datasets/studies would be an object containing query & access parameters:

There is IMHO no real difference between a stored query and a "DataSet"; the latter would be a static version of this query style object (e.g. defined through the fixed object ids).
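A minimal sketch of that equivalence, assuming records are plain dictionaries keyed by id. The class names `StaticDataSet` and `QueryDataSet`, the `freeze` helper and the record fields are hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class StaticDataSet:
    """A dataset defined by a fixed set of record ids."""
    record_ids: FrozenSet[str]

    def members(self, records: Dict[str, dict]) -> FrozenSet[str]:
        # Only ids that still exist in the repository are reported.
        return frozenset(rid for rid in self.record_ids if rid in records)


@dataclass(frozen=True)
class QueryDataSet:
    """A dataset defined by a stored query, re-evaluated on every access."""
    predicate: Callable[[dict], bool]

    def members(self, records: Dict[str, dict]) -> FrozenSet[str]:
        return frozenset(rid for rid, rec in records.items() if self.predicate(rec))


def freeze(ds: QueryDataSet, records: Dict[str, dict]) -> StaticDataSet:
    """Evaluate the query once and keep the resulting ids as a static dataset."""
    return StaticDataSet(ds.members(records))


records = {
    "s1": {"type": "sample", "study": "gwas-a"},
    "s2": {"type": "sample", "study": "gwas-b"},
}
dynamic = QueryDataSet(lambda rec: rec["study"] == "gwas-a")
static = freeze(dynamic, records)
```

The practical difference shows up when the repository changes: the stored query picks up new matching records, while the frozen version keeps the membership it had when it was created.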

pgrosu commented 9 years ago

+1 with @richarddurbin, @mbaudis using a dataset/object-as-a-study - which I've had the same experience with in practice, and the reason I mentioned it last year - but then we should not limit the search across multiple datasets in https://github.com/ga4gh/schemas/pull/253.

diekhans commented 9 years ago

@richarddurbin I am all for experience; however, there is no experience for us to learn from because the purpose and scope of DataSet are not documented.

The goal of the DWG needs to be limited to providing a data model and API for data exchange if it's going to be successful. Some things needed by vendors will be out of scope; however, their requirements are also very valuable input.

The Study analogy is very compelling. However, there is no way to know, since DataSet is buried in reads.avdl and not documented. An API must be completely implementable from the IDL, documentation, and conformance suite.

Mark


lh3 commented 9 years ago

How about removing Dataset from reads.avdl and adding Study to common.avdl as a replacement? A study is user/submitter defined, loosely equivalent to a project. A central requirement of Study is that sample names must not be duplicated in a study. BTW, Study is a long existing concept in SRA/ENA.

mbaudis commented 9 years ago

@lh3 +1 for removing Dataset from reads; but after that it starts to get hairy.

Writing this as one of the metadata people, salted with personal opinions.

diekhans commented 9 years ago

I actually agree that DataSet is the better term, but the documentation should explain how this relates to study.

I would favor moving directly to metadata. Common needs to go away and be broken into functional modules. Having a dumping ground module does not help in understanding the API.

Michael Baudis notifications@github.com writes:

@lh3 +1 for removing Dataset from reads; but after that starts to get hairy.

  • Using the name "Study" is IMHO okish, but "Dataset" seems even better (more neutral; Study implies an activity, whereas Dataset or something like this just is a wrapper for somehow related data objects).
  • Placing it (e.g. Dataset) into common.avdl seems at the moment o.k., but questions (probably rightfully!) the existence of metadata.avdl. This leads to the general design decision we have to make: Do we want to have a) several collections of loosely connected record definitions like now (-3), or b) do we want to have everything which is not very specifically bound to a certain object type (like e.g. [paraphrasing] library to experiment) in a single document (+3), or c) do we want to populate the schema space with per-record/object files (+2).

Writing this as one of the metadata people, salted with personal opinions.


mbaudis commented 9 years ago

@diekhans As I said: "metadata - everything but the sequence"...

So, we should define Dataset (DataSet?) inside metadata.avdl, also replacing IndividualGroup?! Absolutely in favo(u)r.

fnothaft commented 9 years ago

I actually agree that DataSet is the better term, but the documentation should explain how this relates to study.

+1

I would favor moving directly to metadata. Common needs to go away and be broken into functional modules. Having a dumping ground module does not help in understanding the API.

+1

mbaudis commented 9 years ago

So, now that I seem to understand @diekhans: this is not against a "catch-all", but for the existence of OTRTA (one to rule ...), that is, metadata.avdl?

lh3 commented 9 years ago

I actually prefer this concept not to be too generic, but anyway, I don't really mind naming or where to put Dataset/Study. I only want to see: 1) a global Dataset/Study object and 2) sample names appearing in reads/refVar are not duplicated in a Dataset/Study.
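The second requirement can be checked mechanically. Below is a minimal sketch, assuming each dataset is represented simply as a list of the sample names that appear in its reads and variants. The function names and the example dataset ids are hypothetical.

```python
from collections import Counter


def duplicate_sample_names(sample_names):
    """Return sample names that occur more than once within a single dataset."""
    counts = Counter(sample_names)
    return sorted(name for name, n in counts.items() if n > 1)


def check_datasets(datasets):
    """Map each dataset id to its duplicated names; clean datasets are omitted."""
    problems = {}
    for ds_id, names in datasets.items():
        dups = duplicate_sample_names(names)
        if dups:
            problems[ds_id] = dups
    return problems


datasets = {
    "studyA": ["NA12878", "NA12044"],
    "studyB": ["NA12878", "NA12878"],  # same name twice in one dataset: not allowed
}
```

Note that NA12878 appearing in both studyA and studyB is fine under this rule; uniqueness is only required within a Dataset/Study, not across them.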

helenp commented 9 years ago

+1 moving to Meta Data as well. My experience is that you always need a container of some kind. I prefer DataSet. Study has a design associated in my view.



mbaudis commented 9 years ago

O.k., I'll submit something to the metadata team discussion, before doing the PR on metadata.avdl. We'll have to discuss if

/**
Represents a group of data objects of one or more types (e.g. all Individuals, Samples, Experiments 
associated with a clinical study; or e.g. a trio in genetic diagnostics.)
*/
record DataSet {
  /** The dataset UUID. This is globally unique. */
  string id;

  /** The name of the dataset. */
  union { null, string } name = null;

  /** A description of the dataset. */
  union { null, string } description = null;

  /**
  The time at which this record was created. 
  Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
  */
  string recordCreateTime;

  /**
  The time at which this record was last updated.
  Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
  */
  string recordUpdateTime;

  /** The type of dataset. Examples could be "trio", "metaanalysis", "gwas" ...*/
  union { null, string } type = null;

  /** The uuid's of included records. */
  array<string> recordsIncluded = [];

  /** The query leading to a dynamic assignment of dataset members.
  This is just a placeholder for a yet-to-be-defined query object
  union { null, metadataQuery } datasetQuery = null;
  */

  /**
  A map of additional dataset information.
  */
  map<array<string>> info = {};
}
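
To make the draft record concrete, here is a minimal Python mirror of it, useful for checking that the fields and timestamp format hang together. This is only an illustration under assumptions: the `iso_now` and `add_record` helpers are hypothetical, and generating a random UUID by default is a choice made for the sketch, not something the draft specifies.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


def iso_now() -> str:
    """Current UTC time as YYYY-MM-DDTHH:MM:SS.SSSZ, matching the schema comment."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return stamp.replace("+00:00", "Z")


@dataclass
class DataSet:
    # Field names follow the draft Avro record; defaults mirror its null/empty defaults.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    name: Optional[str] = None
    description: Optional[str] = None
    recordCreateTime: str = field(default_factory=iso_now)
    recordUpdateTime: str = field(default_factory=iso_now)
    type: Optional[str] = None
    recordsIncluded: List[str] = field(default_factory=list)
    info: Dict[str, List[str]] = field(default_factory=dict)

    def add_record(self, record_id: str) -> None:
        """Add a member UUID once and bump the update timestamp (hypothetical helper)."""
        if record_id not in self.recordsIncluded:
            self.recordsIncluded.append(record_id)
            self.recordUpdateTime = iso_now()


trio = DataSet(name="trio-1", type="trio")
trio.add_record("abc")
trio.add_record("abc")  # second call is a no-op
```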
vadimzalunin commented 9 years ago

+1 on Dataset being closer to SRA Study. A generic Dataset definition is a potentially leaky abstraction. Google's Datasets may be so different from SRA Datasets that clients would have to take this into account.

richarddurbin commented 9 years ago
  1. I am happy to call it DataSet
  2. I am happy for it to move to MetaData
  3. I support strongly Heng’s proposal that sample names are unique within a DataSet
  4. I strongly want it to cover more than one record type. In more detail:
     4a) Some types should only be able to be included in one DataSet. Perhaps they should be required to be in a DataSet. An example would be ReadGroup.
     4b) I think the same is true for Sample, or at least some sort of Sample record. But then there needs to be a mechanism to identify samples that correspond to the same individual across DataSets. This might be via some other sort of global Individual record, or via relationships. Probably a metadata question. Not immediately critical to resolve, though important.
     4c) Reference sequence records and Alleles should not have to belong to a DataSet, but the various types of Call should. I am not sure whether DataSets could declare additional private References/Alleles - perhaps.
  5. I would prefer it to be defined directly by specifying constituents, not through a dynamic query.
  6. I see this as some sort of data space in which a data owner/provider would have the rights to create entries, and control access to entries. I would also, at least for now, have access control at the level of DataSet. Perhaps people could specify different access controls for different record types in the same data set.
  7. Along these lines, I think Vadim’s suggestion of using DataSet for SRA Studies is important. We should develop the model so that SRA/ERA can provide their data via the GA4GH API using the DataSet model for their Studies. This is an important constraint that will help us get practical things in place. It does not constrain others who use the model to use it exactly the same way. e.g. different 1000 Genomes populations are in separate ERA studies (I believe), but the access controls are all the same (open) and someone else providing an analysis on the data might put them in one DataSet in their server.


mbaudis commented 9 years ago

Thanks @richarddurbin for the extensive comments. We'll limit the definition to a list of uuid, and leave the query option out of this - one can always have an extrinsic implementation for creating query based "dynamic" datasets, collections, without promoting the mechanism upfront.

Ad 4b: Individuals and samples (and every other record) have to receive their own uuid at creation time. An Individual will have the same uuid in all different datasets, as long as it is not re-created. Problems will arise implementation wise, in tracking relations between different records - while we (will) have mechanisms in place (e.g. collecting "derivedFrom" and such), one can easily foresee multiple entry points into the system, especially for samples (i.e. new sample for existing individual and re-creation of new individual). But these are data management issues, which we can limit through good schema design, but which cannot completely be avoided.

Ad 4c: While references should not be part of the DataSet, included records may point to specific references; but those pointers are part of the lower level records, not the DataSet itself (?).

lh3 commented 9 years ago

Users would want to know: 1) given a Dataset, which samples does it have? and conversely 2) given a sample name (e.g. NA12878), which Datasets have this sample? In addition, do we allow a VariantSet to span multiple Datasets? What does a "record" mean in a Dataset?

dglazer commented 9 years ago

A few thoughts:

For the longer discussion, I largely agree with @richarddurbin 's framing of the requirements.

I don't know enough yet about the progress on metadata in general to have an opinion on whether Dataset should be merged with the new concepts proposed there, or should be left as an orthogonal administrative grouping, independent from the new semantic groupings.

Last and least:

lh3 commented 9 years ago

I have no opinions on using one or multiple PRs to resolve Dataset. As this is a discussion thread, I will raise this: with Dataset becoming global, does it make sense to lift CallSet in variants.avdl also to a global object? The current CallSet is essentially:

record VariantSet { string id; string datasetId; }
record CallSet {
  string id;
  union { null, string } name = null;
  union { null, string } sampleId;
  array<string> variantSetIds = [];
}

We have something different in reads.avdl (in that we don't have a dataset-specific name):

record ReadGroup {
  union { null, string } datasetId = null;
  union { null, string } sampleId;
}

The ReadGroup version is closer to SRA, but I don't think it is working in practice because sample matching across SRA studies is hard and inconsistent at present (for a simple example see the query result of NA12044 in the BioSample database; NA10851 and NA12878 are much more complicated). A proposal is to define in common.avdl or metadata.avdl:

record DatasetSample { // feel free to change the name
  string id; // internal ID of this DatasetSample; unique in an entire data repository
  string datasetId;
  string displayName; // sample name shown in BAM header or in VCF; can't be null
  union { null, string } sampleId = null; // link to a Sample object if available
}

Then in variants.avdl, we remove CallSet and use datasetSampleId instead of callSetId in other variant objects. In reads.avdl, we replace ReadGroup.{datasetId,sampleId} with a single ReadGroup.datasetSampleId.

The major benefit of this proposal is it decouples our simple practical need (get a sample by a short name in BAM/VCF) and the complex procedure of sample matching. The proposal achieves this by demanding a DataSet-specific submitter-defined sample name. It also directly connects variants, reads and possibly other sample-related objects in a dataset.
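A minimal Python sketch of the proposed DatasetSample lookup may help make the idea concrete. The field names follow the record above; the `DatasetSampleRegistry` class, its methods, and the example IDs are hypothetical and are not part of any GA4GH schema. The sketch assumes that display names are unique within a dataset and that the link to a global Sample is optional.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DatasetSample:
    id: str                           # internal ID, unique across the repository
    dataset_id: str
    display_name: str                 # name shown in the BAM header or VCF
    sample_id: Optional[str] = None   # link to a global Sample, if one is known


class DatasetSampleRegistry:
    """Hypothetical in-memory registry keyed by (dataset ID, display name)."""

    def __init__(self) -> None:
        self._by_name: Dict[Tuple[str, str], DatasetSample] = {}

    def add(self, sample: DatasetSample) -> None:
        key = (sample.dataset_id, sample.display_name)
        if key in self._by_name:
            # Assumption: the submitter-defined name is unique within a dataset.
            raise ValueError(
                f"duplicate name {sample.display_name!r} in dataset {sample.dataset_id!r}"
            )
        self._by_name[key] = sample

    def find(self, dataset_id: str, display_name: str) -> Optional[DatasetSample]:
        return self._by_name.get((dataset_id, display_name))


registry = DatasetSampleRegistry()
registry.add(DatasetSample("ds1-s1", "ds1", "NA12044"))
registry.add(DatasetSample("ds2-s1", "ds2", "GEUV:NA12044"))

# Lookup by short name works without any cross-dataset sample matching.
hit = registry.find("ds1", "NA12044")
```

The lookup never touches `sample_id`, which is the decoupling the proposal argues for: matching the two entries above to one global Sample can happen later, or not at all.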

PS: I will explain why the NA12044 query result is problematic. NA12044 is a hapmap/1000g sample. The cell line is available from Coriell. When people use NA12044, it is almost certainly the same sample. In the result page, however, we see multiple NA12044 BioSamples with different sample names: NA12044, E-MTAB-197:NA12044, GEUV:NA12044 and CEU-NA12044.

mbaudis commented 9 years ago

@lh3 (further up) These points will have to be solved through query/access implementations. In principle, Dataset would be defined through a list of its member records' UUIDs. Queries would then either go Dataset => UUID list => record, or use a record's UUID to search Datasets.
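The two query directions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries, function names, UUIDs and record contents are hypothetical, and the sketch assumes a record may be a member of more than one Dataset, which the thread has not settled.

```python
from typing import Dict, List, Set

# Hypothetical stores: Dataset ID -> member record UUIDs, and UUID -> record.
dataset_members: Dict[str, Set[str]] = {
    "dataset-A": {"uuid-1", "uuid-2"},
    "dataset-B": {"uuid-2", "uuid-3"},
}
records: Dict[str, dict] = {
    "uuid-1": {"type": "ReadGroup", "sample": "NA12878"},
    "uuid-2": {"type": "VariantSet", "sample": "NA12878"},
    "uuid-3": {"type": "ReadGroup", "sample": "NA12044"},
}


def records_in_dataset(dataset_id: str) -> List[dict]:
    """Dataset => UUID list => records."""
    return [records[u] for u in sorted(dataset_members.get(dataset_id, set()))]


def datasets_containing(record_uuid: str) -> List[str]:
    """Record UUID => Datasets that list it as a member."""
    return sorted(d for d, members in dataset_members.items() if record_uuid in members)


in_a = records_in_dataset("dataset-A")
owners = datasets_containing("uuid-2")
```

The reverse lookup here scans every Dataset; a real implementation would presumably keep a second index from UUID to Datasets.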

pgrosu commented 9 years ago

To support all these ideas we would definitely need to enforce naming conventions as appropriate and have controlled vocabularies. I agree with @mbaudis that being able to reference similar samples would simplify the searches, and standardizing would provide richer results. That approach would also allow for implementing the mapping of samples <-> datasets - as @lh3 mentioned - along with many other mappings that would be generated on-the-fly, which I previously recommended doing via inverted indices. Such indices would also allow for searches over even more general key-value mappings, such as the following:

Liver_Tissue -> DataSet_A_UUID.Sample1.ReadGroup1
Liver_Tissue -> DataSet_B_UUID.Sample44.ReadGroup3

Such a mapping would associate (group) studies or datasets with specific underlying data. This would also satisfy what @vadimzalunin, @richarddurbin and @dglazer are looking for. Again the key-value pairs can be key-key or key-value, or even more complex ones such as key.key.key-value, where key.key.key is a key. So everything is possible.

We can even implement the n-gram concepts of search engines, where key-pairing would improve the results (i.e. a query of the terms "United" and "States" would be better associated as "United States"). In this example such an index would be as follows:

Lung_Tissue & adenocarcinoma_stage_1 -> DataSet_md5sum_3e9e77456ba.CallSet_id_324
Lung_Tissue & adenocarcinoma_stage_1 -> DataSet_md5sum_a275127a7b6.CallSet_id_993247
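As a rough sketch of the inverted-index idea, the Python below maps each term to the set of dotted paths it describes and answers multi-term queries by intersecting those sets. The `InvertedIndex` class and its methods are hypothetical, the paths are the ones from the examples in this comment plus one invented entry (`DataSet_C_UUID.CallSet_id_7`), and nothing here reflects an existing GA4GH API. Ranking, n-grams and distribution are left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class InvertedIndex:
    """Hypothetical term -> paths index; a sketch, not a GA4GH component."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, path: str, terms: Iterable[str]) -> None:
        # Register the path under every term that describes it.
        for term in terms:
            self._postings[term].add(path)

    def search(self, *terms: str) -> Set[str]:
        # AND semantics: return paths registered under all given terms.
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = InvertedIndex()
index.add("DataSet_A_UUID.Sample1.ReadGroup1", ["Liver_Tissue"])
index.add("DataSet_B_UUID.Sample44.ReadGroup3", ["Liver_Tissue"])
index.add("DataSet_md5sum_3e9e77456ba.CallSet_id_324",
          ["Lung_Tissue", "adenocarcinoma_stage_1"])
index.add("DataSet_md5sum_a275127a7b6.CallSet_id_993247",
          ["Lung_Tissue", "adenocarcinoma_stage_1"])
# Invented entry, to show that the combined query is narrower than one term.
index.add("DataSet_C_UUID.CallSet_id_7", ["Lung_Tissue"])

liver = index.search("Liver_Tissue")
lung_stage_1 = index.search("Lung_Tissue", "adenocarcinoma_stage_1")
```

Intersecting per-term sets at query time is one design choice; the compound key shown above (`Lung_Tissue & adenocarcinoma_stage_1`) could equally be stored as a single precomputed key.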

These types of searches are dynamic and are constantly being generated on-the-fly, since today's search engine queries number almost 70000/second or more world-wide. The mappings can even be filtered through a model to improve their quality with rank-scores. Thus our type of read-only access - which closely parallels web-search engine design - can be accommodated by a similar approach.

Paul

diekhans commented 9 years ago

Richard Durbin notifications@github.com writes:

  1. I support strongly Heng’s proposal that sample names are unique within a DataSet

I strongly feel that samples should have globally unique ids throughout the world.

The problems of identification will haunt us forever if we don't make it part of GA4GH.

mdmiller53 commented 9 years ago

+1 also for removing Dataset from reads; perhaps adding a GenomicDataset for this specific purpose. i don't agree that study is equivalent, at least not in the way i've seen dataset used, which is the data generated for a study from the participants' samples. the study design then allows structuring the relationship of the data in the dataset(s) of the study. the Google use case is simply for the genomic data, not the study that generated it.

@diekhans

take a look at w3c dataset description

right now the GA4GH focus is on variants but my primary interest is, as things move forward, in getting to where the summarized data (level 3 data in TCGA terms) and the datasets from clustering and further analysis can be described using GA4GH approved standards. it is not a good idea to 'box' everything into GA4GH's initial efforts; better to allow room to grow throughout the entire functional genomics space

mellybelly commented 9 years ago

The W3C dataset group would welcome test cases, evaluation, and improvement. They did a nice job reviewing a lot of existing standards; we are using it in our work and have found insufficiencies that we are feeding back. Would be good to synergize these efforts.

rlesca01 commented 9 years ago

okay sounds good

mbaudis commented 9 years ago

@mellybelly But isn't the W3C work aimed more at the general description of a dataset as a resource (dataset=resource)? That's not the discussion here about dataset/study/... records; this is about aggregation of records sharing some common features (either intrinsically, e.g. clinical diagnosis; or procedurally, e.g. part of same study, provenance ...).

pgrosu commented 9 years ago

@mdmiller53 I like the way you think :) Could you maybe generate a PR or create an issue - or even expand here - regarding the way reads/variants/etc, including downstream analysis models, would integrate with the other teams' components in optimizing for distributed over-the-wire data models that incorporate what you are envisioning? You have probably seen my previous post on how this would all fit together, but I would be interested in how you would tie it all together to allow room for growth with regard to capabilities, which I'm happy to say has also been my message over the past year. Not sure if you might have already seen the Priorities for the Data Working Group document, but I included it just in case :)

Thank you and look forward to it, Paul

mdmiller53 commented 9 years ago

@pgrosu in looking at the priorities, it is item 3, "Expression, methylation, and other epigenetic data", that i am referring to. the metadata will also need to eventually be suitable for describing that data in the summarized format (tsv usually, TCGA Level 3) once the WG gets to it, not just the initial BAM from sequencing that the summarized and corrected values are generated from. i've been trying to put together a document that describes the work we are doing in my lab at ISB for the metadata as part of our CGC contract. i'm off on vacation next week but will get to that on my return, near the end of the month

pgrosu commented 9 years ago

@mdmiller53 I am very excited about what I hear and eagerly look forward to your document. There have been many discussions on the importance of metadata searching. You've probably seen some of my posts on inverted indices for implementing them in a distributed, replicated balanced data-structure for optimized retrieval on any (meta)data efficiently - basically one of the core concepts of how large search engines are implemented these days.

Regarding the expression portion, there is an RNA-Seq task team which you might want to connect with. The only people I know that are part of that team are Sean (@saupchurch) and Alastair (@afirth).

In any case, hope you have a wonderful vacation and look forward to reading your document - no rush :)

Thank you, Paul

dglazer commented 9 years ago

@diekhans -- I believe that #389, which is now committed, means we can close this issue. Doing so now; feel free to reopen if you disagree.