Closed jeromekelleher closed 9 years ago
This hierarchy idea is broken somewhat by the Call and CallSet classes:
- CallSet has a field variantSetIds, so it can belong to more than one VariantSet. Do we really want CallSets to belong to more than one VariantSet? It would be a lot simpler if we had one call belonging to exactly one Variant and one CallSet.
- SearchCallsRequest is badly overloaded: we can search for calls contained in a list of callSetIds, variantSetIds and variantIds. We should replace all of these with a single callSetId, since (a) a CallSet belongs to a VariantSet, so specifying the variantSetId is redundant; and (b) obtaining the Calls associated with a given Variant should be done directly using a SearchVariantsRequest. We should ask what the point of a SearchCallsRequest is at all in this case, as the idea of getting all the Calls in a CallSet without reference to their parent Variants seems somewhat obscure...
- SearchAlleleCalls suffers from similar problems, having four different lists of IDs that we can provide as parameters.
I would also like to know what function the Dataset was supposed to serve.
It encapsulates a set of data with shared access control (most important) and typically provenance. For example from a single deposited VCF. There may also be namespace implications, e.g. sample names are unique in a dataset, but not necessarily between. Something like this is necessary in practice when data from more than one source is stored in a server.
Richard
On 14 May 2015, at 21:07, Danny Colligan notifications@github.com wrote:
I would also like to know what function the Dataset was supposed to serve.
— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/248#issuecomment-102152607.
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
I agree with what @richarddurbin said. To add to it, the role is much clearer for reads where there would otherwise be no hierarchical collection beyond a ReadGroupSet. The value of a Dataset for variants was somewhat mitigated by the introduction of VariantSets, which serve a similar purpose. Also the intent was that they are the same object across VariantSets and ReadGroupSets. It is natural to have reads and related variants for a particular study in the same dataset, for ease of sharing and management.
CallSet has a field variantSetIds, so it can belong to more than one VariantSet. Do we really want CallSets to belong to more than one VariantSet? It would be a lot simpler if we had one call belonging to exactly one Variant and one CallSet.
I didn't realize this was the case. I don't think CallSets should be able to span VariantSets; maybe I'm missing some background information as to why this would be desirable.
@calbach Will the same sample be part of different VariantSets? Can two VariantSets be technical replicates in a larger study?
If it's specific to reads, then it should have a less generic name.
If it's intended to be more general, it should be moved out of reads.
Whatever is intended, the intention needs to be documented.
I'll reference myself from what I posted previously under server: https://github.com/ga4gh/server/issues/376#issuecomment-101865761
So think of a Dataset as a study that was performed and which joins/groups all the relevant data associated with it in a logical grouping. It is a collection of one or more of the following:
It also encompasses what @richarddurbin is emphasizing. Remember this diagram I posted almost a year ago here:
Notice how the permissions propagate down to the Variant level, which is initiated by the user's permission that generated the Project/Study. At any time these can be turned to public so everyone can view them. So by utilizing a controlled vocabulary through which one can reference other samples, one can link multiple datasets together. Thus yes, the ability to query multiple datasets is important since one dataset cannot contain all the relevant data that one will ever need. As previously suggested in #253 you will need the ability to query multiple datasets, especially for metadata among other things. Also having one call belonging to exactly one Variant and one CallSet might make indexing costly, though there are other approaches that would cut down those costs. We are still below version 1.0, and adding too many limitations would restrict what we can do with the API.
Hope it helps, Paul
Agreeing with and elaborating on @richarddurbin and @calbach 's responses to @dcolligan 's question:
I would also like to know what function the Dataset was supposed to serve.
TL;DR: they're a convenient way for data providers to lump together related data of multiple types.
@pgrosu's comment reflects my take on discussions we had in metadata (e.g. see comment thread in https://docs.google.com/document/d/1Sl6FYwBHjWYYo2Ex29fqvmEiOQjeokgAUfXNGsVQwZ0/edit#heading=h.p2r9dh51rf8 ). I was proposing a generic group object, whatever it ends up being called, to represent a static or dynamically generated group of data objects (e.g. all samples or analyses matching a given set of criteria). Currently, there is just an "IndividualGroup" record type, which is rather conservative in mainly referring to a "group of individuals e.g. a trio". I would love to see some traction to define a consistent mechanism/object type for this.
The problem is that the purpose is not documented in the schema. There should be either a pull request with documentation or one for its removal.
If it's to apply to more than reads, it needs to be moved out of reads.
It also seems inadequate for any of these goals, so there really needs to be some use cases investigated:
I have now convinced myself that dataset should be removed.
Also, having objects reference their containers (datasetId) is inflexible and cumbersome. We would be far better off with a functional representation of the data model where objects are not bound to their containers.
Mark
David Glazer notifications@github.com writes:
Agreeing with and elaborating on @richarddurbin and @calbach 's responses to @dcolligan 's question:
I would also like to know what function the Dataset was supposed to serve.
TL;DR: they're a convenient way for data providers to lump together related data of multiple types.
• for access control: if a server wants to host data with heterogeneous access (e.g. a public copy of 1000genomes and a private copy of one researcher's study), there needs to be a place for data providers to specify who can see what. Datasets are a convenient level of granularity for doing so.
• for billing: if a server wants to charge multiple data providers for the resources needed to store their data, it needs to be clear who gets charged for which data. Datasets are a convenient level of granularity for doing so.
• for provenance: the server doesn't care about where data comes from (at least not with today's methods), but users do. Datasets are a convenient way to group all data from one logical source (e.g. a study).
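To make the access-control bullet concrete, here is a minimal Python sketch of a check made at dataset granularity. Everything in it is an assumption for illustration: the Dataset class, its public and allowed_users fields, and the can_read and visible_records helpers are hypothetical and not part of the GA4GH schemas; only the datasetId back-reference comes from the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical dataset wrapper; not a GA4GH schema record."""
    id: str
    public: bool = False
    allowed_users: set = field(default_factory=set)


def can_read(user: str, dataset: Dataset) -> bool:
    """A user may read a dataset if it is public or the user is listed on it."""
    return dataset.public or user in dataset.allowed_users


def visible_records(user, records, datasets):
    """Keep only the records whose parent dataset the user may read."""
    return [r for r in records if can_read(user, datasets[r["datasetId"]])]


# Illustrative data: one public dataset and one private study.
datasets = {
    "1kg": Dataset(id="1kg", public=True),
    "private-study": Dataset(id="private-study", allowed_users={"alice"}),
}
records = [
    {"id": "rgs-1", "datasetId": "1kg"},
    {"id": "rgs-2", "datasetId": "private-study"},
]
```

The point of the sketch is only that a single decision per dataset covers every object that refers to it; finer-grained policies would need something more than this.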
So maybe we should start discussing how we really want to use this API and what its ideal purpose is. Would it be to enable better elucidation of diseases for research purposes, or what might be some of the billing and dataset-as-permission incentives? The reason I pose this question is that it is usually better to look at all the data than just part of it. If different individuals/teams have different access to it, that would make it difficult to properly identify the causes of some diseases in order to perform the follow-up experiments or recommend treatment. Imagine if BLAST, when it was first introduced in the 90's, had provided varied permission levels to different users with different billing structures. Ideally in our case, the individuals would be anonymized so that all the data can still be used for research purposes. This would be the ideal way to effectively enable the system for both research and personalized medicine.
Paul
I am concerned about removing Dataset without replacing it with something that has already been proved to work better.
I also found DataSet vague when I originally came across it, and wanted to change or remove it. But the reason it is there is that Google, who have a working read and variant repository across multiple studies/projects, needed this wrapper concept to support their real world system. Then I realised that we have exactly the same thing within Sanger - we call it a Study. I conclude that it is essential to have a wrapper concept for practical genetic variation data repositories, and that despite all sorts of theorising about what might be ideal, a single layer wrapper as provided by DataSet is practically sufficient in quite complex settings.
I vote strongly for experience, and not removing something known to work without replacing it and convincing people who actually manage diverse repositories that the replacement is better.
Richard
On 20 May 2015, at 05:18, Mark Diekhans notifications@github.com wrote:
The problem is that the purpose is not documented in the schema. There should be either a pull request with documentation or one for its removal.
If it's to apply to more than reads, it needs to be moved out of reads.
It also seems inadequate for any of these goals, so there really needs to be some use cases investigated:
- access control - one would have a very hard time implementing TCGA's access control policy with dataset as key for access control. It would also mean that data would move between datasets during the life of the project.
- billing - vendor-specific tasks, such as billing, should be outside of the scope of GA4GH. Vendors should have the flexibility to design policies of their choosing and the API should not be cluttered with things that might prove useful to someone.
- provenance - an extremely important problem and something that needs a mechanism that has finer granularity than dataset. Having this in here because it might prove useful to provenance isn't a good way to implement this critical feature.
I have now convinced myself that dataset should be removed.
Also, having objects reference their containers (datasetId) is inflexible and cumbersome. We would be far better off with a functional representation of the data model where objects are not bound to their containers.
Mark
A flexible way to define datasets/studies would be an object containing query & access parameters:
There is IMHO no real difference between a stored query and a "DataSet"; the latter would be a static version of this query style object (e.g. defined through the fixed object ids).
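The equivalence between a stored query and a static dataset can be sketched in Python as follows. This is an illustration under stated assumptions, not the object proposed in the comment above: DatasetDefinition, its record_ids and query fields, and the members method are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Optional


@dataclass(frozen=True)
class DatasetDefinition:
    """Hypothetical: a dataset is either a fixed id list or a stored query."""
    name: str
    record_ids: Optional[FrozenSet[str]] = None
    query: Optional[Callable[[dict], bool]] = None

    def members(self, store: Dict[str, dict]) -> set:
        if self.record_ids is not None:
            # Static dataset: membership is the stored id list.
            return {rid for rid in self.record_ids if rid in store}
        if self.query is not None:
            # Dynamic dataset: membership is recomputed on every call.
            return {rid for rid, rec in store.items() if self.query(rec)}
        return set()


# Illustrative record store keyed by record id.
store = {
    "s1": {"tissue": "liver"},
    "s2": {"tissue": "lung"},
}
static = DatasetDefinition("fixed", record_ids=frozenset({"s1"}))
dynamic = DatasetDefinition("liver-samples", query=lambda rec: rec["tissue"] == "liver")
```

Both definitions resolve to the same members for this store; they diverge only once new matching records are added, which is the sense in which the static form is a frozen version of the query.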
+1 with @richarddurbin, @mbaudis using a dataset/object-as-a-study - which I've had the same experience with in practice, and the reason I mentioned it last year - but then we should not limit the search across multiple datasets in https://github.com/ga4gh/schemas/pull/253.
@richarddurbin I am all for experience, however there is no experience for us to learn from because the purpose and scope of DataSet are not documented.
The goal of the DWG needs to be limited to providing a data model and API for data exchange if it's going to be successful. Some things needed by vendors will be out of scope, however their requirements are also very valuable input.
The Study analogy is very compelling. However, there is no way to know, since DataSet is buried in reads.avdl and not documented. An API must be completely implementable from the IDL, documentation, and conformance suite.
Mark
How about removing Dataset from reads.avdl and adding Study to common.avdl as a replacement? A study is user/submitter defined, loosely equivalent to a project. A central requirement of Study is that sample names must not be duplicated in a study. BTW, Study is a long existing concept in SRA/ENA.
@lh3 +1 for removing Dataset from reads; but after that starts to get hairy.
Writing this as one of the metadata people, salted with personal opinions.
I actually agree that DataSet is the better term, but the documentation should explain how this relates to study.
I would favor moving directly to metadata. Common needs to go away and be broken into functional modules. Having a dumping ground module does not help in understanding the API.
Michael Baudis notifications@github.com writes:
@lh3 +1 for removing Dataset from reads; but after that starts to get hairy.
• Using the name "Study" is IMHO okish, but "Dataset" seems even better (more neutral; Study implies an activity, whereas Dataset or something like this just is a wrapper for somehow related data objects).
• Placing it (e.g. Dataset) into common.avdl seems at the moment o.k., but questions (probably rightfully!) the existence of metadata.avdl. This leads to the general design decision we have to make: Do we want to have a) several collections of loosely connected record definitions like now (-3), or b) do we want to have everything which is not very specifically bound to a certain object type (like e.g. [paraphrasing] library to experiment) in a single document (+3), or c) do we want to populate the schema space with per-record/object files (+2).
Writing this as one of the metadata people, salted with personal opinions.
@diekhans As I said: "metadata - everything but the sequence"...
So, we should define Dataset (DataSet?) inside metadata.avdl, also replacing IndividualGroup?! Absolutely in favo(u)r.
I actually agree that DataSet is the better term, but the documentation should explain how this relates to study.
+1
I would favor moving directly to metadata. Common needs to go away and be broken into functional modules. Having a dumping ground module does not help in understanding the API.
+1
So now seeming to understand @diekhans: This is not against a "catch it all", but for existence of OTRTA (one to rule ...), that is metadata.avdl?
I actually prefer this concept not to be too generic, but anyway, I don't really mind naming or where to put Dataset/Study. I only want to see: 1) a global Dataset/Study object and 2) sample names appearing in reads/refVar are not duplicated in a Dataset/Study.
+1 moving to Meta Data as well. My experience is that you always need a container of some kind. I prefer DataSet. Study has a design associated in my view.
Helen Parkinson, PhD Team Leader
O.k., I'll submit something to the metadata team discussion, before doing the PR on metadata.avdl. We'll have to discuss if
/**
Represents a group of data objects of one or more types (e.g. all Individuals, Samples, Experiments
associated with a clinical study; or e.g. a trio in genetic diagnostics.)
*/
record DataSet {
  /** The dataset UUID. This is globally unique. */
  string id;

  /** The name of the dataset. */
  union { null, string } name = null;

  /** A description of the dataset. */
  union { null, string } description = null;

  /**
  The time at which this record was created.
  Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
  */
  string recordCreateTime;

  /**
  The time at which this record was last updated.
  Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
  */
  string recordUpdateTime;

  /** The type of dataset. Examples could be "trio", "metaanalysis", "gwas" ...*/
  union { null, string } type = null;

  /** The UUIDs of included records. */
  array<string> recordsIncluded = [];

  /**
  The query leading to a dynamic assignment of dataset members.
  This is just a placeholder for a yet-to-be-defined query object
  union { null, metadataQuery } datasetQuery = null;
  */

  /**
  A map of additional dataset information.
  */
  map<array<string>> info = {};
}
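For readers who want to see what an instance of this record might look like, here is a rough Python mirror of the proposed fields. The field names follow the draft record above; the new_dataset and iso8601_now helpers, the example member UUIDs, and the choice of uuid4 for the id are assumptions made for illustration only.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


def iso8601_now() -> str:
    """Current UTC time as YYYY-MM-DDTHH:MM:SS.SSSZ, the form shown in the record comments."""
    now = datetime.now(timezone.utc)
    return now.isoformat(timespec="milliseconds").replace("+00:00", "Z")


@dataclass
class DataSet:
    """Python mirror of the draft DataSet record; nullable fields default to None."""
    id: str
    recordCreateTime: str
    recordUpdateTime: str
    name: Optional[str] = None
    description: Optional[str] = None
    type: Optional[str] = None
    recordsIncluded: List[str] = field(default_factory=list)
    info: Dict[str, List[str]] = field(default_factory=dict)


def new_dataset(name, type_, record_ids):
    """Hypothetical helper: assign a fresh UUID and matching create/update times."""
    created = iso8601_now()
    return DataSet(
        id=str(uuid.uuid4()),
        recordCreateTime=created,
        recordUpdateTime=created,
        name=name,
        type=type_,
        recordsIncluded=list(record_ids),
    )


# Placeholder member ids; real entries would be the UUIDs of existing records.
trio = new_dataset("example trio", "trio", ["uuid-child", "uuid-mother", "uuid-father"])
```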
+1 on Dataset being closer to SRA Study. A generic Dataset definition is a potential leaky abstraction. Google's Datasets may be so different from SRA Datasets that clients would have to take this into account.
On 21 May 2015, at 10:41, Vadim Zalunin notifications@github.com wrote:
+1 on Dataset being closer to SRA Study. A generic Dataset definition is a potential leaky abstraction. Google's Datasets may be so different from SRA Datasets that clients would have to take this into account.
Thanks @richarddurbin for the extensive comments. We'll limit the definition to a list of UUIDs, and leave the query option out of this - one can always have an extrinsic implementation for creating query-based "dynamic" datasets or collections, without promoting the mechanism upfront.
Ad 4b: Individuals and samples (and every other record) have to receive their own UUID at creation time. An Individual will have the same UUID in all different datasets, as long as it is not re-created. Problems will arise implementation-wise in tracking relations between different records - while we (will) have mechanisms in place (e.g. collecting "derivedFrom" and such), one can easily foresee multiple entry points into the system, especially for samples (i.e. new sample for existing individual and re-creation of new individual). But these are data management issues, which we can limit through good schema design, but which cannot be completely avoided.
Ad 4c: While references should not be part of the DataSet, included records may point to specific references; but those pointers are part of the lower level records, not the DataSet itself (?).
Users would want to know: 1) given a Dataset, which samples does it have? and conversely 2) given a sample name (e.g. NA12878), which Datasets have this sample? In addition, do we allow a VariantSet to span multiple Datasets? What does a "record" mean in a Dataset?
A few thoughts:
For the longer discussion, I largely agree with @richarddurbin 's framing of the requirements.
I don't know enough yet about the progress on metadata in general to have an opinion on whether Dataset should be merged with the new concepts proposed there, or should be left as an orthogonal administrative grouping, independent from the new semantic groupings.
Last and least:
I have no opinions on using one or multiple PRs to resolve Dataset. As this is a discussion thread, I will raise this: with Dataset becoming global, does it make sense to lift CallSet in variants.avdl also to a global object? The current CallSet is essentially:
record VariantSet { string id; string datasetId; }
record CallSet {
  string id;
  union { null, string } name = null;
  union { null, string } sampleId;
  array<string> variantSetIds = [];
}
We have something different in reads.avdl (in that we don't have a dataset-specific name):
record ReadGroup {
  union { null, string } datasetId = null;
  union { null, string } sampleId;
}
The ReadGroup version is closer to SRA, but I don't think it is working in practice because sample matching across SRA studies is hard and inconsistent at present (for a simple example see the query result of NA12044 in the BioSample database; NA10851 and NA12878 are much more complicated). A proposal is to define in common.avdl or metadata.avdl:
record DatasetSample { // feel free to change the name
  string id; // internal ID of this DatasetSample; unique in an entire data repository
  string datasetId;
  string displayName; // sample name shown in BAM header or in VCF; can't be null
  union { null, string } sampleId = null; // link to a Sample object if available
}
Then in variants.avdl, we remove CallSet and use datasetSampleId instead of callSetId in other variant objects. In reads.avdl, we replace ReadGroup.{datasetId,sampleId} with a single ReadGroup.datasetSampleId.
The major benefit of this proposal is that it decouples our simple practical need (get a sample by a short name in BAM/VCF) from the complex procedure of sample matching. The proposal achieves this by demanding a DataSet-specific submitter-defined sample name. It also directly connects variants, reads and possibly other sample-related objects in a dataset.
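A small Python sketch of how the proposal could behave in practice is given below. The DatasetSample fields are taken from the record above; the DatasetSampleRegistry class, its add and lookup methods, and the dataset ids "geuvadis" and "hapmap" are hypothetical and exist only to show the per-dataset uniqueness rule.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DatasetSample:
    """Mirror of the proposed record: a submitter-defined name scoped to one dataset."""
    id: str
    datasetId: str
    displayName: str
    sampleId: Optional[str] = None


class DatasetSampleRegistry:
    """Hypothetical in-memory registry enforcing unique display names per dataset."""

    def __init__(self) -> None:
        self._by_name: Dict[Tuple[str, str], DatasetSample] = {}

    def add(self, sample: DatasetSample) -> None:
        key = (sample.datasetId, sample.displayName)
        if key in self._by_name:
            raise ValueError(
                f"{sample.displayName!r} already exists in dataset {sample.datasetId!r}"
            )
        self._by_name[key] = sample

    def lookup(self, dataset_id: str, display_name: str) -> Optional[DatasetSample]:
        """Resolve a short BAM/VCF sample name within a single dataset."""
        return self._by_name.get((dataset_id, display_name))


registry = DatasetSampleRegistry()
registry.add(DatasetSample("ds-1", "geuvadis", "NA12044"))
# The same display name in a different dataset is allowed.
registry.add(DatasetSample("ds-2", "hapmap", "NA12044"))

# Reads and variants both point at the same DatasetSample id.
read_group = {"id": "rg-1", "datasetSampleId": "ds-1"}
call = {"variantId": "v-1", "datasetSampleId": "ds-1"}
```

Whether two DatasetSamples in different datasets are the same biological sample is left to the optional sampleId link, which is the decoupling the proposal argues for.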
PS: I will explain why the NA12044 query result is problematic. NA12044 is a hapmap/1000g sample. The cell line is available from Coriell. When people use NA12044, it is almost certainly the same sample. In the result page, however, we see multiple NA12044 BioSamples with different sample names: NA12044, E-MTAB-197:NA12044, GEUV:NA12044 and CEU-NA12044.
@lh3 (further up) These points will have to be solved through query/access implementations. In principle, Dataset would be defined through a list of its member records' UUIDs. Queries would then either go Dataset => UUID list => record, or use a record's UUID to search Datasets.
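The two query directions described here can be sketched in Python as a forward lookup plus a reverse index. The dataset ids, UUIDs, and the helper names records_in and build_reverse_index are illustrative assumptions; the sample names are the ones mentioned earlier in the thread.

```python
from collections import defaultdict

# Hypothetical data: dataset id -> UUIDs of its member records.
dataset_members = {
    "dataset-A": ["uuid-1", "uuid-2"],
    "dataset-B": ["uuid-2", "uuid-3"],
}
records = {
    "uuid-1": {"kind": "Sample", "name": "NA12878"},
    "uuid-2": {"kind": "Sample", "name": "NA12044"},
    "uuid-3": {"kind": "Sample", "name": "NA10851"},
}


def records_in(dataset_id):
    """Dataset => UUID list => record."""
    return [records[u] for u in dataset_members.get(dataset_id, []) if u in records]


def build_reverse_index(members):
    """Record UUID => set of datasets that list it."""
    index = defaultdict(set)
    for dataset_id, uuids in members.items():
        for u in uuids:
            index[u].add(dataset_id)
    return index


datasets_containing = build_reverse_index(dataset_members)
```

Because membership is just a UUID list, a record such as uuid-2 can appear in more than one dataset without being copied.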
To support all these ideas we would definitely need to enforce several naming nomenclatures as appropriate and have controlled vocabularies. I agree with @mbaudis that being able to reference similar samples would simplify the searches, and standardizing them would provide richer results. That approach would also allow the samples <-> datasets mapping that @lh3 mentioned, along with many other mappings generated on the fly, which I previously recommended implementing via inverted indices. Such indices would also allow searches over key-value pairs for even more general mappings, such as the following:
Liver_Tissue -> DataSet_A_UUID.Sample1.ReadGroup1
Liver_Tissue -> DataSet_B_UUID.Sample44.ReadGroup3
Such a mapping would associate (group) studies or datasets with specific underlying data. This would also satisfy what @vadimzalunin, @richarddurbin and @dglazer are looking for. Again, the key-value pairs can be key-key or key-value, or even more complex ones such as key.key.key-value, where key.key.key is a key. So everything is possible.
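A minimal Python sketch of such an inverted index follows. The two Liver_Tissue paths are the ones from the example above; the third annotation and the build_inverted_index helper are made up for illustration, and a real server would likely persist the index rather than rebuild it in memory.

```python
from collections import defaultdict


def build_inverted_index(annotations):
    """Map each term to the sorted list of dotted paths annotated with it."""
    index = defaultdict(list)
    for path, terms in annotations.items():
        for term in terms:
            index[term].append(path)
    return {term: sorted(paths) for term, paths in index.items()}


# Path -> terms attached to it (the Lung_Tissue entry is hypothetical).
annotations = {
    "DataSet_A_UUID.Sample1.ReadGroup1": ["Liver_Tissue"],
    "DataSet_B_UUID.Sample44.ReadGroup3": ["Liver_Tissue"],
    "DataSet_B_UUID.Sample45.ReadGroup1": ["Lung_Tissue"],
}
index = build_inverted_index(annotations)
```

Looking up index["Liver_Tissue"] then returns read groups from both datasets in one step, which is the cross-dataset grouping the comment describes.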
We can even implement the n-gram concepts of search engines, where key-pairing would improve the results (i.e. a query of the terms "United" and "States" would be better associated as "United States"). In this example such an index would be as follows:
Lung_Tissue & adenocarcinoma_stage_1 -> DataSet_md5sum_3e9e77456ba.CallSet_id_324
Lung_Tissue & adenocarcinoma_stage_1 -> DataSet_md5sum_a275127a7b6.CallSet_id_993247
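One way to sketch the paired-key index in Python is to index every unordered pair of terms attached to a path. This is an assumption about how such an index could be built, not a description of any existing implementation: build_pair_index and lookup_pair are hypothetical helpers, and the third annotation is invented to show a path that does not match the pair.

```python
from collections import defaultdict
from itertools import combinations


def build_pair_index(annotations):
    """Index each unordered pair of terms on a path, so a pair acts as a single key."""
    index = defaultdict(set)
    for path, terms in annotations.items():
        # Sorting makes (a, b) and (b, a) produce the same key.
        for pair in combinations(sorted(set(terms)), 2):
            index[pair].add(path)
    return index


def lookup_pair(index, term_a, term_b):
    """Return the paths annotated with both terms, regardless of argument order."""
    return index.get(tuple(sorted((term_a, term_b))), set())


annotations = {
    "DataSet_md5sum_3e9e77456ba.CallSet_id_324": ["Lung_Tissue", "adenocarcinoma_stage_1"],
    "DataSet_md5sum_a275127a7b6.CallSet_id_993247": ["adenocarcinoma_stage_1", "Lung_Tissue"],
    # Hypothetical entry carrying only one of the two terms.
    "DataSet_md5sum_3e9e77456ba.CallSet_id_325": ["Lung_Tissue"],
}
pair_index = build_pair_index(annotations)
```

The number of pair keys grows quadratically with the terms per path, so a production index would probably restrict pairing to terms that are commonly queried together.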
These types of searches are dynamic and are constantly generated on the fly, since today's search engines handle almost 70000 queries/second or more worldwide. The mappings can even be filtered through a model that assigns rank scores to improve their quality. Thus our type of read-only access, which closely parallels web search engine design, can be accommodated by a similar approach.
Paul
Richard Durbin notifications@github.com writes:
- I support strongly Heng’s proposal that sample names are unique within a DataSet
I strongly feel that samples should have globally unique ids throughout the world.
The problems of identification will haunt us forever if we don't make it part of GA4GH.
+1 also for removing Dataset from reads; perhaps adding a GenomicDataset for this specific purpose. i don't agree that study is equivalent, at least not in the way i've seen dataset used, which is the data generated for a study from the participants' samples. the study design then allows structuring the relationship of the data in the dataset(s) of the study. the Google use case is simply for the genomic data, not the study that generated it.
@diekhans take a look at w3c dataset description
right now the GA4GH focus is on variants but my primary interest is, as things move forward, in where the summarized data (level 3 data in TCGA terms) and the datasets from clustering and further analysis can be described using GA4GH approved standards. it is not a good idea to 'box' everything into GA4GH's initial efforts, but rather to allow room to grow throughout the entire functional genomics space
The W3C dataset group would welcome test cases, evaluation, and improvement. They did a nice job reviewing a lot of existing standards; we are using it in our work and have found insufficiencies that we are feeding back. Would be good to synergize these efforts.
okay sounds good
On Wed, Jun 3, 2015 at 1:02 PM, Melissa Haendel notifications@github.com wrote:
The W3C dataset group would welcome test cases, evaluation, and improvement. They did a nice job reviewing a lot of existing standards; we are using it in our work and have found insufficiencies that we are feeding back. Would be good to synergize these efforts.
@mellybelly But isn't the W3C work aimed more at the general description of a dataset as a resource (dataset=resource)? That's not what the discussion here about dataset/study/... records is about; this is about aggregation of records sharing some common features (either intrinsically, e.g. clinical diagnosis; or procedurally, e.g. part of same study, provenance ...).
@mdmiller53 I like the way you think :) Could you maybe generate a PR or create an issue - or even expand here - regarding the way reads/variants/etc, including downstream analysis models, would integrate with the other teams' components in optimizing for distributed over-the-wire data models that incorporate what you are envisioning? You've probably seen my previous post on how this would all fit together, but I would be interested in how you would tie it all together to allow room for growth in capabilities, which I'm happy to say has also been my message over the past year. Not sure if you have already seen the Priorities for the Data Working Group document, but I included it just in case :)
Thank you and look forward to it, Paul
@pgrosu in looking at the priorities, it is item 3, "Expression, methylation, and other epigenetic data", that i am referring to. the metadata will also eventually need to be suitable for describing that data in the summarized format (tsv usually, TCGA Level 3) once the WG gets to it, not just the initial BAM from sequencing that the summarized and corrected values are generated from. i've been trying to put together a document that describes the work we are doing in my lab at ISB for the metadata as part of our CGC contract. i'm off on vacation next week but will get to that on my return, near the end of the month
@mdmiller53 I am very excited about what I hear and eagerly look forward to your document. There have been many discussions on the importance of metadata searching. You've probably seen some of my posts on inverted indices for implementing them in a distributed, replicated balanced data-structure for optimized retrieval on any (meta)data efficiently - basically one of the core concepts of how large search engines are implemented these days.
Regarding the expression portion, there is an RNA-Seq task team which you might want to connect with. The only people I know who are part of that team are Sean (@saupchurch) and Alastair (@afirth).
In any case, hope you have a wonderful vacation and look forward to reading your document - no rush :)
Thank you, Paul
@diekhans -- I believe that #389, which is now committed, means we can close this issue. Doing so now; feel free to reopen if you disagree.
Currently, the Dataset type is poorly specified, and included in reads.avdl. We have some comments like "This needs to be clarified."
I propose:
- Rename it to DataSet, so that we are consistent with VariantSet, ReferenceSet, ReadGroupSet and others.
- Make the definition of DataSet much more explicit, and make a formal definition of the data model as a hierarchy. (This relates to #247 also, since my proposal there assumes a hierarchy.) In this hierarchy, the DataSet is the root, and it has child VariantSets and ReadGroupSets. Each VariantSet has child Variants; each ReadGroupSet has child ReadGroups, and ReadGroups have child Reads. We don't need to make any changes to the API to do this; we just need to be more explicit about what the model means.
This formalisation should make reasoning about authorisation much more straightforward.
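To make the proposed hierarchy concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the field names (id, variants, read_groups and so on) and the helper ids_under are hypothetical and are not taken from the Avro schemas; only the type names and parent/child relationships come from the proposal above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Read:
    id: str


@dataclass
class ReadGroup:
    id: str
    reads: List[Read] = field(default_factory=list)


@dataclass
class ReadGroupSet:
    id: str
    read_groups: List[ReadGroup] = field(default_factory=list)


@dataclass
class Variant:
    id: str


@dataclass
class VariantSet:
    id: str
    variants: List[Variant] = field(default_factory=list)


@dataclass
class DataSet:
    """Root of the hierarchy: every other object has exactly one DataSet ancestor."""
    id: str
    variant_sets: List[VariantSet] = field(default_factory=list)
    read_group_sets: List[ReadGroupSet] = field(default_factory=list)


def ids_under(dataset: DataSet) -> List[str]:
    """List the ids of every object reachable from the DataSet root.

    Hypothetical helper: if access is granted per DataSet, this is the
    full set of objects that a single authorisation decision covers.
    """
    ids = []
    for variant_set in dataset.variant_sets:
        ids.append(variant_set.id)
        ids.extend(variant.id for variant in variant_set.variants)
    for read_group_set in dataset.read_group_sets:
        ids.append(read_group_set.id)
        for read_group in read_group_set.read_groups:
            ids.append(read_group.id)
            ids.extend(read.id for read in read_group.reads)
    return ids


example = DataSet(
    id="ds1",
    variant_sets=[VariantSet(id="vs1", variants=[Variant(id="v1")])],
    read_group_sets=[
        ReadGroupSet(
            id="rgs1",
            read_groups=[ReadGroup(id="rg1", reads=[Read(id="r1")])],
        )
    ],
)
covered = ids_under(example)
```

Because each object sits under exactly one root, an authorisation check on any object reduces to a check on its DataSet.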
Thoughts?