lh3 commented 10 years ago

The relationship between the data types described in ga4gh.avdl is not clear to me. More specifically:

A ReadSet seems a part of dataset (line 76), but dataset is not defined.
A Read seems a part of ReadSet (line 103). But how is it related to a ReadGroup? Are we supposed to get the ReadGroup information by decoding tags (line 145)?
A ReadGroup is part of HeaderSection (line 61) and a HeaderSection is part of ReadSet (line 86) and can appear multiple times in a ReadSet (also line 86). Then a ReadSet effectively represents an ensemble of multiple BAM files. Is this the intention? If so, is it possible to identify a single BAM?
What is the purpose of GA4GHHeader (line 42)? Why is it allowed to appear multiple times in HeaderSection (line 55)?

If we want to follow the SAM structure, perhaps it would be clearer to define Read as a part of ReadGroup and a ReadGroup as a part of ReadSet. A ReadSet could represent one SAM file which contains a single HeaderSection (in this case, we can merge HeaderSection into ReadSet). Another option is to skip the concept of ReadSet.
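To make the first option concrete, here is a minimal sketch of the SAM-like containment, written in Python rather than Avro IDL so that it can be run. Every class and field name below (including `header_lines`) is an illustrative assumption, not a definition taken from ga4gh.avdl.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Read:
    name: str
    sequence: str


@dataclass
class ReadGroup:
    id: str
    # Reads are contained in a ReadGroup, not directly in a ReadSet.
    reads: List[Read] = field(default_factory=list)


@dataclass
class ReadSet:
    # Stands in for one SAM file. The header content lives directly on
    # the ReadSet, i.e. HeaderSection has been merged into it.
    id: str
    header_lines: List[str] = field(default_factory=list)  # hypothetical field
    read_groups: List[ReadGroup] = field(default_factory=list)

    def all_reads(self) -> List[Read]:
        """Flatten the reads of every contained ReadGroup."""
        return [read for group in self.read_groups for read in group.reads]


rg1 = ReadGroup(id="rg1", reads=[Read("r1", "ACGT"), Read("r2", "TTGA")])
rg2 = ReadGroup(id="rg2", reads=[Read("r3", "GGCC")])
read_set = ReadSet(id="sample1", header_lines=["@HD\tVN:1.4"], read_groups=[rg1, rg2])
read_names = [read.name for read in read_set.all_reads()]
```

With this layout a single SAM or BAM file is identified simply by the `ReadSet` that represents it.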

dglazer commented 10 years ago

Good questions; here are partial responses:

Re dataset -- the Google prototype uses dataset as an administrative collection of ReadSets (e.g. different people may be allowed to add data to different datasets). I agree it's odd to have it half-present in the .avdl. @massie, I don't remember whether we consciously left datasets out of the first pass of the schema. If we want to add a GADataSet object, we should open up a new issue for that.
Re ReadSet vs. ReadGroup -- see #16 for an attempt to clarify this exact point. I think once that's settled, the other questions will mostly fall out.
Re HeaderSection -- the idea is to allow mapping back to raw BAM files, but not be constrained by legacy BAM semantics. It will often be the case that one BAM file maps to one ReadSet, but that's not required; often one BAM will map to multiple ReadSets, and someone might want multiple BAMs to map to a single ReadSet (although I expect that's a rare case).
Re Header -- see #2 - the current schema is trying to just mirror what's in BAMs, but I agree with that issue that we should do better, and capture the same data in a more logical way.

cassiedoll commented 10 years ago

#21 proposes removing the header field (which is unnecessary for us), and inlining the headersection field for clarity. #2 still needs further followup, but #21 should be a step in the right direction.

lh3 commented 10 years ago

On Header and HeaderSection, I like Cassie's #21. It is cleaner to eliminate Header and merge HeaderSection into ReadSet. With #21, a ReadSet is similar to a BAM file in effect (see also #16) - it consists of one or more read groups and has a single sequence dictionary and a single program dictionary. I am okay with this definition.

There still remains a question about ReadSet vs ReadGroup. At line 103 of the schema, Read is part of ReadSet. We wouldn't know the ReadGroup information without decoding the tags (line 145). In addition, because the ReadGroup info of each read is not indexed, we will not be able to quickly pull reads from a particular read group. I think it would be better to change line 103 to: string readGroupId;. Thus we will have a hierarchy: Read => ReadGroup => ReadSet => DataSet.
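A rough sketch of why the explicit reference helps, in Python for runnability. The field `read_group_id` mirrors the proposed `string readGroupId;`; the other names and the helper `index_by_read_group` are hypothetical and only illustrate the idea of indexing reads by read group instead of decoding tags.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Read:
    id: str
    read_group_id: str  # mirrors the proposed `string readGroupId;`
    sequence: str


def index_by_read_group(reads: Iterable[Read]) -> Dict[str, List[Read]]:
    """Build a read-group-id -> reads index (hypothetical helper).

    Because the read group is a plain field, no tag decoding is needed.
    """
    index: Dict[str, List[Read]] = defaultdict(list)
    for read in reads:
        index[read.read_group_id].append(read)
    return dict(index)


reads = [
    Read("r1", "rg1", "ACGT"),
    Read("r2", "rg2", "TTGA"),
    Read("r3", "rg1", "GGCC"),
]
by_group = index_by_read_group(reads)
rg1_read_ids = [read.id for read in by_group["rg1"]]
```

A server could maintain such an index once at load time, so pulling all reads of one read group becomes a lookup rather than a scan over every read's tags.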

dglazer commented 10 years ago

Thanks Heng. Breaking this apart:

If I hear right, you support both #21 and #16 -- great. Please add explicit "+1" responses there. (To stick with the new process we're test-driving.)
I agree with the container hierarchy you list, with the addition that ReadGroup is optional -- not all Reads have to belong to a ReadGroup. (Based on my understanding of the state of the art; I know we wrote code to handle that case.)
The current schema treats Read, ReadSet, and DataSet as first-class objects, with unique IDs and formal access methods. As you point out, ReadGroups are different and not normalized. I don't know if ReadGroups should be promoted to first-class objects; it depends mostly on whether we think they're an important construct that will be regularly accessed by API callers, or a lower-priority construct that's tied to particular sequencing hardware. If you think they should be promoted, are you up for submitting a pull request with a specific proposal, so we can discuss the details there? Thanks.

richarddurbin commented 10 years ago

No. I think that we decided that every Read should belong to a ReadGroup. Then a ReadSet can contain ReadGroups or other ReadSets (or both). From my point of view ReadGroups are very important objects. Reads only come in ReadGroups, and many properties can only be assigned to ReadGroups, not to individual reads. For example, the library used (and hence the sample the read comes from), the instrument run on which the read occurred etc. should all belong to the ReadGroup. The only properties attached directly to the Read itself are measurements of the Read, such as the sequence, quality, mate, mapping location etc.
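The structure described above could be sketched roughly as follows, in Python for runnability. All names are illustrative assumptions and not part of ga4gh.avdl: group-level properties (library, sample, instrument run) sit on `ReadGroup`, per-read measurements sit on `Read`, and a `ReadSet` may hold ReadGroups, other ReadSets, or both.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Read:
    # Only measurements of the read itself.
    sequence: str
    quality: str
    position: int


@dataclass
class ReadGroup:
    id: str
    library: str         # group-level property
    sample: str          # follows from the library
    instrument_run: str  # group-level property
    reads: List[Read] = field(default_factory=list)


@dataclass
class ReadSet:
    id: str
    # Members may be ReadGroups, nested ReadSets, or a mix of both.
    members: List[Union["ReadSet", ReadGroup]] = field(default_factory=list)

    def read_groups(self) -> List[ReadGroup]:
        """Collect every ReadGroup reachable from this ReadSet, depth first."""
        found: List[ReadGroup] = []
        for member in self.members:
            if isinstance(member, ReadGroup):
                found.append(member)
            else:
                found.extend(member.read_groups())
        return found


rg1 = ReadGroup("rg1", "libA", "NA12878", "run42")
rg2 = ReadGroup("rg2", "libA", "NA12878", "run43")
rg3 = ReadGroup("rg3", "libB", "NA12891", "run44")
per_sample = ReadSet("sample-NA12878", [rg1, rg2])
cohort = ReadSet("cohort", [per_sample, rg3])
samples = sorted({group.sample for group in cohort.read_groups()})
```

Under this layout the sample of any read is found through its ReadGroup, so nothing sample-related needs to be repeated on individual reads.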

This is important to resolve.

Richard

dglazer commented 10 years ago

Thanks @richarddurbin -- I've responded in detail to your similar comment on #24.

@lh3 - I suggest we close this issue as obsolete; I believe all the actionable bits are covered in #16, #21, and your new more-detailed #24. If you agree, can you close it? Thanks.

ga4gh / ga4gh-schemas

The relationship between data types in ga4gh.avdl #20
