ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
214 stars 114 forks source link

The relationship between data types in ga4gh.avdl #20

Closed lh3 closed 10 years ago

lh3 commented 10 years ago

It is not clear to me the relationship between data types described in ga4gh.avdl. More specifically:

If we want to follow the SAM structure, perhaps it would be clearer to define Read as a part of ReadGroup and a ReadGroup as a part of ReadSet. A ReadSet could represent one SAM file which contains a single HeaderSection (in this case, we can merge HeaderSection into ReadSet). Another option is to skip the concept of ReadSet.

dglazer commented 10 years ago

Good questions; here are partial responses:

cassiedoll commented 10 years ago

21 proposes removing the header field (which is unnecessary for us), and inlining the headersection field for clarity. #2 still needs further followup, but #21 should be a step in the right direction.

lh3 commented 10 years ago

On Header and HeaderSection, I like Cassie's #21. It is cleaner to eliminate Header and merge HeaderSection into ReadSet. With #21, a ReadSet is similar to a BAM file in effect (see also #16) - it consists of one or more read groups and has a single sequence dictionary and a single program dictionary. I am okay with this definition.

There still remains a question about ReadSet vs ReadGroup. At line 103 of the schema, Read is part of ReadSet. We wouldn't know the ReadGroup information without decoding the tags (line 145). In addition, because the ReadGroup info of each read is not indexed, we will not be able to quickly pull reads from a particular read group. I think it would be better to change line 103 to: string readGroupId;. Thus we will have a hierarchy: Read => ReadGroup => ReadSet => DataSet.

dglazer commented 10 years ago

Thanks Heng. Breaking this apart:

richarddurbin commented 10 years ago

No. I think that we decided that every Read should belong to a ReadGroup. Then a ReadSet can contain ReadGroups or other ReadSets (or both). From my point of view ReadGroups are very important objects. Reads only come in ReadGroups, and many properties can only be assigned to ReadGroups, not to individual reads. For example, the library used (and hence the sample the read comes from), the instrument run on which the read occurred etc. should all belong to the ReadGroup. The only properties attached directly to the Read itself are measurements of the Read, such as the sequence, quality, mate, mapping location etc.

This is important to resolve.

Richard

On 20 Apr 2014, at 20:06, David Glazer wrote:

Thanks Heng. Breaking this apart:

if I hear right, you support both #21 and #16 -- great. Please add explicit "+1" responses there. (To stick with the new process we're test-driving.) I agree with the container hierarchy you list, with the addition that ReadGroup is optional -- not all Reads have to belong to a ReadGroup. (Based on my understanding of the state of the art; I know we wrote code to handle that case.) The current schema treats Read, ReadSet, and DataSet as first-class objects, with unique IDs and formal access methods. As you point out ReadGroups's are different and not normalized. I don't know if ReadGroup's should be promoted to first-class objects; it depends mostly on whether we think they're an important construct that will be regularly accessed by API callers, or a lower-priority construct that's tied to particular sequencing hardware. If you think they should be promoted, are you up for submitting a pull request with a specific proposal, so we can discuss the details there? Thanks. — Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

dglazer commented 10 years ago

Thanks @richarddurbin -- I've responded in detail to your similar comment on #24.

@lh3 - I suggest we close this issue as obsolete; I believe all the actionable bits are covered in #16, #21, and your new more-detailed #24. If you agree, can you close it? Thanks.