ReadGroup should be a first-class data type

ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24

http://ga4gh.org

Apache License 2.0

214 stars 110 forks source link

ReadGroup should be a first-class data type #24

Closed lh3 closed 10 years ago

lh3 commented 10 years ago

In the current schema, ReadGroup is an attribute of a Read only accessible with the tags field of Read (line 145). It is not a first-class object and probably not indexed.

We should promote ReadGroup to a first-class data type for several reasons:

A read group is the smallest set of homogeneous reads. Although there is still ambiguity in this definition, ReadGroup is in practice more clearly defined and more stable than ReadSet (see #16) and DataSet. It is a first-class concept in SAM widely used in data processing. Many downstream tools, such as GATK, samtools and Picard, focus on read groups and frequently ignore the concept of file (or ReadSet). As a matter of fact, all ReadGroups from 1000g are associated with SRA/ERA accession numbers. Most 1000g BAM files are not for now, because ReadSet is harder to define than ReadGroup.
Sample and experimental info are attached to ReadGroup in SAM, not to a physical file or a ReadSet in the schema. When we query by sample, by library or by sequencing technology, we are directly searching for ReadGroups, not for ReadSets. In addition, it is possible that a ReadSet may contain ReadGroups from different samples - SAM is the same. When we query by sample, we will have to break a ReadSet. ReadGroup is the minimal unit.

In SAM, a read may not belong to any read groups, but in ReadStore, we can require a read to belong to one and only one ReadGroup. If the input file lacks read groups, we can automatically generate a new read group for all the reads in the input file.

richarddurbin commented 10 years ago

I support all this, consistent with my previous post. In particular, I support that each Read must belong to one and only one ReadGroup.

Richard

On 21 Apr 2014, at 03:15, Heng Li wrote:

In the current schema, ReadGroup is an attribute of a Read only accessible with the tags field of Read (line 145). It is not a first-class object and probably not indexed.

We should promote ReadGroup to a first-class data type for several reasons:

A read group is the smallest set of homogeneous reads. Although there is still ambiguity in this definition, ReadGroup is in practice more clearly defined and more stable than ReadSet (see #16) and DataSet. It is a first-class concept in SAM widely used in data processing. Many downstream tools, such as GATK, samtools and Picard, focus on read groups and frequently ignore the concept of file (or ReadSet). As a matter of fact, all ReadGroups from 1000g are associated with SRA/ERA accession numbers. Most 1000g BAM files are not for now, because ReadSet is harder to define than ReadGroup.

Sample and experimental info are attached to ReadGroup in SAM, not to a physical file or a ReadSet in the schema. When we query by sample, by library or by sequencing technology, we are directly searching for ReadGroups, not for ReadSets. In addition, it is possible that a ReadSet may contain ReadGroups from different samples - SAM is the same. When we query by sample, we will have to break a ReadSet. ReadGroup is the minimal unit.

In SAM, a read may not belong to any read groups, but in ReadStore, we can require a read to belong to one and only one ReadGroup. If the input file lacks read groups, we can automatically generate a new read group for all the reads in the input file.

— Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

dglazer commented 10 years ago

@lh3, @richarddurbin -- great to get some bio-expertise in this discussion -- thanks for jumping in. Some clarifying comments / questions:

To be clear on two points where I think we all agree:
1. All proposals, including the current one, preserve the mapping from Read to ReadGroup (RG), and the critical RG-level data like sample and library.
2. ReadSets are not meant to map 1:1 to files (although they might occasionally do so, depending on how the data is laid out). We all want to break away from the old file-centric model.
Today's FASTQ/BAM files support 'orphaned' Reads that don't belong to any RG. Heng proposes that we auto-create a new RG for those orphans.
1. Would we create exactly one RG per orphan-containing file?
2. Would we just give it default empty properties (e.g. sample, library)?
Heng, re downstream tools like GATK -- my understanding was that the typical unit of analysis was "all the RGs that came from a single sample processed a single way", which are typically all stored in a single file. (In other words, one file contains one or more "analysis units", each of which contains one or more RGs.)
1. Is that accurate? (Life would be a lot simpler if the unit of analysis were a RG, but my understanding is it's not.)
2. ReadSets are meant to capture that notion of an "analysis unit". Our current BAM importer uses a simple heuristic of "all RGs in one file with one sample name go into one ReadSet". Is that a good default? (It only has to be mostly accurate; it would be easy to support custom import behavior if important.)
Re making RGs first-class -- how globally populatable is a RG? My understanding is "not very", meaning that it's rare for a single RG to have data stored in different files, and very very rare for people from different teams, or even different months in the same lab, to contribute data to the same RG. (Or conversely -- the common case is for all the data in a single RG to be generated at one point in time and space, and stored in one file.)
1. Is that accurate?

richarddurbin commented 10 years ago

First, I am happy to get away from thinking in terms of standard BAM files.

2.1: Yes, I would create exactly one RG per orphan-containing file. I would deprecate orphan-containing files. 2.2: Yes, if this happens give the RG default empty properties.

3: Certainly the natural units of analysis may not be single ReadGroups, nor single machine runs. But (a) these things change with sequencing technology - quite possibly on the Xten there will be a 1-1 correspondence, (b) whereas there may will be hundreds of millions of reads in a ReadGroup, there will be relatively small numbers such as tens or hundreds of ReadGroups in the ReadSet constructed for most analyses, in particular any single sample analysis. (There might be complex analyses across very large numbers of samples that involve many thousands of ReadGroups, but this is still much smaller than hundreds of millions.)

After our earlier discussion, I realised something else though. A basic operation we want to support is to extract a genomic slice of data from one or more ReadSets, containing all reads that intersect a genomic region. Is the resulting set of reads a ReadSet? If so, then it is obvious that this ReadSet is not composed of complete ReadGroups. So now we would have two types of ReadSet: "complete" ReadSets that are built recursively from ReadGroups, and more general ReadSets for example built from projections of complete ReadSets satisfying some predicate, or set operations on these. It would still be true that all the Reads in a general ReadSet would each belong to one and only one ReadGroup, and would obtain some of their properties (e.g. their sample) from that ReadGroup.

Does that fit into your current design? Or will there be a separate collection class for these more general sets of reads?

Richard

On 21 Apr 2014, at 14:54, David Glazer wrote:

@lh3, @richarddurbin -- great to get some bio-expertise in this discussion -- thanks for jumping in. Some clarifying comments / questions:

To be clear on two points where I think we all agree:

All proposals, including the current one, preserve the mapping from Read to ReadGroup (RG), and the critical RG-level data like sample and library. ReadSets are not meant to map 1:1 to files (although they might occasionally do so, depending on how the data is laid out). We all want to break away from the old file-centric model. Today's FASTQ/BAM files support 'orphaned' Reads that don't belong to any RG. Heng proposes that we auto-create a new RG for those orphans.

Would we create exactly one RG per orphan-containing file? Would we just give it default empty properties (e.g. sample, library)? Heng, re downstream tools like GATK -- my understanding was that the typical unit of analysis was "all the RGs that came from a single sample processed a single way", which are typically all stored in a single file. (In other words, one file contains one or more "analysis units", each of which contains one or more RGs.)

Is that accurate? (Life would be a lot simpler if the unit of analysis were a RG, but my understanding is it's not.) ReadSets are meant to capture that notion of an "analysis unit". Our current BAM importer uses a simple heuristic of "all RGs in one file with one sample name go into one ReadSet". Is that a good default? (It only has to be mostly accurate; it would be easy to support custom import behavior if important.) Re making RGs first-class -- how globally populatable is a RG? My understanding is "not very", meaning that it's rare for a single RG to have data stored in different files, and very very rare for people from different teams, or even different months in the same lab, to contribute data to the same RG. (Or conversely -- the common case is for all the data in a single RG to be generated at one point in time and space, and stored in one file.)

Is that accurate? — Reply to this email directly or view it on GitHub.

cassiedoll commented 10 years ago

Just one small comment, when you: "extract a genomic slice of data from one or more ReadSets, containing all reads that intersect a genomic region"

you do not get a Readset - you just get an array of Reads. And each read has a readsetId which you can use to lookup Readset information if you wish.

massie commented 10 years ago

richarddurbin commented 10 years ago

That is fine to just get an array of reads for the more general slices and so on.

But I am a bit puzzled by the .avdl file as I see it. It seems to say that each Read is in one and only one ReadSet, but I thought that each Read was going to be in one and only one ReadGroup, and that the ReadSet definition was going to be recursive, built from ReadGroups and other ReadSets. So there is no need in Read for a readSetId, but there is a need for a unique non-optional readGroupId, whereas in ReadSet we would need an array of readGroupId and an array of readSetId.

Also surely there must be an id for ReadSet and ReadGroup. These can't be optional, otherwise we can't include these in higher level sets.

Also surely we need to have a field for the reference sequence for the alignment to map to in Read. I think this is an int, being the index into the array of ReferenceSequence in the header.

Then I might note that there are two special types of ReadSet, one containing all the accessible ReadGroups for a library, and the other all the accessible ReadGroups for a sample (often but not always these are the same).

I am not familiar with AVRO and avdl, but all the links between objects seem to be ids, which are just strings, which doesn't seem to impose very strong typing. Is this typical? It seems quite low level. I would have imagined something more like:

Read { ReadGroupID group ; optional String name ; // I don't understand the distinction between this id and the name, given both are optional // info about the sequence, the alignment, and the pair }

ReadGroup { ReadGroupID id ; LibraryID library ; // This must be present. If not present use a default LibraryUnknown object. // strictly SampleID is a property of Library, but maybe it is good to denormalise and have it here as well. // machine run information, including technology, centre, date etc. }

ReadSet { ReadSetID id ; array readGroups ; array subReadSets ; // can't have both readGroups and subReadSets empty // potential information about the logical unit that this read set represents }

There could be an option to flatten the definition of ReadSet to its constituent set of ReadGroups, even if was constructed from other ReadSets, but I am not sure that is a good idea. I imagine that there might be a ReadSet of all the ReadGroups for a sample, and then one or more higher level ReadSets containing, for example, all the sample ReadSets for a pedigree or set of samples. Then it would be convenient if new data are collected for the sample to just add the new ReadGroup to the ReadSet for its sample, and for any higher level ReadSets to automatically acquire this.

Richard

On 21 Apr 2014, at 16:59, cassiedoll wrote:

Just one small comment, when you: "extract a genomic slice of data from one or more ReadSets, containing all reads that intersect a genomic region"

you do not get a Readset - you just get an array of Reads. And each read has a readsetId which you can use to lookup Readset information if you wish.

— Reply to this email directly or view it on GitHub.

cassiedoll commented 10 years ago

Generally speaking, I do not like the idea of readsets being recursive. It seems to add complexity to the design when a flat hierarchy is adequate.

The way I see things, analysis is typically done at the readset level (e.g. a sample analyzed in a particular manner). Variant calling would be an example. You would use a method like /reads/search?readsetId=x to get the reads you want to do analysis on.

And if that analysis needs to take readgroup information into account, it can use the read.readgroupId field (which I proposed in #25) to do so. This makes for a very simple, hard to screw up, api. If it proved useful, I could see readgroupId acting as a filter on read search. (/reads/search?readsetId=x&readgroupId=y)

Read search can support other mechanisms to support querying over multiple readsets if necessary. That allows combining readsets without complicating the data model.

If we pretend that there isn't any recursion, what use case is not satisfied? If we have a specific example it might make things clearer.

richarddurbin commented 10 years ago

If we want the analysis to be over readsets, which both you and I do, then you can't have a unique readset for each read. There will be different analyses that you want to perform, over overlapping but non-identical sets of reads. We have plenty of experience of this.
The one thing that will always be true is that we either want all the reads in a readgroup, or none of them. So the readsets used for analysis are logically made out of readgroups.

I think you are confusing the API with the implementation. We should be talking about the conceptual API. It is cheap for an implementation to find all the reads in a readset by first finding all the read groups in the read set, then all the reads in those readgroups. The second step will dominate so much over the first in time and complexity that the first is negligible. There are opportunities for smart localisation/parallelisation of the reads in a readgroup, I would imagine. I gave a specific example of where recursion would be useful, where I might want to update all the readsets that contain a particular sample. But as I said, if you really don't like the recursion it would be possible to not allow it in the definition. You would still need readsets distinct from readgroups, with readsets made from a set of readgroups, and readgroups allowed to be in multiple readsets.

Your alternate proposal would require a read to keep track of all the readsets in which it participates. This would need an arbitrarily extensible structure such as an array in the read, which then is more costly to search, and also the process of creating a new readset will be very costly, requiring updating all the hundreds of millions of reads in it. In my proposal the unique readgroup for a read will be fixed, write-once, and creating a readset just requires creating an array of the much smaller number of readgroups (or if recursive readsets) in it.

Richard

On 21 Apr 2014, at 18:39, cassiedoll wrote:

Generally speaking, I do not like the idea of readsets being recursive. It seems to add complexity to the design when a flat hierarchy is adequate.

The way I see things, analysis is typically done at the readset level (e.g. a sample analyzed in a particular manner). Variant calling would be an example. You would use a method like /reads/search?readsetId=x to get the reads you want to do analysis on.

And if that analysis needs to take readgroup information into account, it can use the read.readgroupId field (which I proposed in #25) to do so. This makes for a very simple, hard to screw up, api. If it proved useful, I could see readgroupId acting as a filter on read search. (/reads/search?readsetId=x&readgroupId=y)

Read search can support other mechanisms to support querying over multiple readsets if necessary. That allows combining readsets without complicating the data model.

If we pretend that there isn't any recursion, what use case is not satisfied? If we have a specific example it might make things clearer.

— Reply to this email directly or view it on GitHub.

lh3 commented 10 years ago

I am aware that in theory, ReadSet does not necessarily correspond to a BAM file, but what does a ReadSet represent in practice? We cannot simply group data from one sample. For example, 1000g sequenced NA12878 many times. We often would not want to mix data from different sources and technologies in one analysis. I am not sure how a ReadSet can be different from a submitted BAM file in practice.

At the same time, different analyses may prefer to use different read groups in one file. For example, suppose we have short- and long-insert data in one file. For variant calling, we would prefer to use short-insert data only. For de novo assembly and the discovery of SVs, we will need both types of reads. There is no "analysis unit" suitable for all types of analyses. Depending on the type of analyses, we collect ReadGroups to form the right analysis unit. ReadGroup is the lowest denominator.

I think a concept like ReadSet is useful, but ReadGroup is more fundamental conceptually, more clearly defined and universal to all analyses.

Re @dglazer point 4: in SAM, a read group is allowed to be split into multiple files in theory. In ReadStore, it is probably clearer to require each read group to be present in one and only one file or ReadSet. All reads strictly fit into a tree hierarchy.

dglazer commented 10 years ago

(Meta-comment: Good discussion; I wish we could do this all in one room around a whiteboard; I think we'd be a lot more efficient.)

Summarizing the answers I hear to my four questions from this morning:

we agree that a RG is the smallest group of homogeneous reads, and that it's important to preserve RG-level data like sample and library
we agree that every read should belong to exactly one RG, and that therefore import should auto-create one RG per orphan-containing file, and give it default empty properties
active discussion; see below
we agree that import only needs to support RGs that are fully contained in a single file

Re (3) -- I now realize that we're starting with different mental models of how RGs map to "analysis units", that those different models are leading to different API requirements, and that those requirements are leading to different implementation opinions. I hope that getting in sync on model will help us agree on the use cases we want the API to allow and to encourage.

Here's a picture of three possible models: screenshot 2014-04-21 at 4 41 30 pm

Things to note about the picture:

all the models share the concept of RG (green shapes), as defined above
all the models support analysis and exploration happening over collections of one or more RGs
the models differ in how we think about those collections
- the top model is built around ad-hoc collections of RGs (red dotted line); you can draw a red dotted line around one or more RGs, and no particular red dotted line is 'more equal' than any other one
- the middle model says one particular collection of RGs (blue solid line) is used >90% of the time, for both analysis and exploration, but other collections of RGs (red dotted line) are possible
- the bottom model say the blue solid lines are far and away the primary object, with a given RG always analyzed as part of the same collection

My original mental model was largely based on my personal experience of getting sequenced. I sent off a sample, and got back a BAM file that contained all the ILMN reads for that sample. They happened to be divided into four RGs, but I didn't care about that -- I always want to analyze all of those reads as a single chunk, and I always want to browse all of them as a single chunk. When I'm picking data to work with from a menu, I always want to see "PGP12345", not have to find and select the four separate RGs independently. That story led me to using the bottom row as my initial mental model.

I realize that's not the only story, though, and that there are valid reasons to sometimes work with individual RGs, or even with different collections of RGs for different purposes. But I still believe that, in the vast majority of use cases, one collection is 'more equal' than all others, as it was in my story. That belief led me to using the middle row as my mental model -- I agree we need to support people working with ad hoc collections of RGs, but I think we can make everyone's life easier by strongly encouraging a particular collection as the preferred one, that's right >>90% of the time.

What do folks think? Do you think of working with RGs as like the top model, the middle model, or some other model?

lh3 commented 10 years ago

Strictly speaking, none of these models cover all the use cases. The better model is close to the middle except that the red circles are allowed to group ReadGroups across multiple blue circles.

The 1000g data Google has downloaded are the cleanest part. They are homogeneous and supposed to be analyzed the same way. It is fair to say that most of time we are only working with one particular collection. However, there are a variety of other data types and ways of analysis where multiple collections are frequently needed. Not exposing ReadGroups will make these analyses harder. We should not mandate or encourage a way of analysis based on our experience in the simplest 1000g case.

With your comment, @dglazer, I still don't know how the scope of ReadSet is defined if it is different from a submitted file. When I don't know what a ReadSet contains, I cannot simply take all the reads in the ReadSet. I would more like to retrieve by read group instead.

I am also not sure how your "PGP12345" example is against exposing ReadGroup as the first-class object. It is about the user interface, not related to the internal data models. In addition, we sometimes do want to select individual read groups. For example, NA12878 has been sequenced tens of times. When I retrieve NA12878 data, I would certainly want to select a few read groups of my interest.

Note that I am in favor of a concept of file/ReadSet (as well as ReadCollection in the initial NCBI proposal) from the very beginning. However, exposing ReadSet should not hide ReadGroup. ReadGroup should be a first-class object. It conceptually simplifies the data model, and makes it more extensible to all use cases.

dglazer commented 10 years ago

Heng, I agree RGs need to be exposed, and it should be possible to do analysis over ad hoc collections of RGs -- we're discussing how, not if, to expose them. Here's an updated picture based on your input: screenshot 2014-04-22 at 6 02 04 am The key elements of this model are:

it is easy to work with ReadSets, to support the common case where one particular collection of RGs is the desired unit of analysis and exploration
it is possible to work with ad hoc collections of one or more RGs, to support other types of analysis

To your question of how the scope of a ReadSet is defined -- logically, it's "the collection of RGs we think people are going to want to work with >>90% of the time". One common case, as you suggest, is that all the RGs from a single file will make sense as a single ReadSet. Another common case is that a single file will be broken into a handful of ReadSet's, each of which contains all the RGs that share a sample name. (Which is the heuristic used by Google's current API implementation.) And I suspect that, either now or down the road with new sequencing hardware, another common case will be that a single RG will make sense as a ReadSet on its own.

To your comment about NA12878 having been sequenced many times -- there's no problem there, since ReadSet is not defined solely by sample. There can be many ReadSets all containing data from NA12878, each of which is made up of one or more RGs that will typically be analyzed together. (E.g. because they were sequenced at the same time using the same sample prep and hardware.)

The next step, assuming this model sticks, is to propose specific schema changes that support it. (Some of which, such as #25, are already in flight.)

lh3 commented 10 years ago

Yes, this model looks better - we usually work with blue circles, but sometimes need to use a subset of read groups or combine read groups from different blue circles. The frequency of taking the red circles depends on the type of analysis.

Have we reached a consensus that ReadGroup should be a first-class object like ReadSet? If so, we may close this issue and then move on to the "how" part in #25.

richarddurbin commented 10 years ago

Hello Heng,

Overall I am happy with the way this discussion is going, with David's last drawing and email.

I want the read sets to be more virtual. In practice we do different analyses requiring different collections of read groups.

In fact in practice we currently make different BAM files refactored in different ways from the same read groups, e.g. the 1000 Genomes distribution BAMs compared to the original deposited BAMs collected off the machines. In a more database ReadStore world, we don't want to have create copies like this.

So I don't want a unique read set for each read group. Why do you want this? I don't really understand the argument that the only natural ReadSet is defined by a file. From my point of view we are aiming to get away from thinking about files.

I think David G should be very wary of the argument that 90% of the analysis will be on a preferred ReadSet. At both Sanger and Broad we repeatedly jointly analyse data across different sets of samples. I can imagine if I think two clinical samples have the same genetic lesion wanting to jointly analyse their data in one ReadSet.

To summarise, I expect that in David's coloured diagram world a typical "non-standard" red analysis will not be across a subset of ReadGroups from one sample with a subset of another sample, but rather including all the ReadGroups for several samples.

Richard

On 21 Apr 2014, at 20:32, Heng Li wrote:

I am aware that in theory, ReadSet does not necessarily correspond to a BAM file, but what does a ReadSet represent in practice? We cannot simply group data from one sample. For example, 1000g sequenced NA12878 many times. We often would not want to mix data from different sources and technologies in one analysis. I am not sure how a ReadSet can be different from a submitted BAM file in practice.

At the same time, different analyses may prefer to use different read groups in one file. For example, suppose we have short- and long-insert data in one file. For variant calling, we would prefer to use short-insert data only. For de novo assembly and the discovery of SVs, we will need both types of reads. There is no "analysis unit" suitable for all types of analyses. Depending on the type of analyses, we collect ReadGroups to form the right analysis unit. ReadGroup is the lowest denominator.

I think a concept like ReadSet is useful, but ReadGroup is more fundamental conceptually, more clearly defined and universal to all analyses.

Re @dglazer point 4: in SAM, a read group is allowed to be split into multiple files in theory. In ReadStore, it is probably clearer to require each read group to be present in one and only one file or ReadSet. All reads strictly fit into a tree hierarchy.

— Reply to this email directly or view it on GitHub.

lh3 commented 10 years ago

@richarddurbin I have kept arguing with myself whether having a concept of ReadSet is beneficial. I find I don't really know the answer in the end. I am open to either option. EDIT: I don't know whether we should define ReadSet, but I am sure that having ReadGroup as the first-class object is crucial.

dglazer commented 10 years ago

Quick thoughts; more later:

glad we're converging on a model
to @richarddurbin 's point -- yes, my mental model includes red analyses that cover all the RGs for several samples -- the point of the red dotted lines is they should be flexible enough to cover any useful collection of RGs
@lh3 , I think we should wait to close this issue until we have a pull request in flight that fully satisfies the requirements; #25 is a good start, but not yet sufficient IMO
half baked: I expect that pull request will mostly consist of extra query options to /reads/search, since that's where a caller of the API actually says "here's what I want to analyze/browse". We'll see.

lh3 commented 10 years ago

The following lengthy comment is related but a little off topic. I don't know where to put them, so raise them here for now.

I expect that pull request will mostly consist of extra query options to /reads/search, since that's where a caller of the API actually says "here's what I want to analyze/browse". We'll see.

We should not confine ourselves to what Google has at present. In fact, as a user, I won't use the google APIs because a) readset_id is a hash string which is meaningless to me - I don't know what readset_id I should use; b) the scope of read set is not clearly defined - I don't know if the result is right for my purpose. I think the most useful end-user query is something like /reads/search?collection=SRP000547&sample=NA12044, which is equivalent to a more direct query /reads/search?readgroups=ERR162825,ERR162807,ERR162819,SRR393990,SRR393991.

Here SRP000547 is the SRA study for "Whole genome sequencing of (CEU) Utah residents with ancestry from Northern and Western Europe - CEPH - HapMap population". The five ERR/SRR* are the accessions for NA12044 reads. You can find the info in the header of the NA12044 BAM file.

In addition, a query /reads/search?collection=SRP000031&sample=NA12044 will also pull NA12044 sequenced by 1000g, but with this query, we will get the Pilot data instead. Note that NA12044 has been sequenced twice by 1000g. Without going through read groups, I would worry if the ReadSet in Google Genomics gives the data I want.

In this example, read group is the central element. I actually don't care what is a read set.

If we pretend that there isn't any recursion, what use case is not satisfied? If we have a specific example it might make things clearer.

The example above is also related to recursion. In SRA, SRP000547 is equivalent to PRJNA33831 (CEU low-cov sequencing), which is a sub-project of PRJNA59771 (all-population low-cov sequencing). Ideally, a user would like to get the right results also with /reads/search?collection=PRJNA59771&sample=NA12044 (the query basically says to retrieve NA12044 data from 1000g low-cov sequencing). A recursive structure is naturally occurring here.

I think we can live without a recursive structure, but having the feature may simplify the relationship between the many second-class concepts.

Anyway, back to the start of the comment, I think there are a lot we need to do to improve the end-user APIs. We should not limit ourselves to what we've already had.

dglazer commented 10 years ago

Heng, glad we're in sync that adding new query options to /reads/search is the right way forward. (Your example of an optional readgroups= parameter is actually already in flight in #26.) And we're absolutely in sync that we shouldn't be limited by the present -- the whole point of this repo is to move towards an implementation that works for the future.

There are now a bunch of pending pull requests, including #25 and #26, that surface RGs more visibly, and move us towards the blue+red mental model from this morning. I suggest we focus on getting those approved and committed (nudge nudge -- please +1 them if you agree and haven't already). Once those are in, we can work on the next steps.

richarddurbin commented 10 years ago

From my point of view this is an excellent example to talk about.

My take on how to handle the situation that Heng describes is that to get the reads you want you take two steps. First you define a local ReadSet object by something like

myReadSet = /readgroups/search?collection=SRP000547&sample=NA12044

then get the reads with

myReads = /reads/search?readGroups_in_ myReadSet.groups

My guess is that the syntax is not right for this, but this is a natural API. In other words, a ReadSet object is just a collection of ReadGroups on which you want to do analysis. Some of these might be stored in the database, with names or other properties that allow them to be retrieved, others can be created dynamically as here.

Richard

On 22 Apr 2014, at 20:42, Heng Li wrote:

The following lengthy comment is related but a little off topic. I don't know where to put them, so raise them here for now.

I expect that pull request will mostly consist of extra query options to /reads/search, since that's where a caller of the API actually says "here's what I want to analyze/browse". We'll see.

We should not confine ourselves to what Google has at present. In fact, as a user, I won't use the google APIs because a) readset_id is a hash string which is meaningless to me - I don't know what readset_id I should use; b) the scope of read set is not clearly defined - I don't know if the result is right for my purpose. I think the most useful end-user query is something like /reads/search?collection=SRP000547&sample=NA12044, which is equivalent to a more direct query /reads/search?readgroups=ERR162825,ERR162807,ERR162819,SRR393990,SRR393991.

Here SRP000547 is the SRA study for "Whole genome sequencing of (CEU) Utah residents with ancestry from Northern and Western Europe - CEPH - HapMap population". The five ERR/SRR* are the accessions for NA12044 reads. You can find the info in the header of the NA12044 BAM file.

In addition, a query /reads/search?collection=SRP000031&sample=NA12044 will also pull NA12044 sequenced by 1000g, but with this query, we will get the Pilot data instead. Note that NA12044 has been sequenced twice by 1000g. Without going through read groups, I would worry if the ReadSet in Google Genomics gives the data I want.

In this example, read group is the central element. I actually don't care what is a read set.

If we pretend that there isn't any recursion, what use case is not satisfied? If we have a specific example it might make things clearer.

The example above is also related to recursion. In SRA, SRP000547 is equivalent to PRJNA33831 (CEU low-cov sequencing), which is a sub-project of PRJNA59771 (all-population low-cov sequencing). Ideally, a user would like to get the right results also with /reads/search?collection=PRJNA59771&sample=NA12044 (the query basically says to retrieve NA12044 data from 1000g low-cov sequencing). A recursive structure is naturally occurring here.

I think we can live without a recursive structure, but having the feature may simplify the relationship between the many second-class concepts.

Anyway, back to the start of the comment, I think there are a lot we need to do to improve the end-user APIs. We should not limit ourselves to what we've already had.

— Reply to this email directly or view it on GitHub.

lh3 commented 10 years ago

@richarddurbin I agree that ReadSet may not be critical in Google's ReadStore design. However, it is a useful concept in case of SRA and ENA. As I understand, both SRA and ENA use file-based storage. They can easily extract data from a file, but are clumsy to combine data across different files. The NCBI implementation only provides access to one file. In this case, a concept of file or ReadSet becomes explicit.

dglazer commented 10 years ago

(I'd love to talk by voice/video about these points, and/or get some more perspectives into the conversation -- I have a feeling we mostly agree, and our differences are at least half misunderstandings. Until then, here are three more thoughts.)

A) Design Goals First -- this conversation is not about Google's ReadStore design. It's about the GA4GH Reads API, which Google intends (along with hopefully many others) to implement. I want the API to meet all of the following goals:

for users of genomic data (in science/research and medicine/health) -- provide useful functionality, so many of them advocate for their tool and data providers to support it
for builders of genomic tools -- be easy and natural to call, so many of them choose to consume data via it
for hosters of genomic data -- be reasonably easy to implement efficiently, so many of them choose to expose their data via it

B) Files Second -- I'm not concerned about files, directly, at all. (We actually ripped several references to files out of an earlier prototype of our API.) The only tiny exception is that I do care about being able to tell one RG from another, which is a subtle but important point.

Fine print, using a slightly contrived, but I believe representative of the real-world, example:

Suppose somebody imports a BAM file containing three RGs with IDs={1,2.3} and SN="control", and another three RGs with IDs={1,2,3} and SN="case". Now suppose somebody else, maybe from the same lab and maybe not, later imports a BAM file containing six different RGs with the same attributes.
- That may not be recommended practice, but I expect, especially in the growing world of cheap distributed sequencing used for a mix of public and private purposes, it will be a common occurrence. (Somewhat exacerbated by our previous agreement to auto-create a default RG for orphaned Reads.)
In that situation, I know it's important for callers of the API to be able to tell the four sets of RGs apart from each other, and I believe it's important for callers to know that the three RGs in each set should be processed as a unit.

In the current version of the AVRO schema, that 3+3+3+3 organization is handled by ReadSets. There may be other ways to handle it, such as auto-generating new RG attributes on import, but I believe any proposal has to handle it somehow.

C) Red and Blue Finally -- I'd like to revisit the mental model discussion, since it feels to me like we're not yet in sync.

In both the blue+red and the red-only models, all of the operations Richard describes -- e.g. "get all reads in collection=SRP000547&sample=NA12044)" -- are supported. (It's very possible I'm misunderstanding and that's not true -- let's talk about the gap if so.)
The difference is that the blue+red model (most recent picture in this thread) also makes it easy for callers to operate on preferred collections of RGs, which I believe is necessary for the corner case described above, and useful for many more common cases. (Again, if I'm misunderstanding, let's talk about the gap.)

The right implementation details depend on the model we want to support. If we're chasing different models, we'll end up with an 'ugly duckling' problem, where person 1's beautiful swan doesn't meet person 2's standards of elegance. I can learn to like either ducks or swans; I just don't want to end up with a chimera.

lh3 commented 10 years ago

Re files: Ideally, we should get rid of files for good, but practically, file-based implementations still need this concept. The following may be obvious but it would be good to speak it out: we are mostly (there are exceptions) creating a BAM file for read groups from one sample AND from one uniform study. Google is effectively creating ReadSets following the same convention. Perhaps it would be good to make this point explicit when we describe what is a ReadSet in our current schema (@richarddurbin might be define ReadSet differently).

Re ReadGroup ID: as @richarddurbin and I said in #25, ReadStore should generate read group IDs at least unique in this ReadStore instance (e.g. GOR012345). Ideally, ReadStores should sync read groups IDs like what GenBank/SRA, EMBL/ENA and DDBJ are doing for tens of years for all public biological sequences. This way, a read group ID can be globally unique across all ReadStore instances. All 1000g read groups are accessioned.

The only tiny exception is that I do care about being able to tell one RG from another, which is a subtle but important point.

This is not best achieved with ReadSet. This should be done with a concept like study or project or session. When a small lab submit a batch of BAMs, the lab ought to request a new study accession or to reuse an existing accession the lab requested before. A read group in study A is always different from another read group in study B even if they have the same RG:Z name in the submitted BAMs, no matter whether the BAMs are submitted by the same lab or not. EDIT: if a caller specifies the study accession and the sample name, it will get all the related read groups and process them together. (EDIT2: perhaps study is a synonymous to dataset in google's original design).

richarddurbin commented 10 years ago

I am sure the correct thing is to provide a mechanism to generate globally unique ReadGroup ids, as NCBI and EBI do as Heng says. There are many precedents for this in the internet world, from LSIDs made from URLs to uuids.

Then I think you should drop the files when you import them - they are just containers for tansferring read groups. I point out that the primary file accessions to NCBI and EBI from Sanger do not correspond to aggregated data for samples, but rather to Illumina lanes, which can contain several read groups because of multiplexing, and all the read groups for a sample will be split across several accessions. I am sure the same is true for other centres, because data can take time to accumulate in the ENA for a single sample. So the idea that a submitted file is the standard unit of analysis is just wrong.

Richard

On 23 Apr 2014, at 16:59, Heng Li notifications@github.com wrote:

Re files: Ideally, we should get rid of files for good, but practically, file-based implementations still need this concept. The following may be obvious but it would be good to speak it out: we are mostly (there are exceptions) creating a BAM file for read groups from one sample AND from one uniform study. Google is effectively creating ReadSets following the same convention. Perhaps it would be good to make this point explicit when we describe what is a ReadSet in our current schema (@richarddurbin might be define ReadSet differently).

Re ReadGroup ID: as @richarddurbin and I said in #25, ReadStore should generate read group IDs at least unique in this ReadStore instance (e.g. GOR012345). Ideally, ReadStores should sync read groups IDs like what GenBank/SRA, EMBL/ENA and DDBJ are doing for tens of years for all public biological sequences. This way, a read group ID can be globally unique across all ReadStore instances. All 1000g read groups are accessioned.

The only tiny exception is that I do care about being able to tell one RG from another, which is a subtle but important point.

This is not best achieved with ReadSet. This should be done with a concept like study or project or session. When a small lab submit a batch of BAMs, the lab ought to request a new study accession or to reuse an existing accession the lab requested before. A read group in study A is always different from another read group in study B even if they have the same RG:Z name in the submitted BAMs, no matter whether the BAMs are submitted by the same lab or not.

— Reply to this email directly or view it on GitHub.

fnothaft commented 10 years ago

I am +1 at using read sets as the unique identifier for a dataset.

The only tiny exception is that I do care about being able to tell one RG from another, which is a subtle but important point.

This is not best achieved with ReadSet. This should be done with a concept like study or project or session. When a small lab submit a batch of BAMs, the lab ought to request a new study accession or to reuse an existing accession the lab requested before. A read group in study A is always different from another read group in study B even if they have the same RG:Z name in the submitted BAMs, no matter whether the BAMs are submitted by the same lab or not. EDIT: if a caller specifies the study accession and the sample name, it will get all the related read groups and process them together. (EDIT2: perhaps study is a synonymous to dataset in google's original design).

Unless I misunderstand this approach, what is the behavior of the read group ID when we apply different processing stages to the data? E.g., if we sequence individual A on machine X, lane N, we get a unique identifier per read group in the sample. If we then align with BWA vs. bowtie vs. , does each read group still have the same ID after alignment? The provenance of the sample/sequencing instrument haven't changed, so I don't think the read group ID should change, but the data is fundamentally different.

I think it makes the most sense if we use the read set ID as a unit for grouping a set of multiple read groups across 1+ samples that were analyzed together. In:

We should not confine ourselves to what Google has at present. In fact, as a user, I won't use the google APIs because a) readset_id is a hash string which is meaningless to me - I don't know what readset_id I should use; b) the scope of read set is not clearly defined - I don't know if the result is right for my purpose. I think the most useful end-user query is something like /reads/search?collection=SRP000547&sample=NA12044, which is equivalent to a more direct query /reads/search?readgroups=ERR162825,ERR162807,ERR162819,SRR393990,SRR393991.

It was suggested that a collection ID + sample ID would make more sense. However, I would assert that this is metadata that probably shouldn't be stored in the read store (instead, large scale sequencing studies can easily map collection ID + sample ID --> read set ID), and that this identification scheme is less general. E.g., consider the case of a hospital that is sequencing for clinical use:

There's not necessarily a well defined collection ID. What is the collection ID for a patient who shows up with a novel, previously unseen condition?
Sample ID may not be sufficient to identify a processed set of reads/alignments. In the case where multiple sequence analysis pipelines are used in an ensemble calling approach, a unique collection ID and unique sample ID may still map to multiple processed sets of reads.

However, if a unique read set ID is provided per dataset, these issues can be addressed as long as metadata is maintained by whomever is generating the sequence data itself.

lh3 commented 10 years ago

I am +1 at using read sets as the unique identifier for a dataset.

To clarify, in google's API and in the current schema, a "dataset" is a specific concept closer to a study or a collection in my comments. Disregarding this caveat, I am not sure what a data set you are referring to. A lane of data, data from a sample in one project, data from a sample across multiple projects or a whole project can all be considered to be a data set in the generic sense. A problem with read set is its scope is not always clear.

Unless I misunderstand this approach, what is the behavior of the read group ID when we apply different processing stages to the data? E.g., if we sequence individual A on machine X, lane N, we get a unique identifier per read group in the sample. If we then align with BWA vs. bowtie vs. , does each read group still have the same ID after alignment? The provenance of the sample/sequencing instrument haven't changed, so I don't think the read group ID should change, but the data is fundamentally different.

A good catch on the same data processed in different ways. The solution is actually simple. ReadGroup is not defined as a combination of sample+sequencing unit. In your case, we should keep the two alignments in two different read groups from two different collections. Users can pull all the BAMs produced by one mapper by specifying the right collection.

It was suggested that a collection ID + sample ID would make more sense. However, I would assert that this is metadata that probably shouldn't be stored in the read store (instead, large scale sequencing studies can easily map collection ID + sample ID --> read set ID)

Say I want to find de novo mutations from the NA12878 trio which has been sequenced several times with varying data quality. I have to use the data sequenced together. It is trivial to achieve with collection+sample. What is your solution if the ReadStore only keeps read sets with unique IDs?

There's not necessarily a well defined collection ID. What is the collection ID for a patient who shows up with a novel, previously unseen condition?

This is also simple. It is up to the data submitters to decide whether they want to put the new patient in an existing collection or in a new collection. Either is fine. It should be easy to update the collection/dataset info later in the database if the submitters change their mind.

Sample ID may not be sufficient to identify a processed set of reads/alignments. In the case where multiple sequence analysis pipelines are used in an ensemble calling approach, a unique collection ID and unique sample ID may still map to multiple processed sets of reads.

This has been addressed by my comment above: the submitters should use multiple collection IDs. In a reviewed submission system, they should be asked to make changes. In an automated system where the flaw is not identified, we need to add more searching conditions (e.g. matching PG) to pinpoint the right data. We still need these additional conditions to get the right read sets as we don't magically know the right read set IDs.

However, if a unique read set ID is provided per dataset, these issues can be addressed

Anyway, I said personally I am more or less in favor of having a concept of ReadSet. If we decide to keep ReadSet, the ReadSet ID should be unique and query-able.

dglazer commented 10 years ago

@lh3, thank you very much for starting this conversation. I've tried to consolidate all the ideas I've heard, in this thread and elsewhere, into a new summary proposal at #32, so am closing this issue. (I included a back-reference, so anyone interested can see the full history, but new folks don't have to wade through it.)