Open david4096 opened 8 years ago
Under the SRA/ENA model, this is addressed by the experiment/run/analysis metadata. It is only implicit in the GA4GH API because the metadata model is incomplete.
Both samples and readgroups can have variants called multiple times and in multiple ways. One has to have the provenance to correctly answer these questions.
Suggested reading: the SRA handbook (http://www.ncbi.nlm.nih.gov/books/NBK47528/) and the GDC data model (https://gdc.nci.nih.gov/developers/gdc-data-model).
But please be aware that this type of metadata isn't treated by the MTT ("everything but the sequence"); though we'll jump in upon specific requests ...
See also #390 for earlier thoughts on a data model for tracking object provenance.
Well, the good news is that there is already a way that has been kindly prepared for us by Google. At the beginning of July they published a very nice paper called Goods: Organizing Google’s Datasets. We can implement such a system - as our datasets would also become quite varied and large - and they have already worked out the issues with tracking the provenance, and inferred metadata.
Since our metadata would be provided and stored with different types of objects - through the nice work done by the MTT team - the first part of the search capability is basically a given. Then comes the second part of collecting and building inferred metadata through metadata inference, which can be performed via the creation of transitive closure graphs on the sets loaded in the different repositories. This automatically provides us with the provenance information as well. Thus the ability to inspect connected BAMs with associated CallSets becomes a trivial query.
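As a rough illustration of the transitive-closure idea, here is a minimal sketch in Python. The `derived_from` mapping and all dataset names are hypothetical; nothing here reflects how Goods or any GA4GH server actually stores lineage.

```python
from collections import deque

def upstream_closure(derived_from, start):
    """Return every dataset reachable from `start` by following
    'derived from' edges, i.e. its full recorded provenance."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in derived_from.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# Hypothetical lineage: a call set made from two BAMs, each made from a FASTQ.
derived_from = {
    "callset-1": ["bam-a", "bam-b"],
    "bam-a": ["fastq-a"],
    "bam-b": ["fastq-b"],
}

ancestors = upstream_closure(derived_from, "callset-1")
bams = sorted(d for d in ancestors if d.startswith("bam-"))
print(bams)
```

With the closure in hand, "which BAMs are connected to this CallSet" reduces to filtering the ancestor set by type.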
They store the datasets in a catalog, which is searchable via their Google Dataset Search (Goods) API. We can choose a name that works naturally for our own implementation. Below is a figure illustrating how the catalog is organized:
In order to connect datasets together, we can even provide a query system similar to the Google Knowledge Graph, in order to determine connected sets of data, and/or those that are processed as results through similar pipelines with overlapping functional context. Below are two links to give you an idea of how knowledge graphs would work:
For example, we can utilize semi-lattices representing sets of connected data, by collapsing using subsets of annotated information to work with groups of equivalent versions, and/or other matching criteria as follows:
We can even propagate metadata shared among sets of data in order to consolidate information, allowing us to formulate more direct queries:
I hope these ideas help and spark new ones. That is the drive behind great projects like FireCloud and ISB-CGC, which enable us to pave paths towards handling collections of millions and billions of data and results.
Paul
Thanks for all of the input! I am trying to get to putative changes to the schemas that allow for this tracking because I believe it is not in place.
Currently, it is not possible to state which variant set came from which read group sets. My suggestion is to represent these data by making the following change:
Add a read group ID list to the Call Set message. Each Call Set comes from a specific set of read groups. That way, one can go back and view the pileup at a given position, for example.
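To make the proposal concrete, here is a minimal Python sketch of a call set record carrying a list of read group IDs. The real GA4GH schemas are not Python, and the field name `read_group_ids` is only the suggestion from this thread, not an existing field.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CallSet:
    id: str
    name: str
    # Proposed (hypothetical) field: the read groups the calls were made from.
    read_group_ids: List[str] = field(default_factory=list)

call_set = CallSet(id="cs-1", name="example", read_group_ids=["rg-1", "rg-2"])

# A client could use these IDs to fetch reads and view the pileup at a position.
print(call_set.read_group_ids)
```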
For the data I have observed, one can derive these relationships; however, this change would make them explicit.
@diekhans
> Both samples and readgroups can have variants called multiple times and in multiple ways. One has to have the provenance to correctly answer these questions.
In the GA4GH metadata model a callset is given a biosample ID. However, if I understand correctly, this is not enough to identify the specific readgroups a callset came from. Would adding a list of Read Group Ids to a callset message cover this case?
I believe associating with readgroups is the missing piece.
I'd like to close this by adding a list of `read_group_ids` to the `CallSet` message. The problem is that we have tagged callsets with a biosample ID, which, if I understand this thread, is not entirely correct.
A ReadGroup is always from a single biosample ID, but a callset can be made from multiple read groups. That means that it is possible to construct a callset that is for multiple samples. For 1kgenomes, this may seem like an odd case, but I think we may have made an incorrect assumption of tagging callsets with a single biosample ID.
It seems to me the correct access pattern is to provide filtering of readgroups by their biosample ID, and then filtering callsets by their read group IDs. This avoids the scenario of improperly labeling a callset as being from a single sample, when in fact it is from multiple.
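The two-step access pattern could be sketched as follows. This is an in-memory illustration only; the record types and the helper names `read_groups_for_biosample` and `call_sets_for_read_groups` are assumptions, not existing API methods.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReadGroup:
    id: str
    biosample_id: str  # a read group comes from a single biosample

@dataclass
class CallSet:
    id: str
    read_group_ids: List[str]  # hypothetical field proposed in this thread

def read_groups_for_biosample(read_groups, biosample_id):
    """Step 1: filter read groups by their biosample ID."""
    return [rg for rg in read_groups if rg.biosample_id == biosample_id]

def call_sets_for_read_groups(call_sets, read_group_ids):
    """Step 2: keep call sets built from at least one of the read groups."""
    wanted = set(read_group_ids)
    return [cs for cs in call_sets if wanted.intersection(cs.read_group_ids)]

read_groups = [ReadGroup("rg-1", "sample-A"), ReadGroup("rg-2", "sample-B")]
call_sets = [
    CallSet("cs-single", ["rg-1"]),
    CallSet("cs-joint", ["rg-1", "rg-2"]),  # made from two samples
]

rgs = read_groups_for_biosample(read_groups, "sample-B")
hits = call_sets_for_read_groups(call_sets, [rg.id for rg in rgs])
print([cs.id for cs in hits])
```

Here the joint call set is found through either sample without ever being labeled as belonging to just one of them.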
The problem is that, in practice, much useful interchange and analysis can be done without the BAM. That means that we need to provide an access pattern for when someone has metadata about samples, but no alignment data.
If the biosample ID were a repeated field in CallSets then we could support the case when metadata is available about a call, but no read alignment. In the case where multiple samples were used to make a call, both biosample IDs would be provided.
To close this issue I suggest we do the following:

- Make `biosample_ids` on CallSets a repeated field
- Add `read_group_ids` on the CallSet message

It would be nice to have a search method for callsets to return any callsets coming from a list of provided read group IDs.
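A search along these lines might look like the sketch below. It is not an existing GA4GH endpoint: the function `search_call_sets`, its parameters, and the dictionary layout are all hypothetical. It also covers the case where only sample metadata is shared and no alignments are available.

```python
def search_call_sets(call_sets, read_group_ids=None, biosample_ids=None):
    """Return call sets matching every filter that is provided.
    A filter matches when the call set shares at least one ID with it."""
    results = []
    for cs in call_sets:
        if read_group_ids is not None and not (
            set(read_group_ids) & set(cs.get("read_group_ids", []))
        ):
            continue
        if biosample_ids is not None and not (
            set(biosample_ids) & set(cs.get("biosample_ids", []))
        ):
            continue
        results.append(cs)
    return results

call_sets = [
    {"id": "cs-1", "biosample_ids": ["sample-A"], "read_group_ids": ["rg-1"]},
    # Calls made from two samples, shared without any alignment data.
    {"id": "cs-2", "biosample_ids": ["sample-A", "sample-B"], "read_group_ids": []},
]

by_reads = search_call_sets(call_sets, read_group_ids=["rg-1"])
by_sample = search_call_sets(call_sets, biosample_ids=["sample-B"])
print([cs["id"] for cs in by_reads], [cs["id"] for cs in by_sample])
```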
And I might as well ask, @mbaudis @diekhans: can a read group come from multiple samples? If so, I can imagine that we should make the `biosample_id` on readgroups a repeated field. This would allow us to model spiking a sample for sequencing nicely.
The relationship between variant sets and read group sets should be explicit. The question "which BAM was used for which callset?" is currently answered only implicitly through the variant set metadata. However, the GA4GH has the opportunity to improve the situation by making explicit the relationships between the Read Group Sets used for an analysis pipeline and the resulting Variant Set.
This could be carried out in a few ways. Currently, one can construct a request to determine if a callset came from the same sample as a readgroup via their `bio_sample_id`s. However, the same sample can appear in multiple readgroups. This leaves the question "which BAM or set of RG tags were used for constructing this callset?" difficult to answer directly.

One might consider adding a `read_group_ids` field to a CallSet record, making that relationship explicit. This would allow individual RG tags across BAMs to be reassembled as needed to provide the underlying data for each call. It would then be trivial to construct a query that asks which BAMs were used when making this call.

An alternative would be to provide a map of `call_set_id:read_group_id` in the variant set metadata.

I understand that for phased calling these relationships can become complicated. Any insight into the other basic requirements is valued!
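The map alternative could be sketched like this. The metadata key `call_set_read_groups`, the `read_group_to_bam` lookup, and the file names are all hypothetical, chosen only to show how such a map would answer the "which BAMs" question.

```python
# Hypothetical variant set metadata carrying a call_set_id -> read_group_ids map.
variant_set_metadata = {
    "call_set_read_groups": {
        "cs-1": ["rg-1", "rg-2"],
        "cs-2": ["rg-3"],
    }
}

# Hypothetical lookup from read group ID to the BAM that contains it.
read_group_to_bam = {"rg-1": "a.bam", "rg-2": "b.bam", "rg-3": "b.bam"}

def bams_for_call_set(metadata, rg_to_bam, call_set_id):
    """Return the sorted, de-duplicated BAMs behind one call set."""
    rg_ids = metadata["call_set_read_groups"].get(call_set_id, [])
    return sorted({rg_to_bam[rg] for rg in rg_ids if rg in rg_to_bam})

print(bams_for_call_set(variant_set_metadata, read_group_to_bam, "cs-1"))
```

Compared with a field on each CallSet, the map keeps the relationship in one place per variant set, at the cost of a second lookup for clients.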