ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Group of Individuals/Cohorts #724

Open david4096 opened 8 years ago

david4096 commented 8 years ago

The data model should provide a way to group individuals. It currently groups Samples and Individuals, but the ability to make statements about groups should be a first-class part of the data model.

@mbaudis @mcourtot

david4096 commented 7 years ago

Should individuals be able to be assigned to multiple cohorts? Proposal: add a Cohort message and a list of cohort_ids to the Individual message.
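A minimal sketch of that proposal, written as Python dataclasses rather than as schema message definitions. Only the `cohort_ids` list comes from the proposal; the other field names (`id`, `name`, `description`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cohort:
    # Hypothetical fields: the proposal only says that a cohort message exists.
    id: str
    name: str = ""
    description: str = ""


@dataclass
class Individual:
    id: str
    # Proposed field: one individual may belong to several cohorts.
    cohort_ids: List[str] = field(default_factory=list)


smokers = Cohort(id="cohort-1", name="smokers")
stage_two = Cohort(id="cohort-2", name="stage II")

# The same individual is assigned to both cohorts.
patient = Individual(id="ind-1", cohort_ids=[smokers.id, stage_two.id])
```

Here the cohort carries no member list at all; membership lives entirely on the individual.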

mbaudis commented 7 years ago

Yes; but shouldn't the cohort reference its members?

david4096 commented 7 years ago

Yes and in the data it will, but in terms of interchange we can probably satisfy membership through queries of individuals by cohort ID in a method similar to the other APIs.

mcourtot commented 7 years ago

Hi @david4096 - could you please expand on what you mean by

we can probably satisfy membership through queries of individuals by cohort ID in a method similar to the other APIs.

Thanks!

david4096 commented 7 years ago

Hi @mcourtot, of course, thank you for asking! The idea is that a cohort message doesn't say which individuals are members, since that message might become very large. For example, if a cohort has 65k individuals and each individual ID were added to the cohort message, that message would be very difficult to work with. Instead, we can add a "cohort_ids" field to the individual message, so that an individual can be placed in multiple cohorts in a dataset.

Then, when you want to know which individuals are in a given cohort, you send a SearchIndividualsRequest with that cohort ID set and get back only those individuals.

In this way, the access pattern would be: "search for cohorts matching some criteria", then "search for individuals matching cohort ID". From the results we can reconstruct which individuals were part of a cohort.
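The two-step access pattern can be sketched in Python over in-memory lists. The functions `search_cohorts` and `search_individuals` are hypothetical stand-ins for the search endpoints, and the sample data is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cohort:
    id: str
    description: str


@dataclass
class Individual:
    id: str
    cohort_ids: List[str] = field(default_factory=list)


def search_cohorts(cohorts: List[Cohort],
                   predicate: Callable[[Cohort], bool]) -> List[Cohort]:
    """Step 1 (hypothetical endpoint): cohorts matching some criteria."""
    return [c for c in cohorts if predicate(c)]


def search_individuals(individuals: List[Individual],
                       cohort_id: str) -> List[Individual]:
    """Step 2 (hypothetical endpoint): individuals carrying the cohort ID."""
    return [i for i in individuals if cohort_id in i.cohort_ids]


COHORTS = [Cohort("c1", "sequenced with WES"), Cohort("c2", "sequenced with WGS")]
INDIVIDUALS = [
    Individual("i1", ["c1"]),
    Individual("i2", ["c2"]),
    Individual("i3", ["c1", "c2"]),
]

matching = search_cohorts(COHORTS, lambda c: "WES" in c.description)

# Reconstruct the membership of each matching cohort from the individuals.
members: Dict[str, List[str]] = {
    c.id: [i.id for i in search_individuals(INDIVIDUALS, c.id)]
    for c in matching
}
```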

mbaudis commented 7 years ago

@david4096 But you are solving a technical obstacle by introducing a conceptual xxx.

I have always made a point (well early MTT etc.) that groupings of records should in principle be treated dynamically; i.e. you go via a stored set, or via a query output.

A "cohort" therefore may be all callsets from breast cancer samples sequenced using some WES technique, based on biosamples from stage II tumors of pre-menopausal (proxy age ...) female smokers.

This could be a curated cohort, or a query with changing content (remember: GA4GH is about "federated" data).

Now, technically one could do it both ways: even a query-based cohort could first associate a record with a cohort identifier, and proceed from there.
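The difference between a stored set and a query output can be illustrated with a small Python sketch. The record fields (`diagnosis`, `stage`, `technique`), the helper names, and the sample records are all assumptions chosen to mirror the example above, not part of any schema.

```python
from typing import Callable, Dict, List, Set

Record = Dict[str, str]

RECORDS: List[Record] = [
    {"id": "cs-1", "diagnosis": "breast cancer", "stage": "II", "technique": "WES"},
    {"id": "cs-2", "diagnosis": "breast cancer", "stage": "III", "technique": "WES"},
    {"id": "cs-3", "diagnosis": "breast cancer", "stage": "II", "technique": "WGS"},
]


def stored_cohort(member_ids: Set[str], records: List[Record]) -> List[str]:
    """Curated cohort: membership is a fixed, stored set of record IDs."""
    return [r["id"] for r in records if r["id"] in member_ids]


def query_cohort(criteria: Callable[[Record], bool],
                 records: List[Record]) -> List[str]:
    """Dynamic cohort: membership is whatever currently matches the query."""
    return [r["id"] for r in records if criteria(r)]


def is_stage_two_wes(record: Record) -> bool:
    return (record["diagnosis"] == "breast cancer"
            and record["stage"] == "II"
            and record["technique"] == "WES")


curated = stored_cohort({"cs-1", "cs-2"}, RECORDS)
before = query_cohort(is_stage_two_wes, RECORDS)

# A new matching record arrives: the query cohort grows, the curated one does not.
RECORDS.append(
    {"id": "cs-4", "diagnosis": "breast cancer", "stage": "II", "technique": "WES"}
)
after = query_cohort(is_stage_two_wes, RECORDS)
curated_after = stored_cohort({"cs-1", "cs-2"}, RECORDS)

# "Both ways": a query result can be frozen by tagging records with a cohort ID.
tagged = {record_id: "cohort-q1" for record_id in after}
```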

Generally, IMO the preference for a schema solution based on "the message could get large" is flawed; workarounds can happen at the implementation stage.

But I may be wrong, of course ...

david4096 commented 7 years ago

Thanks @mbaudis ! I am happy to revoke my premature optimization of message size. The Cohort message itself I consider to be the metadata like "samples sequenced using this technique." Some combination of queries and fields can handle the referential integrity, but if we can move ahead with a simpler change I am all for it.

I think if we're ok with what might become a large Cohort message (if there are 10k samples it might take a moment to download), we can avoid any other API changes. The membership of a biosample or individual in a given cohort will be recorded in the cohort message, which can then be used to construct directed queries against the other biosamples or individuals endpoints. We then just need to provide easy ways of generating valid cohort messages for a given dataset.
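As a rough sketch of this variant, the cohort message itself carries the member identifiers, and a client turns them into per-record lookups. The field names `individual_ids` and `biosample_ids`, the endpoint labels, and the `directed_queries` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cohort:
    id: str
    description: str = ""
    # Hypothetical fields: membership is stored in the cohort message itself.
    individual_ids: List[str] = field(default_factory=list)
    biosample_ids: List[str] = field(default_factory=list)


def directed_queries(cohort: Cohort) -> List[Dict[str, str]]:
    """Build one lookup per member against the matching endpoint."""
    queries = [{"endpoint": "individuals", "id": i} for i in cohort.individual_ids]
    queries += [{"endpoint": "biosamples", "id": b} for b in cohort.biosample_ids]
    return queries


cohort = Cohort(
    id="c1",
    description="samples sequenced using this technique",
    individual_ids=["ind-1", "ind-2"],
    biosample_ids=["bio-9"],
)
queries = directed_queries(cohort)
```

With this shape the Individual and Biosample messages stay unchanged; only the new cohort document has to be generated for a dataset.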

I would like to consider the aspirations of federating queries, but would also like to constrain the discussion and have placed a stub for federation here. You can read some of my thoughts about getting there in this issue about searching by external identifiers.

The alternative is to follow the idiom of the references API and make the connection between cohort messages and biosample or individual messages "loose". Implementing over this type of schema is challenging, as it leaves a lot of room for interpretation to the implementor. I think the simplicity of using a single document will make it easy to implement. If we face issues with very large documents, we can provide an API for listing the underlying individual and biosample identifiers.
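If very large documents did become a problem, a listing API could hand out the identifiers in pages. The sketch below assumes a `page_size` / `page_token` style of paging and a hypothetical `list_member_ids` function; the token here is simply the next start offset, which is an illustrative choice only.

```python
from typing import List, Optional, Tuple


def list_member_ids(member_ids: List[str],
                    page_size: int,
                    page_token: Optional[str] = None
                    ) -> Tuple[List[str], Optional[str]]:
    """Return one page of identifiers and a token for the next page, if any."""
    start = int(page_token) if page_token else 0
    end = start + page_size
    page = member_ids[start:end]
    # No token is returned once the last identifier has been served.
    next_token = str(end) if end < len(member_ids) else None
    return page, next_token


all_ids = [f"ind-{n}" for n in range(5)]

# Client side: keep requesting pages until no further token comes back.
collected: List[str] = []
token: Optional[str] = None
while True:
    page, token = list_member_ids(all_ids, page_size=2, page_token=token)
    collected.extend(page)
    if token is None:
        break
```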