Genentech / FacileDataSet

A fluent API for accessing multi-assay high-throughput genomics data.
MIT License

A story of a data model, multi-assays, and a mixed-assay table #15

Open VRouilly opened 6 years ago

VRouilly commented 6 years ago

It is an old topic ... with very many previous discussions, but I just wanted to bring it back to the discussion table.

My understanding is that the current data model is built around the central concept of sample_id. Covariates are attached to sample_ids, and measurements from assays are attached to sample_ids.

As FacileDataSet can handle multi-assays, a single "biological entity", e.g. a patient, can have measurements available from different assays, with potentially different sample_ids. And one will eventually be interested in building a joint table with features coming from different assays.

A notion of parent/child relationship between samples is already supported thanks to an existing sqlite table called 'sample_info', but the machinery to browse this tree of relationships still needs to be put in place.
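To make the "browse this tree" idea concrete, here is a minimal sketch of climbing parent links with a recursive SQLite query. It is only an illustration under assumptions: the column names `dataset`, `sample_id` and `parent_id` are assumed for the `sample_info` table, and `find_root` is a hypothetical helper, not part of the FacileDataSet API.

```python
import sqlite3

# Toy database. The column layout of sample_info is an assumption.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sample_info (dataset TEXT, sample_id TEXT, parent_id TEXT);
INSERT INTO sample_info VALUES
  ('ds', 'P1', NULL),            -- patient: no parent, so it is a root
  ('ds', 'rna_1', 'P1'),         -- RNA sample taken from the patient
  ('ds', 'aliquot_1', 'rna_1');  -- child of the RNA sample
""")

def find_root(con, dataset, sample_id):
    """Hypothetical helper: follow parent_id links up to the root sample."""
    row = con.execute(
        """
        WITH RECURSIVE lineage(sample_id, parent_id) AS (
            SELECT sample_id, parent_id
              FROM sample_info
             WHERE dataset = :ds AND sample_id = :sid
            UNION  -- UNION (not UNION ALL) stops the walk if links form a cycle
            SELECT s.sample_id, s.parent_id
              FROM sample_info AS s
              JOIN lineage AS l ON s.sample_id = l.parent_id
             WHERE s.dataset = :ds
        )
        SELECT sample_id FROM lineage WHERE parent_id IS NULL
        """,
        {"ds": dataset, "sid": sample_id},
    ).fetchone()
    return row[0] if row else None

root = find_root(con, "ds", "aliquot_1")
```

Doing the climb inside SQLite keeps it to one round trip per sample, whatever the depth of the tree.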

A use case challenging the current data model:

How can we allow a query on the FacileDataSet to return a table where each row is a patient, and the columns are a combination of the patient covariates from sample_A and the measured variables from sample_B and sample_C?
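As a rough sketch of what such a query could produce, the snippet below assembles one row per patient from a two-level hierarchy. Everything beyond the table names `sample_info` and `sample_covariate` is an assumption made for illustration: the column layouts, the `measurement` table (a stand-in for wherever assay values actually live), and the `patient_table` helper are all hypothetical.

```python
import sqlite3
from collections import defaultdict

con = sqlite3.connect(":memory:")
con.executescript("""
-- Column layouts are assumptions; 'measurement' is a hypothetical stand-in
-- for the real assay storage.
CREATE TABLE sample_info (dataset TEXT, sample_id TEXT, parent_id TEXT);
CREATE TABLE sample_covariate (dataset TEXT, sample_id TEXT,
                               variable TEXT, value TEXT);
CREATE TABLE measurement (dataset TEXT, sample_id TEXT, assay TEXT,
                          feature TEXT, value REAL);
INSERT INTO sample_info VALUES
  ('ds', 'sample_A', NULL),
  ('ds', 'sample_B', 'sample_A'),
  ('ds', 'sample_C', 'sample_A');
INSERT INTO sample_covariate VALUES
  ('ds', 'sample_A', 'sex', 'F'),
  ('ds', 'sample_A', 'age', '54');
INSERT INTO measurement VALUES
  ('ds', 'sample_B', 'rnaseq', 'TP53', 7.2),
  ('ds', 'sample_C', 'cnv', 'TP53', 1.0);
""")

def patient_table(con, dataset):
    """Hypothetical helper: one dict per root sample (patient)."""
    rows = defaultdict(dict)
    # Covariates attached directly to root samples (no parent).
    covariates = con.execute(
        """
        SELECT c.sample_id, c.variable, c.value
          FROM sample_covariate AS c
          JOIN sample_info AS s
            ON s.dataset = c.dataset AND s.sample_id = c.sample_id
         WHERE c.dataset = ? AND s.parent_id IS NULL
        """,
        (dataset,),
    )
    for root, variable, value in covariates:
        rows[root][variable] = value
    # Measurements from child samples, filed under their parent.
    measurements = con.execute(
        """
        SELECT s.parent_id, m.assay, m.feature, m.value
          FROM measurement AS m
          JOIN sample_info AS s
            ON s.dataset = m.dataset AND s.sample_id = m.sample_id
         WHERE m.dataset = ? AND s.parent_id IS NOT NULL
        """,
        (dataset,),
    )
    for root, assay, feature, value in measurements:
        rows[root][f"{assay}:{feature}"] = value
    return dict(rows)

table = patient_table(con, "ds")
```

Prefixing each measured feature with its assay name keeps columns from different assays apart when they share a feature identifier. The sketch only handles direct children of the root; deeper trees would need the parent links to be followed further up.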

Am I missing something from the way we framed the problem in the past ?

lianos commented 6 years ago

@VRouilly: Just want to say that although we've burned many hours talking about this in the bldg 14 break room, your framing of the problem in this way makes it a bit less daunting than I found myself thinking it was back then.

Your suggestion (I think) to explicitly exploit the fact that we can have dataset,sample_id tuples in the sample_covariate table that aren't tied to any assay and that can act as "the root" seems to make this more tractable again (is that what you're saying)?

I'm, of course, saying this w/o really having tried to sit down and chew this idea much more, but I'm just telling you how I'm feeling at this point :-)

Maybe we specify a "protected" type value to use in the sample_covariate table that signifies this covariate is being tied to a particular level of this "covariate inheritance" hierarchy ...

Or maybe the right "hack" is to rather add a type (or similar) column in the sample_info table, where one such value is used to explicitly state that a particular dataset,sample_id tuple is meant to be a "meta" sample, whose sole purpose is to collect covariate information that is meant to propagate "down" the sample hierarchy (i.e. "root" would be the person, you can use something else to specify "visit", etc.)

You know?
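A minimal sketch of the second "hack", for illustration only. It assumes a `type` column has been added to `sample_info` as floated above, assumes the column layouts of both tables, and uses made-up level names (`root`, `visit`, `assay_sample`); `inherited_covariates` is a hypothetical helper, not an existing function.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- The 'type' column and its values are the proposal being sketched,
-- not an existing part of the schema.
CREATE TABLE sample_info (dataset TEXT, sample_id TEXT,
                          parent_id TEXT, type TEXT);
CREATE TABLE sample_covariate (dataset TEXT, sample_id TEXT,
                               variable TEXT, value TEXT);
INSERT INTO sample_info VALUES
  ('ds', 'P1', NULL, 'root'),
  ('ds', 'P1_V2', 'P1', 'visit'),
  ('ds', 'rna_7', 'P1_V2', 'assay_sample');
INSERT INTO sample_covariate VALUES
  ('ds', 'P1', 'sex', 'F'),
  ('ds', 'P1_V2', 'visit_day', '28'),
  ('ds', 'rna_7', 'qc_flag', 'pass');
""")

def inherited_covariates(con, dataset, sample_id):
    """Hypothetical helper: covariates of a sample plus those of its ancestors.

    Returns {variable: (value, level)}, where level is the 'type' of the
    sample the covariate was attached to. The nearest level wins.
    """
    merged = {}
    seen = set()
    current = sample_id
    while current is not None and current not in seen:
        seen.add(current)  # guards against accidental cycles in parent_id
        info = con.execute(
            "SELECT parent_id, type FROM sample_info "
            "WHERE dataset = ? AND sample_id = ?",
            (dataset, current),
        ).fetchone()
        level = info[1] if info else None
        covariates = con.execute(
            "SELECT variable, value FROM sample_covariate "
            "WHERE dataset = ? AND sample_id = ?",
            (dataset, current),
        )
        for variable, value in covariates:
            # setdefault keeps the value found closest to the queried sample.
            merged.setdefault(variable, (value, level))
        current = info[0] if info else None
    return merged

covs = inherited_covariates(con, "ds", "rna_7")
```

Letting the nearest level win means a sample-specific value can override one inherited from the person or the visit; whether that is the right precedence is a design decision this sketch simply assumes.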

VRouilly commented 6 years ago

Reading your reply @lianos , it brings back good memories from our brainstorming sessions.

I agree that there is this notion of 'root' which should relate to the biological unit (animal, cell line, patient). What I am not quite sure about is whether we need to accommodate more than 2 levels (meaning more than just root and related samples).

Looking at the sampleMap structure from MAE, they seem happy with only 2 levels. However, it is not clear to me how they handle the notion of time, like a patient visit or an experimental timepoint. Maybe measurements from different visits get stored in different assays (RNAseq_V1, RNAseq_V2).

Also, I quite like the idea of having the ability to attach pData either at the level of the root, or at the level of the samples when it makes more sense (for example a QC flag on a particular sample of an assay, or if the sample comes from normal/tumour tissue). It seems that the sampleMap approach does not allow this type of aggregation of pData from the 2 level graph.

In the end, are we OK with only having 2 levels, or do we think we need to handle a deeper tree of annotations? Are we in a position to define the type of "information hierarchy" that FacileDataSet queries should support? It could help us to list these queries. Something like:

here is a first go at defining a hierarchy we could aim at supporting

Is it too ambitious? Is it not detailed/flexible enough? I'm getting confused again, better to take a break now ;)

phaverty commented 6 years ago

I need to think about this more, but I'm guessing that at a minimum the key for an assay column has to be the tuple (assay, sample, parent). There is some redundancy there, as parent is given by sample, but doing that join everywhere would be too hard. We'd also get an 80% solution by leaving it up to the dataset creator to roll everything up to patient. gCellGenomics does that, but we don't need replicates for differential expression. When we meet up to talk about timelines, this will be one of the important issues. For me, it's behind a few other key things, but I'm open to discussion.
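To illustrate the two ideas in this comment, here is a small sketch of a denormalized (assay, sample, parent) key and a roll-up to the parent level. `ColumnKey`, `roll_up` and the toy values are hypothetical, and averaging replicates is only one possible roll-up rule, chosen here for simplicity.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple

class ColumnKey(NamedTuple):
    """Hypothetical key for one assay column: parent is stored redundantly."""
    assay: str
    sample: str
    parent: str

# Toy values for a single feature, keyed by (assay, sample, parent).
values = {
    ColumnKey("rnaseq", "rna_1", "P1"): 6.0,
    ColumnKey("rnaseq", "rna_2", "P1"): 8.0,  # replicate for the same patient
    ColumnKey("rnaseq", "rna_3", "P2"): 5.0,
}

def roll_up(values):
    """Collapse samples to one value per (assay, parent) by averaging.

    Averaging is an assumption; a dataset creator might prefer another rule.
    """
    grouped = defaultdict(list)
    for key, value in values.items():
        # No join against sample_info is needed: the parent is in the key.
        grouped[(key.assay, key.parent)].append(value)
    return {group: mean(vals) for group, vals in grouped.items()}

per_patient = roll_up(values)
```

Carrying the parent in the key trades a little redundancy for not having to look it up on every access, which is the trade-off described above. Rolling up this way discards the replicate structure, so it suits patient-level tables better than analyses that need replicates.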

lianos commented 6 years ago

@phaverty I need to chew on it some more, but I just don't think that keeping the schema as-is and just relying on the "sample covariate tree climbing" to leverage the parent_id in the sample_info table will be too difficult ...

As for priorities, I agree that this can still be a bit further down the road (although I'm happy @VRouilly put this down here so we keep it "on the brain"). Being able to do this successfully will allow us to better model some more complex relationships in a spiffy manner, but we can get away with a lot already the way things are ...