atolopko-czi opened this issue 11 months ago
@pablo-gar - Thanks for the product requirements write-up. Some thoughts (@atolopko-czi @atarashansky - please advise as well):

As an alternative, one could specify a "feature filter", a "group A obs query", and a "group B obs query", and get rid of the "obs filter". This identifies group A and group B more explicitly. I am not advocating for it. In fact, there are failure modes with the latter: one could specify a "group A" and a "group B" that are entirely disjoint, in that there are no groups of obs rows between the two populations greater than 1.

So is the approach you laid out - "obs filter", "feature filter", "group A assertion", "group B assertion" - the convention that users will be most comfortable with, rather than any other convention? For example, the current code requires the user to provide an "obs filter query" and a "treatment" attribute (e.g. `sex_ontology_term_id`) such that the "treatment" attribute has only 2 distinct values in the result set returned by the "obs filter query" - this effectively determines "group A" and "group B". Is this way intuitive enough for users?
I think alignment on (1), (2), and (3) is enough to get started with a tech spec
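For illustration, the current convention described above could be sketched roughly as follows. This is a minimal plain-Python sketch with a hypothetical function name, not the actual implementation (which operates on obs dataframes returned by the obs filter query):

```python
# Hypothetical sketch: given the rows returned by an "obs filter query" and the
# name of a "treatment" attribute, derive group A and group B. The treatment
# attribute must take exactly 2 distinct values in the result set.
def split_groups(obs_rows, treatment):
    values = sorted({row[treatment] for row in obs_rows})
    if len(values) != 2:
        raise ValueError(
            f"expected exactly 2 distinct values for {treatment!r}, got {len(values)}"
        )
    group_a = [row for row in obs_rows if row[treatment] == values[0]]
    group_b = [row for row in obs_rows if row[treatment] == values[1]]
    return group_a, group_b


# Example usage with sex_ontology_term_id as the treatment attribute.
rows = [
    {"sex_ontology_term_id": "PATO:0000383"},
    {"sex_ontology_term_id": "PATO:0000384"},
    {"sex_ontology_term_id": "PATO:0000383"},
]
group_a, group_b = split_groups(rows, "sex_ontology_term_id")
```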
> As an alternative, one could specify: "feature filter", "group A obs query", "group B obs query" and get rid of "obs filter". This more explicitly identifies group A and group B. I am not advocating for it.

Note that you'd still have to verify that each individual query has a single-valued treatment column and that the two query results produce rows with different treatment values.

> In fact, there are failure modes with the latter in that one could specify a "group A" and "group B" that are entirely disjoint in that there are no groups of obs rows between the 2 populations greater than 1.

They should be entirely disjoint, but maybe I'm misunderstanding your concern.
> Note that you'd still have to verify that each individual query has a single-valued treatment column and that the two query results produce rows with different treatment values.

@atolopko-czi - So are you saying that each group is distinguished by the values in the treatment column, such that the intersection of unique values for the treatment column between group A and group B is the empty set? That is, `set(groupA.treatment.unique_values) & set(groupB.treatment.unique_values)` is `{}` (empty)? If so, then that implies that `len(treatment.unique_values)` across both groups could be any positive integer, say 5, but our current implementation asserts that `len(treatment.unique_values) == 2` across both groups, right?
> They should be entirely disjoint, but maybe I'm misunderstanding your concern.

@atolopko-czi - Oh I see. They are entirely disjoint precisely because `set(groupA.treatment.unique_values) & set(groupB.treatment.unique_values) == {}`, correct?
@pablo-gar @atolopko-czi - Regarding the performance requirements:

> Can complete under 10 min for large comparisons (millions of cells).

The new manuscript mentions that this case of 10^6 cells takes 13 min on a single core (non-parallelized) and 2-3 min on 6 cores (parallelized) for the vanilla method (using bootstrapping). I assume most Census users will have a multicore machine available (at least 2 cores, if not 4+).

Is the 10 min runtime requirement for a non-parallel implementation of census-memento? (I am calling it "census memento" vs. "vanilla memento" to distinguish our implementation from the sampling-and-bootstrapping implementation in the paper.)
[Re-written by @pablo-gar]
Context
Census launched in 2023 with its baseline functionality for efficient data access to the largest aggregation of standardized single-cell data, providing a stable data structure and API.
Census has demonstrated its utility in accelerating workflows for “expert data analysts”, reducing >3 months of work to less than a day, in particular around data gathering and wrangling. Please see the user segments document.
In 2024 we aim to continue our support of “expert data analysts” by improving upon existing features around data access patterns, but our bigger focus will become “data and tool consumers” and high-value “tool builders” (defined as those with high reach to “data and tool consumers”).
One of the main workflows that “data and tool consumers” follow when analyzing single-cell data is differential gene expression to identify underlying biological sources that differentiate different groups of cells.
This epic encapsulates user-stories and requirements for Census to support differential gene expression.
Stories
Product requirements
An experimental API that can perform differential gene expression between any two groups of cells from Census as defined by the Census metadata while accounting for known batch effects.
Refinements needed
Differential gene expression challenges with Census-scale data
Potential methods to perform comparisons
Memento
WIP
T-test with batch effect correction (linear regression with covariates)
WIP
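Since this item is still WIP, here is only a minimal sketch of the idea, assuming a single gene, a binary treatment, and one known binary batch covariate, on toy simulated data. This is an illustration of ordinary least squares with a covariate and a t-test on the treatment coefficient, not the Census implementation:

```python
import numpy as np

# Toy simulated data (hypothetical): expression for one gene with a true
# treatment effect of 0.5 and a batch effect of 1.0.
rng = np.random.default_rng(0)
n = 200
treatment = rng.integers(0, 2, n)
batch = rng.integers(0, 2, n)
expr = 0.5 * treatment + 1.0 * batch + rng.normal(0.0, 1.0, n)

# Design matrix with intercept, treatment, and batch covariate; fitting this
# jointly adjusts the treatment estimate for the known batch effect.
X = np.column_stack([np.ones(n), treatment, batch])
beta, *_ = np.linalg.lstsq(X, expr, rcond=None)

# Residual variance, coefficient covariance, and the t-statistic for the
# treatment coefficient (beta[1]).
resid = expr - X @ beta
dof = n - X.shape[1]
sigma2 = (resid @ resid) / dof
cov = sigma2 * np.linalg.inv(X.T @ X)
t_stat = beta[1] / np.sqrt(cov[1, 1])
```

The p-value would then come from the t distribution with `dof` degrees of freedom; per-gene fits like this are embarrassingly parallel across genes, which is relevant to the runtime discussion above.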