Open sjfleming opened 1 year ago
It looks like adata_manager.summary_stats is the way to access the information we want:
https://docs.scvi-tools.org/en/stable/tutorials/notebooks/data_tutorial.html#Summary-Stats
It may be tricky to use adata.obs["perturbation"].nunique() to set up the design matrix P. The Norman et al. data has an obs column that stores perturbation info in a format like "CEBPA" for a single perturbation and "CEBPE_CEBPA" for a double perturbation, whereas the number of columns of P should be the number of unique single perturbations.
I also noticed that other datasets have more than two perturbations per cell, and they may use "," or "/" to separate individual perturbations. Some have more complex experimental designs. For example, immune cells may be treated with a chemical drug like interferon alpha as well as a CRISPR knockdown.
I feel a simple function using one column of obs would not accommodate different kinds of experimental designs.
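To make the mismatch concrete, here is a minimal sketch of how a combined label column could be expanded into a multi-hot matrix with one column per single perturbation. The helper name build_perturbation_design_matrix, the separator pattern, and the "control" label are assumptions for illustration and are not part of CellCap. Note that splitting on "_" would break for perturbation names that themselves contain an underscore, which is part of why a single obs column is fragile.

```python
import re

import numpy as np
import pandas as pd


def build_perturbation_design_matrix(labels, sep_pattern=r"[_,/]", control_labels=("control",)):
    """Hypothetical helper: return (P, names) with one column per single perturbation."""
    # Split each combined label into individual perturbations and drop control tokens.
    tokens = []
    for label in labels:
        parts = [t.strip() for t in re.split(sep_pattern, str(label))]
        tokens.append([t for t in parts if t and t not in control_labels])

    # One column per unique single perturbation, in sorted order.
    names = sorted({t for cell in tokens for t in cell})
    column = {name: j for j, name in enumerate(names)}

    # Multi-hot encoding: a cell with two perturbations has two ones in its row.
    P = np.zeros((len(tokens), len(names)), dtype=int)
    for i, cell in enumerate(tokens):
        for t in cell:
            P[i, column[t]] = 1
    return P, names


# Toy obs column in the style described above ("control" is an assumed label).
obs = pd.DataFrame({"perturbation": ["CEBPA", "CEBPE_CEBPA", "control", "CEBPE"]})
P, names = build_perturbation_design_matrix(obs["perturbation"])
```

In this toy example, obs["perturbation"].nunique() counts four distinct labels, while P has only two columns (CEBPA and CEBPE), so the two numbers disagree as soon as combinations or controls are present.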
You bring up a good point.
CellCap has two kinds of design matrix in the manuscript: $P_{np}$ and $D_{nc}$.
There are cases where these could be constructed from various adata.obs columns, but there are also cases where these cannot be inferred from an adata.obs column.
For example, if we care about batch and donor as covariates, we can infer $D_{nc}$ from adata.obs['batch'] and adata.obs['donor'].
In the case you mentioned, $P_{np}$ cannot be inferred from adata.obs['perturbation'], and so we also cannot infer n_drug (we also need to rename these variables, because n_drug is not always a correct description).
I think a good way to handle this could be to have the user input the design matrices $P_{np}$ and $D_{nc}$ as adata.obsm slots, rather than adata.obs columns.
We can still get rid of input arguments like n_donor, n_drug, and n_labels, because n_drug would be the number of columns of $P_{np}$, which would be something like adata.obsm['perturbation_design_matrix'].shape[1].
Check out this as an example of how a user could create the $D_{nc}$ matrix:
pd.get_dummies(adata.obs[['batch', 'donor']]).astype(int)
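A slightly fuller sketch of the same idea, using a plain pandas DataFrame as a stand-in for adata.obs so that it runs without AnnData. The toy values and the obsm key name covariate_design_matrix in the comment are assumptions for illustration only.

```python
import pandas as pd

# Stand-in for adata.obs; obs columns in AnnData are typically categorical.
obs = pd.DataFrame(
    {
        "batch": ["b1", "b1", "b2", "b2"],
        "donor": ["d1", "d2", "d1", "d3"],
    }
).astype("category")

# One-hot encode each covariate and concatenate the blocks column-wise.
D = pd.get_dummies(obs[["batch", "donor"]]).astype(int)

# The number of covariate columns no longer needs to be supplied by the user.
n_covariate_columns = D.shape[1]

# With a real AnnData object, the matrix could then be stored as, for example:
# adata.obsm["covariate_design_matrix"] = D.to_numpy()  # key name is an assumption
```

Each row of D has exactly one 1 per covariate (one for batch, one for donor), and the column count is read from the matrix with .shape[1] rather than passed in as a separate argument.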
Rework setup_anndata() to use CategoricalObsField instead of ObsmField (https://docs.scvi-tools.org/en/stable/tutorials/notebooks/data_tutorial.html#Recording-AnnData-state-with-object-registration). This will reduce data pre-processing for the user. Right now, the user has to take a categorical field from adata.obs and turn it into a (one-hot) encoded adata.obsm field.
Along with this, the following input arguments for CellCapModel are kind of annoying for a user to have to supply, because if they make a mistake, the code will error.
Current input arguments include
These inputs can all be inferred from adata. For example, if "donor_key" points to adata.obs["donor"], then we know that n_donor is adata.obs["donor"].nunique(). In CellCap, we can figure out how to pull the relevant information from the registered adata and pass those parameters to CellCapModel, instead of relying on the user to input those values.
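As a rough sketch of that inference step, the counts could be derived from the obs columns the user registers. The helper infer_category_counts and the "label" column are hypothetical, and a plain DataFrame stands in for adata.obs; this is not how CellCap currently does it.

```python
import pandas as pd


def infer_category_counts(obs, keys):
    """Hypothetical helper: map each argument name to the unique-value count of its obs column."""
    counts = {}
    for arg_name, column in keys.items():
        # Fail early with a clear message if a registered key is missing.
        if column not in obs.columns:
            raise KeyError(f"obs has no column {column!r} (needed for {arg_name})")
        counts[arg_name] = int(obs[column].nunique())
    return counts


# Stand-in for adata.obs with a donor column and an assumed "label" column.
obs = pd.DataFrame(
    {
        "donor": ["d1", "d2", "d1", "d3"],
        "label": ["T", "B", "T", "T"],
    }
)
counts = infer_category_counts(obs, {"n_donor": "donor", "n_labels": "label"})
```

The resulting dictionary could then be passed along internally, so a user-supplied count can never disagree with the data.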