Rework `setup_anndata()` to use CategoricalObsField

sjfleming commented 1 year ago

Rework setup_anndata() to use CategoricalObsField instead of ObsmField (https://docs.scvi-tools.org/en/stable/tutorials/notebooks/data_tutorial.html#Recording-AnnData-state-with-object-registration). This will reduce data pre-processing for the user: right now, the user has to take a categorical field from adata.obs and turn it into a one-hot encoded adata.obsm field.
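For context, here is a minimal sketch of the manual step described above. It uses a plain pandas DataFrame as a stand-in for adata.obs, and the column name "donor" and the obsm key in the comment are only illustrative assumptions, not names taken from CellCap.

```python
import numpy as np
import pandas as pd

# Stand-in for adata.obs; the "donor" column is only an example.
obs = pd.DataFrame({"donor": ["d1", "d2", "d1", "d3"]})

# The pre-processing the user currently does by hand:
# categorical column -> one-hot matrix with one column per category.
donor_onehot = pd.get_dummies(obs["donor"]).to_numpy(dtype=np.float32)

# With a real AnnData object this would then be stored as, e.g.,
# adata.obsm["donor_onehot"] = donor_onehot   (hypothetical key name)
print(donor_onehot.shape)
```

Registering the obs column directly would make this step unnecessary for simple categorical covariates.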

Along with this, the following input arguments for CellCapModel are a burden for the user to supply, because a wrong value will make the code error.

Current input arguments include

n_labels
n_drug
n_donor

These inputs can all be inferred from adata. For example, if "donor_key" points to adata.obs["donor"], then we know that n_donor is adata.obs["donor"].nunique(). In CellCap, we can figure out how to pull the relevant information from the registered adata and we can pass those parameters to CellCapModel, instead of relying on the user to input those values.
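A rough sketch of that inference, assuming the relevant columns are plain categorical columns of adata.obs. The helper name infer_category_counts and the column names are hypothetical, and a pandas DataFrame stands in for adata.obs.

```python
import pandas as pd


def infer_category_counts(obs: pd.DataFrame, keys: dict) -> dict:
    """Map model-argument names to the number of unique values in obs columns.

    `keys` maps an argument name (e.g. "n_donor") to an obs column name.
    This is an illustrative helper, not part of CellCap.
    """
    return {arg: int(obs[col].nunique()) for arg, col in keys.items()}


# Stand-in for adata.obs with example column names.
obs = pd.DataFrame(
    {
        "donor": ["d1", "d2", "d1"],
        "cell_type": ["T", "B", "T"],
    }
)

counts = infer_category_counts(obs, {"n_donor": "donor", "n_labels": "cell_type"})
print(counts)
```

The resulting dictionary could be passed on to the model constructor instead of asking the user for the same numbers.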

sjfleming commented 1 year ago

It looks like adata_manager.summary_stats is the way to access the information we want: https://docs.scvi-tools.org/en/stable/tutorials/notebooks/data_tutorial.html#Summary-Stats

ImXman commented 6 months ago

It may be tricky to use adata.obs["perturbation"].nunique() to set up the design matrix P. The Norman et al. data has an obs column that stores perturbation info in a format like "CEBPA" for a single perturbation and "CEBPE_CEBPA" for a double perturbation. The design matrix P has one column per unique single perturbation, so its width is not the number of unique labels in that column.
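A small sketch of the mismatch, assuming labels joined by "_" as in the examples above. The label values are toy data, and splitting on "_" is itself an assumption that breaks if a perturbation name contains an underscore.

```python
import numpy as np
import pandas as pd

# Toy perturbation labels in the "A" / "A_B" style described above.
labels = pd.Series(["CEBPA", "CEBPE_CEBPA", "CEBPE", "CEBPA"])

# Counting unique labels treats the double perturbation as its own category.
n_unique_labels = labels.nunique()

# Splitting each label recovers the individual perturbations.
tokens = [label.split("_") for label in labels]
perturbations = sorted({t for toks in tokens for t in toks})
column = {p: j for j, p in enumerate(perturbations)}

# Multi-hot design matrix: one row per cell, one column per single perturbation.
P = np.zeros((len(labels), len(perturbations)), dtype=np.float32)
for i, toks in enumerate(tokens):
    for t in toks:
        P[i, column[t]] = 1.0

print(n_unique_labels, P.shape)
```

Here nunique() counts three labels while P has only two columns, which is why the column count cannot be read off nunique() directly.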

I also noticed that other datasets have more than 2 perturbations in each cell, and they may use "," or "/" to separate individual perturbations. Some have more complex experimental designs. For example, immune cells are treated with a chemical drug like interferon alpha as well as a CRISPR knockdown.

I feel a simple function using one column of obs would not accommodate different kinds of experimental designs.

sjfleming commented 6 months ago

You bring up a good point.

CellCap has two kinds of design matrix: $P_{np}$ and $D_{nc}$ in the manuscript.

There are cases where these could be constructed from various adata.obs columns, but there are also cases where these cannot be inferred from an adata.obs column.

For example, if we care about batch and donor as covariates, we can infer $D_{nc}$ from adata.obs['batch'] and adata.obs['donor'].

In the case you mentioned, $P_{np}$ cannot be inferred from adata.obs['perturbation'], and so we also cannot infer n_drug (we also need to rename these variables, because n_drug is not always a correct description).

I think a good way to handle this could be to have the user input the design matrices $P_{np}$ and $D_{nc}$ as adata.obsm slots, rather than adata.obs columns.

We can still get rid of input arguments like n_donor, n_drug, and n_labels, because n_drug would be the number of columns of $P_{np}$, which would be something like adata.obsm['perturbation_design_matrix'].shape[1].
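As a sketch of that idea, with a plain dict of NumPy arrays standing in for adata.obsm: the key 'covariate_design_matrix' and the variable names n_perturbations and n_covariates are hypothetical, chosen only for illustration.

```python
import numpy as np

# Stand-in for adata.obsm: 5 cells, 3 perturbations, 4 covariate columns.
obsm = {
    "perturbation_design_matrix": np.zeros((5, 3)),
    "covariate_design_matrix": np.zeros((5, 4)),  # hypothetical key name
}

# The sizes the user currently passes by hand are just the column counts.
# Note that .shape is a tuple, so it is indexed with [1], not called.
n_perturbations = obsm["perturbation_design_matrix"].shape[1]
n_covariates = obsm["covariate_design_matrix"].shape[1]

print(n_perturbations, n_covariates)
```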

sjfleming commented 6 months ago

Check out this

pd.get_dummies(adata.obs[['batch', 'donor']]).astype(int)

as an example of how a user could create the $D_{nc}$ matrix
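A slightly fuller sketch of the same call on toy data, with a pandas DataFrame standing in for adata.obs. The category values and the obsm key in the comment are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for adata.obs with two categorical covariates.
obs = pd.DataFrame(
    {
        "batch": ["b1", "b1", "b2", "b2"],
        "donor": ["d1", "d2", "d1", "d3"],
    }
)

# One indicator column per (covariate, category) pair, as 0/1 integers.
D = pd.get_dummies(obs[["batch", "donor"]]).astype(int)

# Dense array, as it could be stored in an obsm slot, e.g.
# adata.obsm["covariate_design_matrix"] = D_matrix   (hypothetical key name)
D_matrix = D.to_numpy()

print(list(D.columns))
print(D_matrix.shape)
```

One caveat: by default pd.get_dummies only encodes columns with object, string, or category dtype, so a batch column stored as integers would need to be cast first (for example with .astype("category")).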

broadinstitute / CellCap #15