
AnnData Conversion Design Doc #1073

Closed srivarra closed 8 months ago

srivarra commented 10 months ago


Design Doc - AnnData Conversion

This design doc proposes a new schema utilizing AnnData. AnnData is a data standard developed and maintained by Scverse with the goal of handling annotated data matrices in memory and on disk. It offers many efficient features, including sparse data support, lazy operations (via Dask), PyTorch integration, and Zarr support for on-disk storage.

Background

To support Vitessce and Napari integration, the cleanest path is to convert our Cell Table and miscellaneous files to AnnData. This also brings many additional benefits to our pipeline and would be a significant improvement over our current data schema.

While our current data schema can be "jury-rigged" for ingestion with Vitessce and Napari, I would strongly recommend against this for the following reasons:

  1. Significantly more maintenance to account for updates and to solve edge cases with users' data (most important).
    1. It would also mean maintaining two separate schemas: one for Vitessce/Napari and one for our current pipeline.
  2. Performance inefficiencies with Pandas, xarray, etc. on large files, which lack lazy-loading support.
    1. GPU acceleration is also in the works for AnnData, which would be a huge benefit for our pipeline; see rapids-singlecell.
  3. Utilizing AnnData comes with the following non-technical benefits:
    1. Good support from the Scverse / AnnData developers through their documentation, Discourse, Zulip, etc. They've been very helpful in my experience.
    2. Community engagement and more users for our algorithms, pipelines, and workflows:
      1. Many users already use AnnData / Scverse software packages, which would make it easier for other community members to adopt our workflows.
  4. We can take advantage of community packages developed around AnnData, broadening the set of analysis tools available to us. A few interesting examples:
    1. Squidpy: spatial single-cell analysis in Python.
      1. Notable features include built-in graph-based spatial clustering, neighborhood enrichment, centrality measures, and plotting utilities.
    2. pertpy: handles perturbation data from control and perturbed samples (e.g., comparing the effects of known drugs to new drugs).
    3. scvi-tools: a collection of models and utilities for developing probabilistic methods for single-cell data.
  5. Less custom data-ingestion code to write, leaving more time for developing algorithms, improving user workflows, documentation, and more.
  6. Built-in PyTorch support, making it easier and more efficient to write novel deep learning models.

Design Overview

AnnData Approach


AnnData is a data structure consisting of matrices, annotated by DataFrames and Indexes.

An AnnData object is composed of the following components: X, obs, var, obsm, varm, obsp, varp, and uns.

Each of these components has specific use cases, described below:

1. X, var, obs


X is a matrix of shape (n_obs, n_vars) where n_obs is the number of observations and n_vars is the number of variables.

For Ark and single-cell spatial analysis, n_obs is the number of segmented regions or objects of interest. These can be cell segmentations, or more complex objects such as nuclei masks or object masks. Whatever the unit, it should be the smallest, most atomic unit of analysis.

obs_names is a Pandas Index where each value is a unique identifier for an observation. This is usually a string, but can be any hashable type. These must be unique; there are plenty of helper utilities to enforce this and to create unique IDs if needed.

n_vars is the number of variables, which can be any number of features. In non-spatial single-cell analysis this is usually the number of genes. For spatial analysis, this would be the aggregated channel information for each segmented region, i.e., the channel subset of our cell table. The names of these channels form the var_names Index, similar to obs_names.

var is a DataFrame of shape (n_vars, n_var_features), where the index is var_names. Here you can add attributes to each channel; I'm not entirely sure yet what belongs here, but it's a good place to store per-channel metadata nonetheless.

obs is a DataFrame of shape (n_obs, n_obs_features), where the index is obs_names. Here you can add attributes to each observation (segmentation), such as the centroid, area, moments, etc. Essentially, this is where region properties would be stored; it's best to keep physical / geometric attributes here. We would also store Pixie clusters and Nimbus predictions here, as one column each.
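
To make this concrete, here is a minimal sketch of how X, obs, and var fit together. The channel names, regionprops columns, and values below are hypothetical placeholders, not our actual table schema:

import anndata as ad
import numpy as np
import pandas as pd

n_obs, n_vars = 3, 2  # e.g. 3 segmented cells, 2 channels

# X: aggregated channel intensity per segmented region, shape (n_obs, n_vars)
X = np.random.rand(n_obs, n_vars)

# obs: one row per observation, indexed by unique obs_names
obs = pd.DataFrame(
    {"area": [120, 98, 143], "centroid_x": [10.2, 44.7, 80.1]},
    index=["cell_0", "cell_1", "cell_2"],
)

# var: one row per channel, indexed by var_names
var = pd.DataFrame(index=["CD4", "CD8"])

adata = ad.AnnData(X=X, obs=obs, var=var)
adata.obs_names_make_unique()  # one of the built-in helpers for de-duplicating IDs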

2. obsm, varm


obsm is a mapping from string keys to matrices of shape (n_obs, a), where a is any arbitrary integer. It contains observation-level matrices. For example, X_umap would store the UMAP embedding of the matrix X, and X_pca would store the PCA embedding of X.

varm is a mapping from string keys to matrices of shape (n_vars, b), where b is any arbitrary integer. It contains variable-level matrices, for example Marker_umap. This creates an open slot for various marker-level embeddings, additional derived features, etc.
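
As a sketch, continuing the toy object above (the embeddings here are random placeholders rather than real UMAP/PCA output):

import numpy as np

adata.obsm["X_umap"] = np.random.rand(adata.n_obs, 2)        # per-observation 2-D embedding
adata.varm["Marker_umap"] = np.random.rand(adata.n_vars, 3)  # per-variable embedding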

3. obsp, varp


obsp is a square matrix of shape (n_obs, n_obs), and its purpose is to store pairwise computations between observations. For example, the results of neighborhood analysis can be saved here.

varp is a square matrix of shape (n_vars, n_vars), and its purpose is to store pairwise computations between variables. While I cannot think of a current use case for this, it could be useful for Rosetta in Toffy, for example. Similar to varm, this is an open slot for further use cases we may find.
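
A sketch of both slots, again continuing the toy object above (the keys and the use of centroid distances are illustrative assumptions):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# obsp: pairwise distances between cell centroids, shape (n_obs, n_obs)
coords = adata.obs[["centroid_x"]].to_numpy()
adata.obsp["centroid_distances"] = squareform(pdist(coords))

# varp: pairwise channel-channel correlations, shape (n_vars, n_vars)
adata.varp["channel_correlation"] = np.corrcoef(adata.X, rowvar=False)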

4. uns


uns is a free slot for storing anything! It's a mapping from a string label to whatever we want, of the form str -> Union[DataFrame, Matrix, list, str, ...]. It's ideal for storing cohort-level metadata, such as colors, plotting conventions, styles, etc.
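
For example (the keys below are hypothetical conventions, not an established schema):

adata.uns["channel_colors"] = {"CD4": "#1f77b4", "CD8": "#ff7f0e"}  # plotting colors
adata.uns["cohort_name"] = "example_cohort"                          # cohort-level metadata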

Transitioning to AnnData and Implementing it in our pipeline

There are several ways in which we can make the transition over to AnnData.

  1. Gut the current Cell Table logic in ark-analysis and refit the codebase with AnnData support. Almost a rewrite.
    1. This means rewriting the core logic of every file, and almost every function.
  2. Create a notebook / utility functions to convert our current data schema to an AnnData object, letting users work with other libraries downstream.
  3. A SpatialData implementation of Ark. MVP at angelolab/ark-spatial. Will discuss further on Thursday.

We should solidify how we transition over to AnnData.

Usage Examples / Pseudocode

This section contains some examples of how we can utilize AnnData.

1. Creating an AnnData object from scratch and saving it

Assuming we have an exploded view of our cell table, with the following components: cell_table_markers, cell_table_regionprops, channels, and umap_matrices, we can reconstruct an AnnData object from them.

from anndata import AnnData

# X: marker expression, obs: region properties, var: channel metadata,
# obsm: a mapping of observation-level matrices, e.g. {"X_umap": ...}
adata = AnnData(X=cell_table_markers, obs=cell_table_regionprops, var=channels, obsm=umap_matrices)

We can save it to disk with the following:

# Prefer `.ome.zarr` instead for the file extension, as it's more "technically correct" than just ".zarr"
# Doesn't change anything otherwise
adata.write_zarr("my_cohort.ome.zarr")
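
Reading it back is symmetric:

import anndata as ad

# Round-trip: load the cohort back from the Zarr store written above
adata = ad.read_zarr("my_cohort.ome.zarr")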

2. Concatenation of several AnnData Objects

Assuming we have several AnnData objects, we can concatenate them in the following ways:

list[AnnData] -> AnnData

Here we concatenate a list of AnnData objects into a single AnnData object.

from anndata import concat

adata = concat([adata1, adata2, adata3])

list[AnnData] -> AnnCollection

Here we concatenate a list of AnnData objects into a single AnnCollection object. An AnnCollection lazily subsets data via a joint index of observations and variables, and supports lazy, on-the-fly DataFrame operations such as selection and reduction.

See the tutorial for AnnCollection. There is also an associated PyTorch data loader, AnnLoader.


from anndata.experimental import AnnCollection, AnnLoader

dataset = AnnCollection([adata1, adata2, adata3])

# You can also always convert an AnnCollection to an AnnData object
adata = dataset.to_adata()

dataloader = AnnLoader(dataset, batch_size=256, use_cuda=False)
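
A minimal sketch of consuming the loader (assuming AnnLoader's default converter, which turns arrays into torch tensors):

# Each batch is an AnnData-like view over `batch_size` observations
for batch in dataloader:
    x = batch.X  # tensor of shape (batch_size, n_vars)
    # ... forward pass, loss computation, etc.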

3. Common Operations

You can also perform generic DataFrame operations on an AnnData / AnnCollection object. Say we would like to subset our AnnData object to only include cells with a certain property, such as a particular cell type. We can do this as follows:


adata_dataset[adata_dataset.obs["phenotype"] == "T Cell"]

This should look very familiar from the Pandas API. You can also perform groupbys, map-reduce operations, and queries, just like in Pandas.
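
For instance, a pandas-style aggregation over obs (a sketch; the phenotype column is assumed to exist):

# Mean marker expression per phenotype, using X as a DataFrame
df = adata.to_df()  # X indexed by obs_names / var_names
df.groupby(adata.obs["phenotype"]).mean()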

Storage and Ingestion

AnnData files can be stored in several formats; the one I am focusing on is Zarr. H5AD is another available option, as is Loom.

Zarr is a storage format designed for NumPy arrays and any "array-like" data structure (including DataFrames). It's compressible, "chunk-able", and streamable (i.e., parts of a large file can be read from a remote server; Vitessce takes advantage of this property).

Zarr is especially well suited for parallel read and write operations. For example, multiple writers can operate on a set of chunks as long as they do not write to the same chunk; this is handled for us by AnnData and Dask, so we don't have to worry about it. In addition, Zarr supports several storage backends, from the common fsspec to Redis and SQLite for more complex configurations.
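
Because the result is a plain Zarr hierarchy, it can also be opened directly with the zarr library (a sketch assuming the store written earlier):

import zarr

# X, obs, var, obsm, ... appear as groups/arrays in the hierarchy
store = zarr.open("my_cohort.ome.zarr", mode="r")
print(store.tree())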

In addition, the format is being adopted by an increasing number of vendors; see the following Nature Technology Feature.

Many geospatial workflows also use Zarr for storage; see Pangeo, for example, where it's the preferred storage format.

Misc Benefits

In this section I've listed a few miscellaneous benefits of transitioning to AnnData that didn't fit in with the main talking points.

  1. Cohort file / structure simplicity. Currently, a user's cohort consists of many nested folders containing CSVs, pickle files (one of the spatial workflows uses these, I believe), and Feather files.
    1. We can replace all of them with a single AnnData object and write some simple functions to export CSVs if users would like them (see the sketch after this list).
  2. It provides a foundational data structure, which is something Ark explicitly lacks.
  3. R users developing with Seurat can convert their Seurat objects to AnnData objects.
  4. We can utilize many data / model frameworks, allowing users to organize their workflows in a more modular fashion, with reporting, logging, and other features built in.
    1. LaminDB - Data Framework for Biology
    2. MLflow
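
Regarding item 1, a minimal sketch of such an export helper (the function name and file names are hypothetical):

import anndata as ad

def export_cohort_csvs(adata: ad.AnnData, out_dir: str) -> None:
    """Write the pieces of an AnnData cohort back out as CSVs."""
    adata.to_df().to_csv(f"{out_dir}/cell_table_markers.csv")  # X as a DataFrame
    adata.obs.to_csv(f"{out_dir}/cell_table_regionprops.csv")  # per-cell metadata
    adata.var.to_csv(f"{out_dir}/channels.csv")                # per-channel metadata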

Timeline

TBD

Overall, AnnData provides a well-structured, performant, open-source, and flexible data structure for our pipeline. It's well-supported, has a large community, and lowers the barrier to entry for community users to adopt our pipelines and algorithms.

PRs

We should consider making a Milestone for extremely large and complex features with many moving parts.

srivarra commented 10 months ago

@jranek @camisowers @alex-l-kong @ngreenwald Let me know what you guys think.

ngreenwald commented 10 months ago

This is great, thanks! Is it fair to say that once we have the cell table, converting it to AnnData format would be quite easy? These lines make me think that it will be super easy to adapt any downstream analyses:

[screenshot of the code snippets above]

If that's the case, I definitely think it makes sense to start adopting it as we develop new downstream analysis code.

srivarra commented 10 months ago

@ngreenwald Yes, the conversion would be reasonably straightforward. Most of the work would be converting all the functions to make use of AnnData.

alex-l-kong commented 10 months ago

This looks great! We should think about splitting up the conversion process into multiple smaller PRs, each associated with one component of the pipeline that needs to use AnnData.

Perhaps we could create an ann_data PR that serves as a temporary "main" for this conversion process, then incrementally build it up with several sub-PRs.