Closed srivarra closed 8 months ago
@jranek @camisowers @alex-l-kong @ngreenwald Let me know what you guys think.
This is great, thanks! Is it fair to say that once we have the cell table, converting it to AnnData format would be quite easy? These lines make me think that it will be super easy to adapt any downstream analyses:
If that's the case, I definitely think it makes sense to start adopting it as we develop new downstream analysis code.
@ngreenwald Yes, the conversion would be reasonably straightforward. Most of the work would be converting all the functions to make use of `AnnData`.
This looks great! We should think about splitting up the conversion process into multiple smaller PRs, each associated with one component of the pipeline that needs to use `AnnData`. Perhaps we could create an `ann_data` PR that serves as a temporary "main" for this conversion process, then incrementally build it up with several sub-PRs.
This is for internal use only; if you'd like to open an issue or request a new feature, please open a bug or enhancement issue
# Design Doc - `AnnData` Conversion

This design doc proposes a new schema utilizing `AnnData`. `AnnData` is a data standard developed and maintained by Scverse with the goal of handling annotated data matrices in memory and on disk. It offers many efficient features: sparse data support, lazy operations (`Dask`), `PyTorch` support, as well as `Zarr` support for disk storage.

## Background
In order to address Vitessce support and Napari integration, it's ideal to convert our Cell Table + misc files to `AnnData`. This also comes with many additional benefits for our pipeline, and would be a significant improvement over our current data schema.

While our current data schema can be "jury-rigged" for ingestion with `Vitessce` and `Napari`, I would strongly recommend against this for the following reasons:

- We would have to maintain two data schemas: one for `Vitessce` / `Napari` and one for our current pipeline.
- Our data would remain spread across several in-memory representations (`Pandas`, `XArray`, etc...).
- GPU acceleration is available for `AnnData`, which would be a huge benefit for our pipeline. See `rapids-singlecell`.

`AnnData` also comes with the following non-technical benefits:

- Support from the `AnnData` developers through their documentation, discourse, zulip, etc... They've been very helpful in my experience.
- Broad community use of `AnnData` / Scverse software packages, which would make it easier for other community members to utilize our workflows.
- Many tools accept `AnnData` for analysis, broadening the set of tools available to us.

## Design Overview

### `AnnData` Approach

`AnnData` is a data structure consisting of matrices, annotated by DataFrames and Indexes. An `AnnData` object is composed of the following components: `X`, `obs`, `var`, `obsm`, `varm`, `obsp`, `varp`, and `uns`. Each of these components has specific use cases, described below:
### 1. `X`, `var`, `obs`

`X` is a matrix of shape `(n_obs, n_vars)`, where `n_obs` is the number of observations and `n_vars` is the number of variables.

For Ark and Single Cell Spatial Analysis, `n_obs` is the number of segmented regions or objects of interest. These can be cell segmentations, or more complex objects such as nuclei masks or object masks. Whatever it is, it should be the smallest, most atomic unit of analysis. `obs_names` is a `Pandas` Index where each value is a unique identifier for each observation. This is usually a string, but can be any hashable type. These must be unique, and there are plenty of helper utilities to ensure this and create unique IDs if needed.

`n_vars` is the number of variables, which can be any number of features. In non-spatial single cell analysis this is usually the number of genes. For Spatial Analysis this would be the aggregated channel information for each segmented region, for each channel, i.e. the channel subset of our table. The names of these channels would make up the `var_names` Index, similar to `obs_names`.

`var` is a `DataFrame` of shape `(n_vars, n_var_features)`, where the index is `var_names`. Here you can add attributes to each channel; I'm not entirely sure what would work here, but it's a good place to store metadata about each channel nonetheless.

`obs` is a `DataFrame` of shape `(n_obs, n_obs_features)`, where the index is `obs_names`. Here you can add attributes to each observation (segmentation), such as the centroid, area, moments, etc... Essentially this would be where the region properties would be stored. It's best to keep physical / geometric attributes here. We would also store Pixie clusters and Nimbus predictions here, as a column each.

### 2. `obsm`, `varm`
Each entry in `obsm` is a matrix of shape `(n_obs, a)`, where `a` is any arbitrary integer. It contains observation-level matrices, stored as a mapping `str -> Matrix`. For example, `X_umap` would store the UMAP embedding of the sparse matrix `X`, and `X_pca` would store the PCA embedding of `X`.

Each entry in `varm` is a matrix of shape `(n_vars, b)`, where `b` is any arbitrary integer. It contains variable-level matrices, stored as a mapping `str -> Matrix`, for example `Marker_umap`. In addition, this creates an open slot for various marker-level embeddings, additional derived features, etc...

### 3. `obsp`, `varp`
Each entry in `obsp` is a square matrix of shape `(n_obs, n_obs)`, and its purpose is to store pairwise computations between observations. For example, the results of Neighborhood Analysis can be saved here.

Each entry in `varp` is a square matrix of shape `(n_vars, n_vars)`, and its purpose is to store pairwise computations between variables. While I cannot think of a current use case for this, it could be useful for Rosetta in `Toffy`, for example. Similar to `varm`, this is an open slot for further use cases we may find.

### 4. `uns`
`uns` is a free slot for storing anything! It's a mapping from a string label to whatever we want; it can take the form `str -> Union[DataFrame, Matrix, list, str, etc...]`. It is ideal for storing "cohort-level" metadata, such as colors, plotting conventions, styles, etc...

## Transitioning to `AnnData` and Implementing it in our Pipeline

There are several ways in which we can make the transition over to `AnnData`:

- Refit the `ark-analysis` codebase with `AnnData` support. Almost a rewrite.
- Provide conversion to `AnnData` so users can work with other libraries downstream.
- A `SpatialData` implementation of `Ark`. MVP at angelolab/ark-spatial. Will discuss further on Thursday.

We should solidify how we transition over to `AnnData`.

## Usage Examples / Pseudocode
The following section contains some examples of how we can utilize `AnnData`.

### 1. Creating an `AnnData` object from scratch and saving it

Assuming we have an exploded view of our cell table with the following components: `cell_table_markers`, `cell_table_regionprops`, `channels`, and `umap_matricies`, we can reconstruct an `AnnData` object from these components and save it to disk.
### 2. Concatenation of Several `AnnData` Objects
Assuming we have several `AnnData` objects, we can concatenate them together in the following ways:

- `list[AnnData] -> AnnData`: concatenates a list of `AnnData` objects into a single `AnnData` object.
- `list[AnnData] -> AnnCollection`: concatenates a list of `AnnData` objects into a single `AnnCollection` object. It lazily subsets data via a joint index of observations and variables, and also supports lazy-eval, on-the-fly DataFrame operations, like selection, reduction, etc...

See the tutorial for `AnnCollection`. There is also an associated `PyTorch` data loader, `AnnLoader`.
### 3. Common Operations
You can also perform generic `DataFrame` operations on an `AnnData` / `AnnCollection` object. Let's say we would like to subset our `AnnData` object to only include cells with a certain property, such as a certain cell type. This should look very familiar to the Pandas API; you can also perform groupby's, map-reduce methods, and queries, just like Pandas.
## Storage and Ingestion
`AnnData` files can be stored in a couple of ways; the one I am focusing on is `Zarr`. `H5AD` is another available option, as is `LOOM`.

`Zarr` is a storage format designed for NumPy arrays and any "array-like" data structure (including DataFrames). It's compressible, "chunk-able", and streamable (i.e. parts of a large file can be read from a faraway server; `Vitessce` takes advantage of this property). `Zarr` is especially well suited to parallel read and write operations. For example, multiple writers can operate on a set of chunks, as long as they do not write to the same chunk. This is handled for us through `AnnData` and `Dask`, so we don't have to worry about it. In addition, `Zarr` supports several backends, from the common FSSPEC to Redis and SQLite for more complex configurations.

The format is also being utilized by an increasing number of vendors; see the following Nature Technology Feature.
Many geospatial workflows also utilize `Zarr` for storage; see Pangeo for example, it's their preferred storage format.

## Misc Benefits
In this section I've listed a few miscellaneous benefits of transitioning over to `AnnData` which didn't fit in with the main talking points:

- It consolidates our outputs, which currently span `csvs`, pickle formats (one of the spatial workflows uses this, I believe), and feather files. Everything can live in the `AnnData` object, and we can write some simple functions to output `csvs` if users would like them.
- Interoperability with other ecosystems, which `Ark` explicitly does not have: `Seurat` can convert their `Seurat` objects to `AnnData` objects.
- `LaminDB` - Data Framework for Biology
- `MLflow`
## Timeline

TBD

Overall, `AnnData` provides a well-structured, performant, open-source, and flexible data structure for our pipeline. It's well-supported, has a large community, and lowers the barrier of entry for community users to implement our pipelines and algorithms.

## PRs
We should consider making a Milestone for extremely large and complex features with many moving parts.