Closed srivarra closed 11 months ago
See #1079 for the notebook where you can run it locally.
@alex-l-kong @camisowers @jranek @ngreenwald Take a look, run the notebook and let me know your guys' thoughts!
Great, let's discuss today to get everyone's thoughts
AnnData
Conversion Design Document Part 2Relevant Background
The purpose of this design document is to provide guidelines and generate ideas for an in-depth implementation of
AnnData
. It differs to angelolab/ark-analysis#1073, in where that was a high level overview on why. Here we will discuss how we will transition to this format.Design Overview / Code Walkthrough
Currently, we have decided to transition
AnnData
after Notebook 4, as it's a good point to get our feet wet without rewriting a large chunk ofArk
.Let's consider all the data associated with single FOV from the Example Dataset, FOV 0.
FOV 0 is $512px \times 512px$ image, consisting of $669$ individual cells.
We will walk through converting a majority of the data associated with FOV 0 to an
AnnData
table, as it's easiest to understand the conversion process by walking through a concrete example.You will need to run the following notebooks beforehand:
Let's get the following packages installed:
AnnData
Dask
Ray
optional, only used here to provide a superior scheduler forDask
Import packages, set up a local
Dask
cluster and get the example data.Cell Table (Post Notebook 4)
The
cell_table_size_normalized_cell_labels.csv
houses the following informationFor each cell we have the following types of metrics:
label
: The unique segmentation label for each cell, (integer
)markers
: The intensity of each marker for each cell, (float
), specifically the following markers:region properties
: This consists of physical metadata or measurements of each cell, (int
,float
), specifically the following:cell_size
,area
,eccentricity
,major_axis_length
,minor_axis_length
,perimeter
,convex_area
,equivalent_diameter
,centroid-0
,centroid-1
,major_minor_axis_ratio
,perim_square_over_area
,major_axis_equiv_diam_ratio
,convex_hull_resid
,centroid_dif
,num_concavities
fov
: A unique identifier for the FOV, (str
)"fov0"
.cell_meta_cluster
: A classification for the cell (str
), in the example dataset we have:To convert these to
AnnData
components we have:X
: The values for themarkers
obs
: Housesregion properties
,fov
,cell_meta_cluster
,labels
var_names
: The name of themarkers
(channel names)obs_names
: The celllabel
IDsSince we also have transformations ($\tt{arcsinh}$) of the cell-by-markers matrix, we should add those in as an additional
layer
forX
.We will use
Dask
DataFrames
to build up a computation graph and only execute it when needed.Subset the cell tables to only extract the channels for
var
.Subset the cell tables to only extract the properties for
obs
.Make sure to use Categorical
Dtypes
as they provide very efficient indexing / sub-setting for well, categorical properties.Here we create the
AnnData
object withX
,obs
and the $\tt arcsinh$ transformation ofX
. You might see the following warningImplicitModificationWarning
, where it converts the current index (usually an integer) to a string.The
layers
parameter is a mapping from a string key to a matrix of the same shape asX
, representing the same data just transformed by an operation, linear or otherwise. You can have multiple layers as well.X
gets saved as a NumPyNDArray
.We can view
X
as aDataFrame
by using theto_df()
method.Since we created that layer in the initialization of
fov0_adata
, we can also view it'sDataFrame
representation ofX
.Additional Components for
SquidPy
SupportIf we want to make use of
SquidPy
's Spatial Analysis tools, we need to add the following components to ourAnnData
object:spatial_key
: Several functions inSquidPy
make use of this key to identify the spatial coordinates of each cell ($X$, $Y$). This key lives inobsm
and is a 2D matrix of shape $n_{obs} \times 2$.gr.spatial_neighbors
We will also rename the
Centroid 0
andCentroid 1
columns to something more meaningful to demonstrate modifying anAnnData
object.Spatial Analysis Components
Here we will discuss how to convert many (not all) of the spatial analysis results to
AnnData
components.Distance Matrices
The distance matrices get saved as a
netcdf3
file, which can be loaded in withxarray
.The shape of the distance matrix for FOV 0 is $669 \times 669$, which we can generalize to of
n_obs
$\times$n_obs
.We can extract the underlying NumPy array from the
xarray.DataArray
with the.data
attribute.A quick sidenote on
.data
, it preserves the underlying array type, be it a Dask Array, in memory NumPy array, SciPy sparse array, or a CuPy GPU Array.obsp
is a container to store pairwise annotations of observations, so, the pairwise distances should be stored here. We'll use the key"distance"
for now to identify the distance matrix.We can view the current
AnnData
object with the newly added"distance"
attribute forobsp
.The neighborhood counts and frequencies are stored in the following files:
neighborhood_counts-cell_meta_cluster_radius50.csv
neighborhood_freqs-cell_meta_cluster_radius50.csv
We subset on those associated with FOV 0 and extract only the unique
cell_meta_cluster
labels. This gives us the following:The shapes would be
n_cell_meta_cluster
$\times$n_cell_meta_cluster
which is square, but doesn't fit in with any of the otherAnnData
specs for storing pairwise information. These would be stored inuns
.Cell Neighbors Analysis
The neighborhood diversity would be ideal for
obs
since there is a value per cell for each FOV. This would be an additional column. We can perform an outer merge withobs
DataFrame and theneighborhood_diversity_cell_meta_cluster_radius50.csv
DataFrame per FOV, on thelabel
column.Let's just select the diversity for FOV 0.
uns
is a catch-all container (dictionary) for unstructured annotations of the object. It's a good place to store parameters, metadata, and other information that doesn't fit into the other components ofAnnData
. In this case, we'll store the radius value for the neighborhood analysis.The cell distances in
cell_meta_cluster_avg_dists-nearest_5.csv
contains the average distance of the $k$ the closest cells. The shape of this isn_obs
$\times$n_cell_meta_cluster
. This shape fits withobsm
, so we can store it there or inuns
.While in genomic analyses, users generally place embeddings of their
X
matrix inobsm
, we can also place the cell distances here as well, anything goes as long as the shapes fit, we just have to make sure to document it.I have placed the
cell_meta_cluster_avg_dists-nearest_5
inobsm["cell_distances]
.Mixing Scores
The mixing scores are stored in the following files:
{popultation 1}_{population 2}-{mixing type}_mixing_score.csv
, and with the example dataset they areCD4_CD8-homogeneous_mixing_score.csv
,Just like the others, we will subset on just FOV 0.
This file has the columns:
fov
,mixing_score
,cell_count
,cell_ratio
, and a row per FOV.There are a few ways to add this to the
AnnData
object, the first is to add themixing_score
,cell_count
, thecell_ratio
,population 1
andpopulation 2
into theuns
such as shown below:Another approach would be to transform the mixing scores into Matrices, one for each FOV,
For example, here is the mixing score matrix for FOV 0, where (CD4T, CD8T) has been filled.
And fill out all the values for each FOV. This would be stored in
uns
as well. We would also store matrices like these forcell_count
andcell_ratio
fields.For now, I've stored it in
uns
using the first, nested dictionary approach.Fiber Segmentation
The fiber segmentation generates two files of interest:
fiber_object_table.csv
fov
,label
,centroid-0
,centroid-1
,major_axis_length
,minor_axis_length
,orientation
,area
,eccentricity
,euler_number
,alignment_score
fiber_stats_table.csv
fov
,pixel_density
,fiber_density
,avg_major_axis_length
,avg_minor_axis_length
,avg_orientation
,avg_area
,avg_eccentricity
,avg_euler_number
,avg_alignment_score
The fiber segmentation notebook does not make use of the cell level segmentations, and instead creates computes the fiber level segmentations from the raw image.
This one I'm not too sure about, as it doesn't look like there is a
X
equivalent here, so creating a uniqueAnnData
object may not be ideal. If we could,MuData
could be useful as it's a way to represent multi-modalAnnData
tables.Perhaps place it in
uns
? Let's place inuns
for this example.Neighborhood Analysis
The neighborhood analysis notebook generates three files of interest:
cell_table_size_normalized_cell_labels_kmeans_nh.csv
kmeans
neighborhood columnneighborhood_marker.csv
kmeans_neighborhood
,*var_names
(all the markers)neighborhood_cell_type.csv
kmeans_neighborhood
,*unique_clusters
(all the pixie clusters)The
neighborhood_marker.csv
could be stored invarm
. Here we've transposed it to fit theAnnData
spec.Similar to
X
we can viewvarm
as aDataFrame
by using theto_df()
method.The
neighborhood_cell_type.csv
could be stored inuns
due to it's shape.Final
AnnData
TableOur final
AnnData
table for FOV 0 now looks like this:Some other things to go over would be the Spatial LDA notebook, and I'm sure I've missed a few other metrics as well. But I think this gets an idea of the general structure and how we can convert our data to
AnnData
components. We can discuss the specifics of which parameters get placed where of course.One
AnnData
Table per Cohort or OneAnnData
Table per FOV?One of the main design decisions we have to make is whether to store all the data in one
AnnData
object, or to split it up into manyAnnData
objects.I am in favor of a single
AnnData
object per FOV. This is mainly because the majority ofScanPy
andSquidPy
functions tend to work best on an individual "image" by "image" basis.If we have a single
AnnData
object per FOV, the square shaped components might not make sense. For example, the takedistance
component ofobsp
which is a 2D array consisting of the distances between cell $i$ and cell $j$ for all cells in the FOV. If we have a singleAnnData
object, this matrix would take up a much larger footprint, and would require some rather obtuse indexing to get the distances for a single FOV.This will quickly take up $\mathcal{O} (n^2)$ space which is not feasible. The majority of it will be empty though and can be represented as a Block Diagonal Matrix.
$$\mathbf{D}_{cohort} = \begin{bmatrix} \mathbf{D}_0 & \mathbf{0} & \cdots & \mathbf{0} \ \mathbf{0} & \mathbf{D}_1 & \cdots & \mathbf{0} \ \vdots & \vdots & \ddots & \vdots \ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{D}_n \end{bmatrix}$$
The SPAIN cohort has about 400 FOVs. While taking advantage of sparsity and
Dask
can make the overhead of loading the data into memory more manageable. This kind of setup would (generously) upper bound the storage to $\mathcal{O} (n\log{n})$. This is completely feasible, but it can make working with the data more difficult.In addition,
SpatialData
is converting to a manyAnnData
table approach, where a table can be associated to 0,1 or many images / coordinate systems. See scverse/spatialdata#298 for the discussion on this.Does this mean we need to run a
for
loop every time we want to run a particular function across all FOV-levelAnnData
tables (such asscanpy.pp.filter_cells
)?No, we can use
AnnCollection
to do this. We can lazily concatenate theAnnData
objects and perform operations on them as if it was one bigAnnData
object, internally it'll load eachAnnData
object into memory as needed withDask
(done automatically for us) and map a function to it.Internally, the majority of functions in
Ark
load the whole cell table, but end up iterating by FOV anyway, so this can simplify some functions.As far as I know the benefits with one table per cohort start and end with having a single file stored to disk. I'm sure there are more, but I wasn't able to find them. Let me know if there are any others.
Storage: Saving and Loading
Let's take a quick detour to discuss storage.
For persistent storage we should save the
AnnData
table as aZarr
store.Zarr
provides a nice interface for storingAnnData
objects, as it's a hierarchical key-value store, which is exactly whatAnnData
is.Unfortunately, we will have to write to the
Zarr
store once we are done modifying theAnnData
table.SpatialData
provides a psuedo-"backed" feature to write modifications to disk immediately.Where
backed
is defined as the property where in memory changes are written to disk immediately.Zarr
backedAnnData
stores are sort of slowly in the works. See scverse/anndata#219.We can read
Zarr
backedAnnData
Tables usingAnnData.read_zarr()
.Lets' take a look at what the saved
Zarr
file saved at various depths:Depth 1 Hierarchy
```shell . ├── layers ├── obs ├── obsm ├── obsp ├── uns ├── var ├── varm ├── varp └── X ```Depth 2 Hierarchy
```shell . ├── layers │ └── arcsinh ├── obs │ ├── _index │ ├── area │ ├── cell_meta_cluster │ ├── cell_size │ ├── Centroid X │ ├── Centroid Y │ ├── centroid_dif │ ├── convex_area │ ├── convex_hull_resid │ ├── diversity_cell_meta_cluster │ ├── eccentricity │ ├── equivalent_diameter │ ├── fov │ ├── kmeans_neighborhood │ ├── label │ ├── major_axis_equiv_diam_ratio │ ├── major_axis_length │ ├── major_minor_axis_ratio │ ├── minor_axis_length │ ├── num_concavities │ ├── perim_square_over_area │ └── perimeter ├── obsm │ ├── cell_distances │ └── coordinates ├── obsp │ └── distance ├── uns │ ├── fiber_object │ ├── fiber_stats │ ├── mixing_scores │ ├── neighborhood_cell_type │ ├── neighborhood_counts │ ├── neighborhood_diversity │ └── neighborhood_freqs ├── var │ └── _index ├── varm │ └── kmeans ├── varp └── X └── 0.0 ```Depth 3 Hierarchy
```shell . ├── layers │ └── arcsinh │ └── 0.0 ├── obs │ ├── _index │ │ └── 0 │ ├── area │ │ └── 0 │ ├── cell_meta_cluster │ │ ├── categories │ │ └── codes │ ├── cell_size │ │ └── 0 │ ├── Centroid X │ │ └── 0 │ ├── Centroid Y │ │ └── 0 │ ├── centroid_dif │ │ └── 0 │ ├── convex_area │ │ └── 0 │ ├── convex_hull_resid │ │ └── 0 │ ├── diversity_cell_meta_cluster │ │ └── 0 │ ├── eccentricity │ │ └── 0 │ ├── equivalent_diameter │ │ └── 0 │ ├── fov │ │ ├── categories │ │ └── codes │ ├── kmeans_neighborhood │ │ └── 0 │ ├── label │ │ └── 0 │ ├── major_axis_equiv_diam_ratio │ │ └── 0 │ ├── major_axis_length │ │ └── 0 │ ├── major_minor_axis_ratio │ │ └── 0 │ ├── minor_axis_length │ │ └── 0 │ ├── num_concavities │ │ └── 0 │ ├── perim_square_over_area │ │ └── 0 │ └── perimeter │ └── 0 ├── obsm │ ├── cell_distances │ │ ├── _index │ │ ├── APC │ │ ├── Bcell │ │ ├── CD14_monocyte │ │ ├── CD4T │ │ ├── CD8T │ │ ├── endothelium │ │ ├── immune_other │ │ ├── M1_macrophage │ │ ├── M2_macrophage │ │ ├── Myofibroblast │ │ ├── other │ │ ├── stroma │ │ ├── tumor_ck17 │ │ └── tumor_ecad │ └── coordinates │ └── 0.0 ├── obsp │ └── distance │ ├── 0.0 │ ├── 0.1 │ ├── 1.0 │ └── 1.1 ├── uns │ ├── fiber_object │ │ ├── _index │ │ ├── alignment_score │ │ ├── area │ │ ├── centroid-0 │ │ ├── centroid-1 │ │ ├── eccentricity │ │ ├── euler_number │ │ ├── label │ │ ├── major_axis_length │ │ ├── minor_axis_length │ │ └── orientation │ ├── fiber_stats │ │ ├── avg_alignment_score │ │ ├── avg_area │ │ ├── avg_eccentricity │ │ ├── avg_euler_number │ │ ├── avg_major_axis_length │ │ ├── avg_minor_axis_length │ │ ├── avg_orientation │ │ ├── fiber_density │ │ └── pixel_density │ ├── mixing_scores │ │ └── 0 │ ├── neighborhood_cell_type │ │ ├── APC │ │ ├── Bcell │ │ ├── CD14_monocyte │ │ ├── CD4T │ │ ├── CD8T │ │ ├── endothelium │ │ ├── immune_other │ │ ├── kmeans_neighborhood │ │ ├── M1_macrophage │ │ ├── M2_macrophage │ │ ├── Myofibroblast │ │ ├── other │ │ ├── stroma │ │ ├── tumor_ck17 │ │ └── tumor_ecad │ ├── neighborhood_counts │ │ ├── _index │ │ ├── APC │ │ ├── Bcell │ │ ├── CD14_monocyte │ │ ├── CD4T │ │ ├── CD8T │ │ ├── endothelium │ │ ├── immune_other │ │ ├── M1_macrophage │ │ ├── M2_macrophage │ │ ├── Myofibroblast │ │ ├── other │ │ ├── stroma │ │ ├── tumor_ck17 │ │ └── tumor_ecad │ ├── neighborhood_diversity │ │ └── cell_meta_cluster │ └── neighborhood_freqs │ ├── _index │ ├── APC │ ├── Bcell │ ├── CD14_monocyte │ ├── CD4T │ ├── CD8T │ ├── endothelium │ ├── immune_other │ ├── M1_macrophage │ ├── M2_macrophage │ ├── Myofibroblast │ ├── other │ ├── stroma │ ├── tumor_ck17 │ └── tumor_ecad ├── var │ └── _index │ └── 0 ├── varm │ └── kmeans │ └── 0.0 ├── varp └── X └── 0.0 ```Misc Notes
The actual data, matrices,
DataFrames
, dictionaries and all are stored in the leaf nodes of the hierarchy as binary data. The rest of the nodes are just metadata, providing column names (if applicable) and other information, like the encoding type and encoding versioning.Not sure where we can save the
AnnData
objects, perhaps under a directory calledtables
in the root of the cohort's dataset?It's also very much worth considering making use of
SquidPy
's tool set for single cell spatial analysis. We should read through their code and learn how they interface withAnnData
when designing future components / redesigning current ones.