srivarra commented 1 year ago

`AnnData` Conversion Design Document Part 2

Relevant Background

The purpose of this design document is to provide guidelines and generate ideas for an in-depth implementation of AnnData. It differs from angelolab/ark-analysis#1073, which was a high-level overview of why we should adopt the format. Here we will discuss how we will transition to it.

Design Overview / Code Walkthrough

Currently, we have decided to transition to AnnData after Notebook 4, as it's a good point to get our feet wet without rewriting a large chunk of Ark.

Let's consider all the data associated with a single FOV from the Example Dataset, FOV 0.

FOV 0 is a $512px \times 512px$ image, consisting of $669$ individual cells.

We will walk through converting a majority of the data associated with FOV 0 to an AnnData table, as it's easiest to understand the conversion process by walking through a concrete example.

You will need to run the following notebooks beforehand:

Calculate Mixing Scores
Cell Neighbors Analysis
Example Fiber Segmentation
Example Neighborhood Analysis

Let's get the following packages installed:

AnnData
Dask
Ray (optional, only used here to provide a superior scheduler for Dask)

%pip install dask["distributed"] anndata

%pip install ray["serve"] # optional

Import packages, set up a local Dask cluster and get the example data.

from pathlib import Path
import pandas as pd
from anndata import AnnData
from dask import dataframe as dd
from dask.distributed import Client
from ray.util.dask import enable_dask_on_ray

enable_dask_on_ray() # optional

Client(n_workers=10, threads_per_worker=2)

from ark.utils.example_dataset import get_example_dataset

base_dir = Path("../data/example_dataset")

get_example_dataset(dataset="post_clustering", save_dir=base_dir, overwrite_existing=True)

Cell Table (Post Notebook 4)

The cell_table_size_normalized_cell_labels.csv file houses the following information.

For each cell we have the following types of metrics:

label: The unique segmentation label for each cell, (integer)
markers: The intensity of each marker for each cell, (float), specifically the following markers:
- CD14, CD163, CD20, CD3, CD31, CD4, CD45, CD68, CD8, CK17, Collagen1, ECAD, Fibronectin, GLUT1, H3K27me3, H3K9ac, HLADR, IDO, Ki67, PD1, SMA, Vim
region properties: This consists of physical metadata or measurements of each cell, (int, float), specifically the following:
- cell_size, area, eccentricity, major_axis_length, minor_axis_length, perimeter, convex_area, equivalent_diameter, centroid-0, centroid-1, major_minor_axis_ratio, perim_square_over_area, major_axis_equiv_diam_ratio, convex_hull_resid, centroid_dif, num_concavities
fov: A unique identifier for the FOV, (str)
- In this case it would be "fov0".
cell_meta_cluster: A classification for the cell (str), in the example dataset we have:
- CD4T, CD8T, CD14_monocyte, Bcell, other, M2_macrophage, immune_other, M1_macrophage, APC, stroma, endothelium, Myofibroblast, tumor_ck17, tumor_ecad

To convert these to AnnData components we have:

X: The values for the markers
obs: Houses region properties, fov, cell_meta_cluster, labels
var_names: The name of the markers (channel names)
obs_names: The cell label IDs

Since we also have transformations ($\tt{arcsinh}$) of the cell-by-markers matrix, we should add those in as an additional layer for X.

We will use Dask DataFrames to build up a computation graph and only execute it when needed.

# Load the size normalized cell table with post-pixie clusters
fov0_cell_table_n = dd.read_csv(
    base_dir / "segmentation/cell_table/cell_table_size_normalized_cell_labels.csv").query("fov == 'fov0'").compute()

# Load the arcsinh transformed cell table (no cluster information in this one from the example dataset)
fov0_cell_table_a = dd.read_csv(base_dir / "segmentation/cell_table/cell_table_arcsinh_transformed.csv").query("fov == 'fov0'").compute()

Subset the cell tables to only extract the channels for var.

var_names = ["CD14", "CD163", "CD20", "CD3", "CD31", "CD4", "CD45", "CD68", "CD8", "CK17", "Collagen1", "ECAD",
             "Fibronectin", "GLUT1", "H3K27me3", "H3K9ac", "HLADR", "IDO", "Ki67", "PD1", "SMA", "Vim"]

X_n = fov0_cell_table_n[var_names]
X_a = fov0_cell_table_a[var_names]

Subset the cell tables to only extract the properties for obs.

obs_cols = ["cell_size", "label", "area", "eccentricity", "major_axis_length",
            "minor_axis_length", "perimeter", "convex_area", "equivalent_diameter",
            "centroid-0", "centroid-1", "major_minor_axis_ratio",
            "perim_square_over_area", "major_axis_equiv_diam_ratio",
            "convex_hull_resid", "centroid_dif", "num_concavities", "fov",
            "cell_meta_cluster"]

obs = fov0_cell_table_n[obs_cols].copy()
obs["label"] = obs["label"].astype(int)

# String columns can come back as PyArrow-backed strings (Dask defaults to them when PyArrow is installed),
# and this causes issues when saving the AnnData object as a Zarr store.
# Casting the PyArrow string columns to `str` fixes it.
obs["cell_meta_cluster"] = obs["cell_meta_cluster"].astype(str)
obs["fov"] = obs["fov"].astype(str)

Make sure to use Categorical dtypes, as they provide very efficient indexing and sub-setting for, well, categorical properties.

obs["cell_meta_cluster"] = pd.Categorical(obs["cell_meta_cluster"])
obs_names = obs["label"].astype(int).to_list()
obs["cell_meta_cluster"].dtype
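To make the benefit of categoricals concrete, here is a minimal sketch on a toy column. The series `clusters` is made up for illustration (only the label names come from the example dataset); it compares the memory footprint of plain strings against the categorical representation and shows that sub-setting works the same way.

```python
import pandas as pd

# Toy stand-in for the `cell_meta_cluster` column (illustrative data only).
clusters = pd.Series(["CD4T", "CD8T", "CD4T", "Bcell", "CD4T", "CD8T"] * 1000)
as_cat = clusters.astype("category")

# A categorical stores each distinct label once, plus one small integer code per row.
print(list(as_cat.cat.categories))
print(as_cat.cat.codes.dtype)

# Compare the memory footprint of the two representations.
print(clusters.memory_usage(deep=True), as_cat.memory_usage(deep=True))

# Boolean sub-setting reads exactly like it does for a string column.
n_cd4t = int((as_cat == "CD4T").sum())
print(n_cd4t)
```

The savings grow with the number of rows, since each additional cell only costs one integer code.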

Here we create the AnnData object with X, obs and the $\tt arcsinh$ transformation of X. You might see an ImplicitModificationWarning, which is raised when AnnData converts the current index (usually integers) to strings.

The layers parameter is a mapping from a string key to a matrix of the same shape as X, representing the same data just transformed by an operation, linear or otherwise. You can have multiple layers as well.

fov0_adata = AnnData(X=X_n, obs=obs, layers={"arcsinh": X_a})

X is stored as a NumPy ndarray.

type(fov0_adata.X)

We can view X as a DataFrame by using the to_df() method.

fov0_adata.to_df()

Since we created that layer in the initialization of fov0_adata, we can also view its DataFrame representation.

fov0_adata.to_df(layer="arcsinh")

Additional Components for `SquidPy` Support

If we want to make use of SquidPy's Spatial Analysis tools, we need to add the following components to our AnnData object:

spatial_key: Several functions in SquidPy make use of this key to identify the spatial coordinates of each cell ($X$, $Y$). This key lives in obsm and is a 2D matrix of shape $n_{obs} \times 2$.
- For an example function see gr.spatial_neighbors

We will also rename the centroid-0 and centroid-1 columns to something more meaningful, to demonstrate modifying an AnnData object.

# An `obs` component is just a DataFrame
type(fov0_adata.obs)

fov0_adata.obs.rename(columns={"centroid-0": "Centroid X", "centroid-1": "Centroid Y" }, inplace=True)

# View the `obs` table with modified column names
fov0_adata.obs

spatial_x_y = fov0_adata.obs[["Centroid X", "Centroid Y"]]
fov0_adata.obsm["spatial"] = spatial_x_y

fov0_adata.obsm["spatial"].shape
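Since the spatial key is described as an $n_{obs} \times 2$ matrix, a conservative variant is to store a plain NumPy array rather than a DataFrame. The sketch below uses a made-up three-cell `obs_toy` table; whether a given SquidPy function also accepts a DataFrame here is not something this example checks.

```python
import pandas as pd

# Toy obs table with the renamed centroid columns (values are made up).
obs_toy = pd.DataFrame(
    {"Centroid X": [10.5, 200.0, 31.2], "Centroid Y": [4.0, 17.8, 499.1]},
    index=["1", "2", "3"],
)

# Convert the two coordinate columns to an (n_obs, 2) float array.
spatial = obs_toy[["Centroid X", "Centroid Y"]].to_numpy()
print(type(spatial), spatial.shape)
```

An array like this could then be assigned to `obsm["spatial"]` in the same way as above.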

Spatial Analysis Components

Here we will discuss how to convert many (not all) of the spatial analysis results to AnnData components.

Distance Matrices

The distance matrices get saved as a netcdf3 file, which can be loaded in with xarray.

import xarray as xr

fov0_distances_xr = xr.open_dataarray(base_dir / "spatial_analysis/dist_mats/fov0_dist_mat.xr",
                                      chunks="auto", chunked_array_type="dask")

fov0_distances_xr

The shape of the distance matrix for FOV 0 is $669 \times 669$, which we can generalize to n_obs $\times$ n_obs.

We can extract the underlying NumPy array from the xarray.DataArray with the .data attribute.

A quick sidenote on .data: it preserves the underlying array type, be it a Dask array, an in-memory NumPy array, a SciPy sparse array, or a CuPy GPU array.

obsp is a container to store pairwise annotations of observations, so, the pairwise distances should be stored here. We'll use the key "distance" for now to identify the distance matrix.

fov0_adata.obsp["distance"] = fov0_distances_xr.data

fov0_adata.obsp["distance"].compute()
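To illustrate the properties an `obsp` entry like this has, here is a toy pairwise Euclidean distance matrix built from three made-up centroids. This is only a sketch of the shape and symmetry; Ark's own distance matrices are loaded from disk as shown above, not computed this way.

```python
import numpy as np

# Three toy centroids (illustrative coordinates).
centroids = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Broadcast to all pairs (i, j) and take the Euclidean norm of each difference.
diff = centroids[:, None, :] - centroids[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

print(dist.shape)   # one row and one column per observation
print(dist)
```

The result is square (n_obs $\times$ n_obs), symmetric, and zero on the diagonal, which is exactly the kind of pairwise annotation `obsp` is meant for.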

We can view the current AnnData object with the newly added "distance" attribute for obsp.

fov0_adata

The neighborhood counts and frequencies are stored in the following files:

neighborhood_counts-cell_meta_cluster_radius50.csv
neighborhood_freqs-cell_meta_cluster_radius50.csv

We subset on those associated with FOV 0 and extract only the unique cell_meta_cluster labels. This gives us the following:

The shapes would be n_cell_meta_cluster $\times$ n_cell_meta_cluster which is square, but doesn't fit in with any of the other AnnData specs for storing pairwise information. These would be stored in uns.

neighborhood_counts = dd.read_csv(
    base_dir / "spatial_analysis/neighborhood_mats/neighborhood_counts-cell_meta_cluster_radius50.csv")

neighborhood_freqs = dd.read_csv(
    base_dir / "spatial_analysis/neighborhood_mats/neighborhood_freqs-cell_meta_cluster_radius50.csv")

fov0_nc = neighborhood_counts[neighborhood_counts["fov"] == "fov0"].compute()
fov0_nf = neighborhood_freqs[neighborhood_freqs["fov"] == "fov0"].compute()

unique_clusters = fov0_adata.obs["cell_meta_cluster"].unique()

fov0_adata.uns["neighborhood_counts"] = fov0_nc[unique_clusters]
fov0_adata.uns["neighborhood_freqs"] = fov0_nf[unique_clusters]

fov0_adata

Cell Neighbors Analysis

The neighborhood diversity would be ideal for obs since there is a value per cell for each FOV. This would be an additional column. We can perform an outer merge with obs DataFrame and the neighborhood_diversity_cell_meta_cluster_radius50.csv DataFrame per FOV, on the label column.

neighborhood_diversity = dd.read_csv(
    base_dir / "spatial_analysis/cell_neighbor_analysis/neighborhood_diversity_cell_meta_cluster_radius50.csv")

Let's just select the diversity for FOV 0.

neighborhood_diversity_fov0 = neighborhood_diversity[neighborhood_diversity["fov"] == "fov0"].compute()

neighborhood_diversity_fov0["label"] = neighborhood_diversity_fov0["label"].astype(int)

# Merge the neighborhood diversity data with the `obs` table.
fov0_adata.obs = fov0_adata.obs.merge(right = neighborhood_diversity_fov0[["label", "diversity_cell_meta_cluster"]], on = ["label"], how="outer")

fov0_adata.obs_names

neighborhood_diversity_fov0

fov0_adata.obs
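One thing to watch with merges is that `DataFrame.merge` returns a fresh integer index, and an outer merge can add rows if the labels on the two sides differ. The sketch below shows an alternative on toy tables: a left merge with `validate="one_to_one"`, followed by restoring the original index. This is a suggestion under the assumption that every cell should keep exactly one row, not what the walkthrough above does.

```python
import pandas as pd

# Toy obs table with a string index, as AnnData expects.
obs_toy = pd.DataFrame(
    {"label": [1, 2, 3], "cell_meta_cluster": ["CD4T", "CD8T", "Bcell"]},
    index=["1", "2", "3"],
)

# Toy diversity table, deliberately in a different row order.
diversity_toy = pd.DataFrame(
    {"label": [3, 1, 2], "diversity_cell_meta_cluster": [0.9, 0.1, 0.5]}
)

# A left merge keeps one row per cell in obs order; validate raises on duplicate labels.
merged = obs_toy.merge(diversity_toy, on="label", how="left", validate="one_to_one")

# merge() resets the index, so put the original obs index back.
merged.index = obs_toy.index
print(merged)
```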

uns is a catch-all container (dictionary) for unstructured annotations of the object. It's a good place to store parameters, metadata, and other information that doesn't fit into the other components of AnnData. In this case, we'll store the radius value for the neighborhood analysis.

fov0_adata.uns["neighborhood_diversity"] = {
    "cell_meta_cluster": {
        "diversity_radius": 50
    }
}

The cell distances in cell_meta_cluster_avg_dists-nearest_5.csv contain the average distance to the $k$ closest cells (here $k = 5$). The shape of this is n_obs $\times$ n_cell_meta_cluster. This shape fits with obsm, so we can store it there or in uns.

While in genomic analyses users generally place embeddings of their X matrix in obsm, we can place the cell distances here as well. Anything goes as long as the shapes fit; we just have to make sure to document it.

I have placed the cell_meta_cluster_avg_dists-nearest_5 in obsm["cell_distances"].

cell_meta_cluster_avg_dists_nearest5 = dd.read_csv(base_dir / "spatial_analysis/cell_neighbor_analysis/cell_meta_cluster_avg_dists-nearest_5.csv")
cell_meta_cluster_avg_dists_nearest5_fov0 = cell_meta_cluster_avg_dists_nearest5[cell_meta_cluster_avg_dists_nearest5["fov"] == "fov0"].compute()

fov0_adata.obsm["cell_distances"] = cell_meta_cluster_avg_dists_nearest5_fov0[unique_clusters]

fov0_adata.obsm["cell_distances"]

Mixing Scores

The mixing scores are stored in the following files:

{population 1}_{population 2}-{mixing type}_mixing_score.csv. With the example dataset, this is CD4_CD8-homogeneous_mixing_score.csv.

Just like the others, we will subset on just FOV 0.

This file has the columns: fov, mixing_score, cell_count, cell_ratio, and a row per FOV.

There are a few ways to add this to the AnnData object. The first is to add the mixing_score, cell_count, cell_ratio, population 1 and population 2 into uns, as shown below:

cd4_cd8_homogeneous_mixing_score_fov0 = dd.read_csv(base_dir / "spatial_analysis/mixing_score/CD4_CD8-homogeneous_mixing_score.csv").query("fov == 'fov0'").compute()

cd4_cd8_homogeneous_mixing_score_fov0

fov0_adata.uns["mixing_scores"] = {
    "0": { # The "first" mixing score
        "pop1": "CD4",
        "pop2": "CD8",
        "homogeneous": {
            "mixing_score": cd4_cd8_homogeneous_mixing_score_fov0["mixing_score"].values[0],
            "cell_count": cd4_cd8_homogeneous_mixing_score_fov0["cell_count"].values[0],
            "cell_ratio": cd4_cd8_homogeneous_mixing_score_fov0["cell_ratio"].values[0]
        }
    }
}

fov0_adata.uns["mixing_scores"]

Another approach would be to transform the mixing scores into matrices, one for each FOV.

For example, here is the mixing score matrix for FOV 0, where (CD4T, CD8T) has been filled.

| | CD4T | CD8T | CD14_monocyte | Bcell | other | M2_macrophage | immune_other | M1_macrophage | APC | stroma | endothelium | Myofibroblast | tumor_ck17 | tumor_ecad |
|:-------------:|----------|------|---------------|-------|-------|---------------|--------------|---------------|-----|--------|-------------|---------------|------------|------------|
| CD4T | | | | | | | | | | | | | | |
| CD8T | 0.256647 | | | | | | | | | | | | | |
| CD14_monocyte | | | | | | | | | | | | | | |
| Bcell | | | | | | | | | | | | | | |
| other | | | | | | | | | | | | | | |
| M2_macrophage | | | | | | | | | | | | | | |
| immune_other | | | | | | | | | | | | | | |
| M1_macrophage | | | | | | | | | | | | | | |
| APC | | | | | | | | | | | | | | |
| stroma | | | | | | | | | | | | | | |
| endothelium | | | | | | | | | | | | | | |
| Myofibroblast | | | | | | | | | | | | | | |
| tumor_ck17 | | | | | | | | | | | | | | |
| tumor_ecad | | | | | | | | | | | | | | |

And fill out all the values for each FOV. This would be stored in uns as well. We would also store matrices like these for cell_count and cell_ratio fields.
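The matrix approach could be sketched like this: an empty cluster-by-cluster DataFrame, with the one known FOV 0 score filled in. The name `mixing_matrix` is hypothetical, and placing the CD4/CD8 score at row CD8T, column CD4T simply mirrors the table above.

```python
import numpy as np
import pandas as pd

clusters = [
    "CD4T", "CD8T", "CD14_monocyte", "Bcell", "other", "M2_macrophage",
    "immune_other", "M1_macrophage", "APC", "stroma", "endothelium",
    "Myofibroblast", "tumor_ck17", "tumor_ecad",
]

# Start with an all-NaN square matrix indexed by cluster name on both axes.
mixing_matrix = pd.DataFrame(np.nan, index=clusters, columns=clusters)

# Fill in the single pair computed in the example dataset.
mixing_matrix.loc["CD8T", "CD4T"] = 0.256647

print(mixing_matrix.loc[["CD4T", "CD8T"], ["CD4T", "CD8T"]])
```

Matrices for cell_count and cell_ratio would follow the same pattern, one DataFrame per field.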

For now, I've stored it in uns using the first, nested dictionary approach.

Fiber Segmentation

The fiber segmentation generates two files of interest:

fiber_object_table.csv
1. Columns include: fov, label, centroid-0, centroid-1, major_axis_length, minor_axis_length, orientation, area, eccentricity, euler_number, alignment_score
fiber_stats_table.csv
1. Columns include: fov, pixel_density, fiber_density, avg_major_axis_length, avg_minor_axis_length, avg_orientation, avg_area, avg_eccentricity, avg_euler_number, avg_alignment_score

The fiber segmentation notebook does not make use of the cell level segmentations, and instead computes the fiber level segmentations from the raw image.

This one I'm not too sure about, as it doesn't look like there is an X equivalent here, so creating a unique AnnData object may not be ideal. If we could, MuData could be useful as it's a way to represent multi-modal AnnData tables.

Perhaps place it in uns? Let's place in uns for this example.

fiber_object_table_fov0 = dd.read_csv(base_dir / "fiber_segmentation_processed_data/fiber_object_table.csv").query("fov == 'fov0'").compute()
# rename() returns a new DataFrame, so assign the result back
fiber_object_table_fov0 = fiber_object_table_fov0.rename(columns={"centroid-0": "Centroid X", "centroid-1": "Centroid Y"})

fiber_stats_table_fov0 = dd.read_csv(base_dir / "fiber_segmentation_processed_data/fiber_stats_table.csv").query("fov == 'fov0'").compute()

fiber_object_table_fov0.drop("fov", axis = 1, inplace=True)

fov0_adata.uns["fiber_object"] = fiber_object_table_fov0

fov0_adata.uns["fiber_stats"] = {
    "pixel_density": fiber_stats_table_fov0["pixel_density"].values[0],
    "fiber_density": fiber_stats_table_fov0["fiber_density"].values[0],
    "avg_major_axis_length": fiber_stats_table_fov0["avg_major_axis_length"].values[0],
    "avg_minor_axis_length": fiber_stats_table_fov0["avg_minor_axis_length"].values[0],
    "avg_orientation": fiber_stats_table_fov0["avg_orientation"].values[0],
    "avg_area": fiber_stats_table_fov0["avg_area"].values[0],
    "avg_eccentricity": fiber_stats_table_fov0["avg_eccentricity"].values[0],
    "avg_euler_number": fiber_stats_table_fov0["avg_euler_number"].values[0],
    "avg_alignment_score": fiber_stats_table_fov0["avg_alignment_score"].values[0],
}
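Since fiber_stats_table.csv has one row per FOV, the same dictionary can also be built without listing every column by hand. The sketch below uses a toy one-row table with made-up values and only three of the statistics; the pattern is the point, not the numbers.

```python
import pandas as pd

# Toy one-row stand-in for the FOV 0 slice of fiber_stats_table.csv (values are made up).
fiber_stats_toy = pd.DataFrame(
    {"fov": ["fov0"], "pixel_density": [0.12], "fiber_density": [0.03], "avg_area": [85.0]}
)

# Take the single row, drop the identifier, and convert the rest to plain floats.
row = fiber_stats_toy.iloc[0].drop("fov")
fiber_stats = {name: float(value) for name, value in row.items()}

print(fiber_stats)
```

This keeps the dictionary in sync with the table if columns are added later.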

Neighborhood Analysis

The neighborhood analysis notebook generates three files of interest:

cell_table_size_normalized_cell_labels_kmeans_nh.csv
1. Columns include: the normalized cell table columns, in addition to the kmeans neighborhood column
neighborhood_marker.csv
1. Columns include: kmeans_neighborhood, *var_names (all the markers)
neighborhood_cell_type.csv
1. Columns include: kmeans_neighborhood, *unique_clusters (all the pixie clusters)

cell_table_kmeans_fov0 = dd.read_csv(base_dir / "spatial_analysis/neighborhood_analysis/cell_meta_cluster_radius50_counts/cell_table_size_normalized_cell_labels_kmeans_nh.csv").query("fov == 'fov0'").compute()
cell_table_kmeans_fov0[["label", "kmeans_neighborhood"]] = cell_table_kmeans_fov0[["label", "kmeans_neighborhood"]].astype(int)

# Merge the k-means neighborhood assignments into the `obs` table, using only the label and kmeans_neighborhood columns
fov0_adata.obs = fov0_adata.obs.merge(right = cell_table_kmeans_fov0[["label", "kmeans_neighborhood"]], on = ["label"], how="outer")

The neighborhood_marker.csv could be stored in varm. Here we've transposed it to fit the AnnData spec.

neighborhood_marker = dd.read_csv(base_dir / "spatial_analysis/neighborhood_analysis/cell_meta_cluster_radius50_counts/neighborhood_marker.csv").compute()

neighborhood_marker["kmeans_neighborhood"] = neighborhood_marker["kmeans_neighborhood"].astype(str)

neighborhood_marker.set_index("kmeans_neighborhood", inplace=True)

fov0_adata.varm["kmeans_neighborhood"] = neighborhood_marker.values.T
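The transpose matters because `varm` entries must have one row per variable. Here is a toy sketch of the shapes involved, with two made-up neighborhoods and three markers. Selecting the columns by marker name before transposing is an extra safeguard added here (an assumption that the CSV column order might not match var_names), not something the code above does.

```python
import numpy as np
import pandas as pd

markers = ["CD4", "CD8", "CD45"]   # stands in for var_names

# Toy neighborhood-by-marker table: 2 neighborhoods x 3 markers.
neighborhood_marker_toy = pd.DataFrame(
    np.arange(6, dtype=float).reshape(2, 3),
    index=pd.Index(["1", "2"], name="kmeans_neighborhood"),
    columns=markers,
)

# Order the columns like var_names, then transpose to (n_vars, n_neighborhoods).
varm_entry = neighborhood_marker_toy[markers].to_numpy().T

print(neighborhood_marker_toy.shape, "->", varm_entry.shape)
```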

Similar to X we can view varm as a DataFrame by using the to_df() method.

fov0_adata.varm.to_df()

The neighborhood_cell_type.csv could be stored in uns due to its shape.

neighborhood_cell_type = dd.read_csv(base_dir / "spatial_analysis/neighborhood_analysis/cell_meta_cluster_radius50_counts/neighborhood_cell_type.csv").compute()
neighborhood_cell_type["kmeans_neighborhood"] = neighborhood_cell_type["kmeans_neighborhood"].astype(str)

# Convert the kmeans neighborhood column to an index
neighborhood_cell_type.set_index("kmeans_neighborhood", inplace=True)

fov0_adata.uns["neighborhood_cell_type"] = neighborhood_cell_type

fov0_adata

Final `AnnData` Table

Our final AnnData table for FOV 0 now looks like this:

AnnData object with n_obs × n_vars = 669 × 22
    obs: 'cell_size', 'label', 'area', 'eccentricity', 'major_axis_length', 'minor_axis_length', 'perimeter', 'convex_area', 'equivalent_diameter', 'Centroid X', 'Centroid Y', 'major_minor_axis_ratio', 'perim_square_over_area', 'major_axis_equiv_diam_ratio', 'convex_hull_resid', 'centroid_dif', 'num_concavities', 'fov', 'cell_meta_cluster', 'diversity_cell_meta_cluster', 'kmeans_neighborhood'
    uns: 'neighborhood_counts', 'neighborhood_freqs', 'neighborhood_diversity', 'mixing_scores', 'fiber_object', 'fiber_stats', 'neighborhood_cell_type'
    obsm: 'spatial', 'cell_distances'
    varm: 'kmeans_neighborhood'
    layers: 'arcsinh'
    obsp: 'distance'

Some other things to go over would be the Spatial LDA notebook, and I'm sure I've missed a few other metrics as well. But I think this gives an idea of the general structure and how we can convert our data to AnnData components. We can of course discuss the specifics of which parameters get placed where.

One `AnnData` Table per Cohort or One `AnnData` Table per FOV?

One of the main design decisions we have to make is whether to store all the data in one AnnData object, or to split it up into many AnnData objects.

I am in favor of a single AnnData object per FOV. This is mainly because the majority of ScanPy and SquidPy functions tend to work best on an individual "image" by "image" basis.

If we have a single AnnData object per cohort, the square-shaped components might not make sense. For example, take the distance component of obsp, which is a 2D array consisting of the distances between cell $i$ and cell $j$ for all cells in the FOV. With a single cohort-wide AnnData object, this matrix would take up a much larger footprint, and would require some rather awkward indexing to get the distances for a single FOV.

[Figure: D matrix]

This will quickly take up $\mathcal{O} (n^2)$ space which is not feasible. The majority of it will be empty though and can be represented as a Block Diagonal Matrix.

$$\mathbf{D}_{cohort} = \begin{bmatrix} \mathbf{D}_0 & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{D}_1 & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{D}_n \end{bmatrix}$$

The SPAIN cohort has about 400 FOVs. Taking advantage of sparsity and Dask can make the overhead of loading the data into memory more manageable, and this kind of setup would (generously) upper bound the storage at $\mathcal{O} (n\log{n})$. This is completely feasible, but it can make working with the data more difficult.
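As a rough back-of-the-envelope sketch of the footprint, the arithmetic below compares a dense cohort-wide distance matrix with storing only the per-FOV diagonal blocks. It assumes, purely for illustration, that every one of the 400 FOVs has 669 cells like FOV 0 and that distances are stored as 8-byte floats; real FOVs will vary.

```python
n_fovs = 400          # approximate cohort size quoted above
cells_per_fov = 669   # assumption: every FOV has as many cells as FOV 0
bytes_per_entry = 8   # assumption: float64 distances

n_cells = n_fovs * cells_per_fov

# Dense cohort matrix: one entry for every pair of cells in the cohort.
dense_entries = n_cells ** 2

# Block-diagonal storage: only the within-FOV pairs are kept.
block_entries = n_fovs * cells_per_fov ** 2

ratio = dense_entries // block_entries

print(f"dense:  {dense_entries * bytes_per_entry / 1e9:.1f} GB")
print(f"blocks: {block_entries * bytes_per_entry / 1e9:.2f} GB")
print(f"ratio:  {ratio}x")
```

With equally sized FOVs the ratio is simply the number of FOVs, which is why the per-FOV layout scales so much better.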

In addition, SpatialData is moving to a many-AnnData-table approach, where a table can be associated with 0, 1, or many images / coordinate systems. See scverse/spatialdata#298 for the discussion on this.

Does this mean we need to run a for loop every time we want to run a particular function across all FOV-level AnnData tables (such as scanpy.pp.filter_cells)?

No, we can use AnnCollection to do this. We can lazily concatenate the AnnData objects and perform operations on them as if they were one big AnnData object. Internally, it reads from each underlying AnnData object only as needed and maps a function over it.

Internally, the majority of functions in Ark load the whole cell table, but end up iterating by FOV anyway, so this can simplify some functions.

As far as I know the benefits with one table per cohort start and end with having a single file stored to disk. I'm sure there are more, but I wasn't able to find them. Let me know if there are any others.

Storage: Saving and Loading

Let's take a quick detour to discuss storage.

For persistent storage we should save the AnnData table as a Zarr store.

Zarr provides a nice interface for storing AnnData objects, as it's a hierarchical key-value store, which is exactly what AnnData is.

Unfortunately, we have to explicitly write to the Zarr store once we are done modifying the AnnData table. SpatialData provides a pseudo-"backed" feature to write modifications to disk immediately.

Here, "backed" means that in-memory changes are written to disk immediately.

Zarr backed AnnData stores are sort of slowly in the works. See scverse/anndata#219.

Path(base_dir / "tables/").mkdir(parents=True, exist_ok=True)

#`obs_names` must be a string, or else you won't be able to read it back in
fov0_adata.obs_names = fov0_adata.obs_names.astype(str) 
fov0_adata.write_zarr(base_dir / "tables/fov0.ome.zarr", chunks=(1000, 1000))

We can read AnnData tables saved as Zarr stores using anndata.read_zarr().

import anndata

anndata.read_zarr(base_dir / "tables/fov0.ome.zarr")

Let's take a look at the saved Zarr store at various depths:

Depth 1 Hierarchy

```shell
.
├── layers
├── obs
├── obsm
├── obsp
├── uns
├── var
├── varm
├── varp
└── X
```

Depth 2 Hierarchy

```shell
.
├── layers
│   └── arcsinh
├── obs
│   ├── _index
│   ├── area
│   ├── cell_meta_cluster
│   ├── cell_size
│   ├── Centroid X
│   ├── Centroid Y
│   ├── centroid_dif
│   ├── convex_area
│   ├── convex_hull_resid
│   ├── diversity_cell_meta_cluster
│   ├── eccentricity
│   ├── equivalent_diameter
│   ├── fov
│   ├── kmeans_neighborhood
│   ├── label
│   ├── major_axis_equiv_diam_ratio
│   ├── major_axis_length
│   ├── major_minor_axis_ratio
│   ├── minor_axis_length
│   ├── num_concavities
│   ├── perim_square_over_area
│   └── perimeter
├── obsm
│   ├── cell_distances
│   └── coordinates
├── obsp
│   └── distance
├── uns
│   ├── fiber_object
│   ├── fiber_stats
│   ├── mixing_scores
│   ├── neighborhood_cell_type
│   ├── neighborhood_counts
│   ├── neighborhood_diversity
│   └── neighborhood_freqs
├── var
│   └── _index
├── varm
│   └── kmeans
├── varp
└── X
    └── 0.0
```

Depth 3 Hierarchy

```shell
.
├── layers
│   └── arcsinh
│       └── 0.0
├── obs
│   ├── _index
│   │   └── 0
│   ├── area
│   │   └── 0
│   ├── cell_meta_cluster
│   │   ├── categories
│   │   └── codes
│   ├── cell_size
│   │   └── 0
│   ├── Centroid X
│   │   └── 0
│   ├── Centroid Y
│   │   └── 0
│   ├── centroid_dif
│   │   └── 0
│   ├── convex_area
│   │   └── 0
│   ├── convex_hull_resid
│   │   └── 0
│   ├── diversity_cell_meta_cluster
│   │   └── 0
│   ├── eccentricity
│   │   └── 0
│   ├── equivalent_diameter
│   │   └── 0
│   ├── fov
│   │   ├── categories
│   │   └── codes
│   ├── kmeans_neighborhood
│   │   └── 0
│   ├── label
│   │   └── 0
│   ├── major_axis_equiv_diam_ratio
│   │   └── 0
│   ├── major_axis_length
│   │   └── 0
│   ├── major_minor_axis_ratio
│   │   └── 0
│   ├── minor_axis_length
│   │   └── 0
│   ├── num_concavities
│   │   └── 0
│   ├── perim_square_over_area
│   │   └── 0
│   └── perimeter
│       └── 0
├── obsm
│   ├── cell_distances
│   │   ├── _index
│   │   ├── APC
│   │   ├── Bcell
│   │   ├── CD14_monocyte
│   │   ├── CD4T
│   │   ├── CD8T
│   │   ├── endothelium
│   │   ├── immune_other
│   │   ├── M1_macrophage
│   │   ├── M2_macrophage
│   │   ├── Myofibroblast
│   │   ├── other
│   │   ├── stroma
│   │   ├── tumor_ck17
│   │   └── tumor_ecad
│   └── coordinates
│       └── 0.0
├── obsp
│   └── distance
│       ├── 0.0
│       ├── 0.1
│       ├── 1.0
│       └── 1.1
├── uns
│   ├── fiber_object
│   │   ├── _index
│   │   ├── alignment_score
│   │   ├── area
│   │   ├── centroid-0
│   │   ├── centroid-1
│   │   ├── eccentricity
│   │   ├── euler_number
│   │   ├── label
│   │   ├── major_axis_length
│   │   ├── minor_axis_length
│   │   └── orientation
│   ├── fiber_stats
│   │   ├── avg_alignment_score
│   │   ├── avg_area
│   │   ├── avg_eccentricity
│   │   ├── avg_euler_number
│   │   ├── avg_major_axis_length
│   │   ├── avg_minor_axis_length
│   │   ├── avg_orientation
│   │   ├── fiber_density
│   │   └── pixel_density
│   ├── mixing_scores
│   │   └── 0
│   ├── neighborhood_cell_type
│   │   ├── APC
│   │   ├── Bcell
│   │   ├── CD14_monocyte
│   │   ├── CD4T
│   │   ├── CD8T
│   │   ├── endothelium
│   │   ├── immune_other
│   │   ├── kmeans_neighborhood
│   │   ├── M1_macrophage
│   │   ├── M2_macrophage
│   │   ├── Myofibroblast
│   │   ├── other
│   │   ├── stroma
│   │   ├── tumor_ck17
│   │   └── tumor_ecad
│   ├── neighborhood_counts
│   │   ├── _index
│   │   ├── APC
│   │   ├── Bcell
│   │   ├── CD14_monocyte
│   │   ├── CD4T
│   │   ├── CD8T
│   │   ├── endothelium
│   │   ├── immune_other
│   │   ├── M1_macrophage
│   │   ├── M2_macrophage
│   │   ├── Myofibroblast
│   │   ├── other
│   │   ├── stroma
│   │   ├── tumor_ck17
│   │   └── tumor_ecad
│   ├── neighborhood_diversity
│   │   └── cell_meta_cluster
│   └── neighborhood_freqs
│       ├── _index
│       ├── APC
│       ├── Bcell
│       ├── CD14_monocyte
│       ├── CD4T
│       ├── CD8T
│       ├── endothelium
│       ├── immune_other
│       ├── M1_macrophage
│       ├── M2_macrophage
│       ├── Myofibroblast
│       ├── other
│       ├── stroma
│       ├── tumor_ck17
│       └── tumor_ecad
├── var
│   └── _index
│       └── 0
├── varm
│   └── kmeans
│       └── 0.0
├── varp
└── X
    └── 0.0
```

Misc Notes

The actual data, matrices, DataFrames, dictionaries and all are stored in the leaf nodes of the hierarchy as binary data. The rest of the nodes are just metadata, providing column names (if applicable) and other information, like the encoding type and encoding versioning.

Not sure where we can save the AnnData objects, perhaps under a directory called tables in the root of the cohort's dataset?

It's also very much worth considering making use of SquidPy's tool set for single cell spatial analysis. We should read through their code and learn how they interface with AnnData when designing future components / redesigning current ones.

srivarra commented 1 year ago

See #1079 for the notebook where you can run it locally.

srivarra commented 1 year ago

@alex-l-kong @camisowers @jranek @ngreenwald Take a look, run the notebook and let me know your guys' thoughts!

ngreenwald commented 1 year ago

Great, let's discuss today to get everyone's thoughts
