angelolab / ark-analysis

Integrated pipeline for multiplexed image analysis
https://ark-analysis.readthedocs.io/en/latest/
MIT License
68 stars 25 forks source link

AnnData spatial analysis part 1 #1105

Open camisowers opened 5 months ago

camisowers commented 5 months ago

This is for internal use only; if you'd like to open an issue or request a new feature, please open a bug or enhancement issue

Relevant background

Now that we have the AnnData conversion notebook, we can start implementing AnnData with some of the spatial analysis functions.

Design overview

We have two options:

Things which will be adapted for AnnData compatibility in this new PR:

  1. Distance matrices calculation
  2. Neighborhood matrices calculation (will save to both the existing AnnData table and as individual csv)
  3. Cell Distances analysis

Since all the spatial notebooks (except cell distances) only take the neighborhood mats as inputs, these changes will not break any other notebooks if we choose to do a hard fork. ** Note: This will break the spatial enrichment code we have in ark, but I'm pretty confident we can just scrap the old stuff and build on squidpy's nhood_enrichment function.

Code mockup I can write up more specific pseudocode when we decide between a hard or soft fork, but for now here's which functions will need to be altered regardless.

1. Distance matrices calculation

Currently we load in the segmentations masks to get the labels and centroids of the cells by running regionprops(). This doesn't make much sense in our pipeline anymore since this is already done in the segmentation notebook and the centroids are stored in the cell table; I think moving forward we can skip loading in the segmentation masks and re-running regionprops (it's very time consuming using the NAS and also redundant).

Also I will probably do some efficiency testing to find a replacement for cdist() that creates a sparse matrix rather than computing ever pairwise distance between cells.

Adjusted functions: calc_dist_matrix - adjust for AnnData

Soft fork example:

def calc_dist_matrix(cell_table, save_path, anndata_dir=None):
    """Generate matrix of distances between center of pairs of cells.

    Saves each one individually to `save_path`.

    Args:
        anndata_dir (str):
           path where the AnnData object is stored
    """

    if anndata_dir:
       fov_names = list_files(anndata_dir, substr=".zarr")

    else:
       # check that both label_dir and save_path exist
       io_utils.validate_paths([label_dir, save_path])

       fov_names = cell_table.fov.unique()

    # iterate for each fov
    with tqdm(total=len(fov_names), desc="Distance Matrix Generation", unit="FOVs") \
            as dist_mat_progress:
        for fov in fov_names:

            if anndata_dir:
               fov_adata = anndata.read_zarr(os.path.join(anndata_dir, f"{fov}.zarr"))
               centroids = fov_adata.obsm["spatial"]
            else:
                fov_data = cell_table[cell_table["fov"] == fov]
                fov_data = [fov_data.centroid-0, fov_data.centroid-1]

            # generate the distance matrix, then assign centroid_labels as coords
            dist_matrix = cdist(centroids, centroids).astype(np.float32)
            dist_mat_xarr = xr.DataArray(dist_matrix, coords=[centroid_labels, centroid_labels])

            # save distances to AnnData table
            if anndata_dir:
               fov_adata.obsp["distance"] = dist_mat_xarr
               fov_adata.write_zarr(os.path.join(anndata_dir, f"{fov}.zarr"), chunks=(1000, 1000))

            # save the distance matrix to save_path
            dist_mat_xarr.to_netcdf(
                os.path.join(save_path, fov_name + '_dist_mat.xr'),
                format='NETCDF3_64BIT'
            )

            dist_mat_progress.update(1)

Hard fork example:

def calc_dist_matrix(anndata_dir):
    """Generate matrix of distances between center of pairs of cells.

    Args:
        anndata_dir (str):
           path where the AnnData object is stored
    """

    # load fov names
    fov_names = list_files(anndata_dir, substr=".zarr")

    # iterate for each fov
    with tqdm(total=len(fov_names), desc="Distance Matrix Generation", unit="FOVs") \
            as dist_mat_progress:
        for fov in fov_names:

           fov_adata = anndata.read_zarr(os.path.join(anndata_dir, f"{fov}.zarr"))
           centroids = fov_adata.obsm["spatial"]

           # generate the distance matrix, then assign centroid_labels as coords
           dist_matrix = cdist(centroids, centroids).astype(np.float32)
           dist_mat_xarr = xr.DataArray(dist_matrix, coords=[centroid_labels, centroid_labels])

           # save distances to AnnData table
           fov_adata.obsp["distance"] = dist_mat_xarr
           fov_adata.write_zarr(os.path.join(anndata_dir, f"{fov}.zarr"), chunks=(1000, 1000))

           dist_mat_progress.update(1)

2. Neighborhood matrices calculation Adjusted functions: create_neighborhood_matrix

3. Cell Distances analysis Adjusted functions: generate_cell_distance_analysis possibly calculate_mean_distance_to_all_cell_types andcalculate_mean_distance_to_cell_type as well

Required inputs

Cell table / AnnData table.

Output files

Since no one seems to really open or use the saved distance matrix xarrays anyways, I think we can just append them to the AnnData table and save some storage space.

The neighborhood mats and cell distance results will be saved to both the AnnData table and as individual csvs.

Timeline Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.

Estimated date when a fully implemented version will be ready for review: 1/19

Estimated date when the finalized project will be merged in: 1/23

camisowers commented 5 months ago

@srivarra @alex-l-kong @ngreenwald @jranek Lmk your thoughts. I'm leaning towards a hard switch since we just released v0.7.0 of ark and anyone super committed to staying away from AnnData can always use that version.

ngreenwald commented 5 months ago

I agree, I think it makes sense to just switch. Everything else looks good.

srivarra commented 5 months ago

Looks good to me!

alex-l-kong commented 5 months ago

Looks good