spatialdata writers for SPARCStools implemented in https://github.com/MannLabs/SPARCStools/commit/13301fa85d60ad787121ee70070b18cfdd90900e
segmentation workflow is functional for sharded and regular segmentations as of this commit: https://github.com/MannLabs/SPARCSspatial/commit/523f142389c7ff1e60aae3e8004b3cc6864bc950
Small benchmarking comparison on the same dataset (relatively small 2000x2000 px)
Method | Sharded | Regular |
---|---|---|
sparcsspatial | 38.6 s ± 76.6 ms | 30.4 s ± 313 ms |
SPARCSpy | 27.3 s ± 1.29 s | 29.7 s ± 570 ms |
Sharding generates some overhead in the new sparcsspatial implementation compared to non-sharded processing but is required for larger-than-memory computations.
This overhead seems to be larger than in the original SPARCSpy implementation. This is probably a result of a suboptimal implementation of the sharding resolution currently found in the working version.
We first generate a memory-mapped temp array into which all shard results are aggregated. Then we transfer this array to the sdata object. This workaround was implemented because I did not find a way to update a mask in the sdata object in iterative steps while keeping it backed on disk. Maybe a solution exists that I have not found yet that would allow us to circumvent this additional step and make the process more efficient.
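As a rough illustration of this workaround (a minimal sketch only, not the actual SPARCSspatial code; the temp-file path, mosaic shape, and the `shard_results` iterable are placeholders):

```python
import numpy as np
import spatialdata as spdata
from spatialdata.models import Labels2DModel

# memory-mapped temp array covering the full mosaic (shape is a placeholder)
full_mask = np.memmap("temp_mask.dat", dtype=np.uint32, mode="w+", shape=(10000, 10000))

# aggregate the per-shard segmentation results into the temp array
for (y_slice, x_slice), shard_mask in shard_results:  # placeholder iterable of shard outputs
    full_mask[y_slice, x_slice] = shard_mask
full_mask.flush()

# only once the full mask is assembled, transfer it into the sdata object
sdata = spdata.SpatialData()
sdata.write("project.sdata", overwrite=True)
sdata.labels["seg_mask"] = Labels2DModel.parse(np.asarray(full_mask), dims=["y", "x"])
sdata.write_element("seg_mask")
```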
extraction workflow is functional for single-threaded and multi-threaded processes as of this commit: dd3ee0f7b5b37fdb8a219f011cf445c9ab36446b
Small Benchmarking comparison on the same dataset (relatively small 2000x2000 px, total of 683 unique cells to extract)
Method | Threads = 1 | Threads = 2 | Threads = 3 |
---|---|---|---|
sparcsspatial | 29.4 s ± 2.48 s per loop | 19.8 s ± 1.89 s per loop | 15.3 s ± 1.63 s per loop |
sparcspy | 3.31 s ± 2.02 s per loop | 1.63 s ± 636 ms per loop | 2.28 s ± 2.03 s per loop |
Performance with spatialdata seems to be much lower than with sparcspy. This is most likely a result of lower read speeds from the sdata object compared to HDF5.
The chosen example is also not ideal for benchmarking multi-threaded performance, as too few cells are extracted to offset the overhead of instantiating multiple workers.
By changing the chunking of the segmentation masks to reflect the default chunking used in SPARCSpy, the extraction time could be reduced from 29.4 s ± 2.48 s per loop to 18.5 s ± 811 ms per loop. While this is still slower than in SPARCSpy, it gives us a starting point for addressing this issue (see the sketch below).
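For reference, a minimal sketch of that rechunking step, assuming an sdata object laid out as in the script below; the 1000 x 1000 target is an illustrative value, not necessarily SPARCSpy's actual default:

```python
# illustrative target chunking; SPARCSpy's actual default may differ
target_chunks = (1000, 1000)

# rechunk the dask-backed segmentation masks so that each 128 x 128 extraction
# window touches only one or a few chunks instead of many tiny ones
sdata.labels["nuc_seg"].data = sdata.labels["nuc_seg"].data.rechunk(target_chunks)
sdata.labels["cyto_seg"].data = sdata.labels["cyto_seg"].data.rechunk(target_chunks)
```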
I developed a script that mimics the extraction workflow setup but is initialised with random data and executes identical code paths, to better compare the performance of h5py and sdata as backends for saving prior results.
Code used:
```python
import h5py
import spatialdata as spdata
import numpy as np
import time
from tqdm.auto import tqdm

chunk_sizes = [50, 100, 1000, 5000, 10000]

# initialize data structures for saving timing results
times_reading_hdf5 = {}
times_reading_sdata = {}
times_write_hdf5 = {}
times_write_sdata = {}

for chunk_size in chunk_sizes:
    times_reading_hdf5[chunk_size] = []
    times_reading_sdata[chunk_size] = []
    times_write_hdf5[chunk_size] = []
    times_write_sdata[chunk_size] = []

for chunk_size in tqdm(chunk_sizes):
    for iteration in tqdm(range(0, 3)):
        # generate random input data: a 3-channel image and two segmentation masks
        image = np.random.randint(0, np.iinfo(np.uint16).max, (3, 10000, 10000), dtype=np.uint16)
        mask = np.random.randint(0, np.iinfo(np.uint32).max, (2, 10000, 10000), dtype=np.uint32)

        # HDF5 write
        start = time.time()
        with h5py.File("test.h5", "w") as hf:
            hf.create_dataset(
                "input_image",
                data=image,
                chunks=(1, chunk_size, chunk_size),
            )
            hf.create_dataset(
                "seg_mask",
                data=mask,
                chunks=(1, chunk_size, chunk_size),
            )
        time_write_h5py = time.time() - start
        times_write_hdf5[chunk_size].append(time_write_h5py)
        # SDATA write
        start = time.time()
        sdata = spdata.SpatialData()
        sdata.write("test.sdata", overwrite=True)
        input_image = spdata.models.Image2DModel.parse(image, dims=["c", "y", "x"], transformations={"global": spdata.transformations.transformations.Identity()}, chunks=(1, 50, 50))
        input_image.data = input_image.data.rechunk((1, 50, 50))
        sdata.images["input_image"] = input_image
        sdata.write_element("input_image")
        # store the two segmentation masks as label layers
        mask_0 = spdata.models.Labels2DModel.parse(mask[0], dims=["y", "x"], transformations={"global": spdata.transformations.transformations.Identity()}, chunks=(50, 50))
        mask_1 = spdata.models.Labels2DModel.parse(mask[1], dims=["y", "x"], transformations={"global": spdata.transformations.transformations.Identity()}, chunks=(50, 50))
        mask_0.data = mask_0.data.rechunk((chunk_size, chunk_size))
        mask_1.data = mask_1.data.rechunk((chunk_size, chunk_size))
        sdata.labels["nuc_seg"] = mask_0
        sdata.labels["cyto_seg"] = mask_1
        sdata.write_element("nuc_seg")
        sdata.write_element("cyto_seg")
        time_write_sdata = time.time() - start
        times_write_sdata[chunk_size].append(time_write_sdata)
        # randomly generate 5000 xy coordinates that are within the image size
        coords = np.random.randint(0, 10000, (5000, 2))
        image_size = 128  # size of the extraction window in px
        width = image_size // 2

        # simulate extraction with h5py
        start = time.time()
        with h5py.File(
            "test.h5",
            "r",
            rdcc_nbytes=5242880000,
            rdcc_w0=1,
            rdcc_nslots=50000,
        ) as hf:
            masks = hf["seg_mask"]
            images = hf["input_image"]
            image_width = images.shape[1]
            height = images.shape[2]
            for i in range(0, len(coords)):
                x, y = coords[i]
                window_y = slice(x - width, x + width)
                window_x = slice(y - width, y + width)
                # only extract windows that lie fully inside the image
                condition = [
                    width < x,
                    x < image_width - width,
                    width < y,
                    y < height - width,
                ]
                if np.all(condition):
                    mask = masks[:2, window_x, window_y]
                    image = images[:, window_x, window_y]
                    # do something with the mask and image
                    mask = mask == 10
                    image = image * mask[0]
        time_read_h5py = time.time() - start
        times_reading_hdf5[chunk_size].append(time_read_h5py)

        # simulate extraction with sdata (lazy slicing of the dask-backed elements)
        start = time.time()
        for i in range(0, len(coords)):
            x, y = coords[i]
            window_x = slice(x - width, x + width)
            window_y = slice(y - width, y + width)
            mask1 = sdata.labels["nuc_seg"].data[window_x, window_y]
            mask2 = sdata.labels["cyto_seg"].data[window_x, window_y]
            image = sdata.images["input_image"].data[:, window_x, window_y]
            mask1 = mask1 == 10
            mask2 = mask2 == 10
            image = image * mask1
        time_read_sdata = time.time() - start
        times_reading_sdata[chunk_size].append(time_read_sdata)
# code for plotting
import pandas as pd

data = {
    'chunk': [],
    'times_read_h5py': [],
    'times_read_sdata': [],
    'times_write_h5py': [],
    'times_write_sdata': [],
}
for chunk in times_reading_hdf5.keys():
    for i in range(len(times_reading_hdf5[chunk])):
        data['chunk'].append(chunk)
        data['times_read_h5py'].append(times_reading_hdf5[chunk][i])
        data['times_read_sdata'].append(times_reading_sdata[chunk][i])
        data['times_write_h5py'].append(times_write_hdf5[chunk][i])
        data['times_write_sdata'].append(times_write_sdata[chunk][i])

# create the DataFrame and aggregate mean/std per chunk size
full_results = pd.DataFrame(data)
results_reading = full_results.get(["chunk", "times_read_sdata", "times_read_h5py"]).groupby(["chunk"]).agg(["mean", "std"])
results_writing = full_results.get(["chunk", "times_write_sdata", "times_write_h5py"]).groupby(["chunk"]).agg(["mean", "std"])

# plot the read times
import matplotlib.pyplot as plt

df = results_reading
plt.figure(figsize=(10, 6))
# SDATA read times
plt.errorbar(df.index, df[('times_read_sdata', 'mean')], yerr=df[('times_read_sdata', 'std')],
             label='SDATA Read Time', fmt='-o', capsize=5)
# H5PY read times
plt.errorbar(df.index, df[('times_read_h5py', 'mean')], yerr=df[('times_read_h5py', 'std')],
             label='H5PY Read Time', fmt='-o', capsize=5)
plt.xlabel('Chunk Size')
plt.ylabel('Time (s)')
plt.title('Read Times with Error Bars')
plt.legend()
plt.xscale('log')  # chunk sizes span several orders of magnitude
plt.show()

# plot the write times
df = results_writing
plt.figure(figsize=(10, 6))
# SDATA write times
plt.errorbar(df.index, df[('times_write_sdata', 'mean')], yerr=df[('times_write_sdata', 'std')],
             label='SDATA Write Time', fmt='-o', capsize=5)
# H5PY write times
plt.errorbar(df.index, df[('times_write_h5py', 'mean')], yerr=df[('times_write_h5py', 'std')],
             label='H5PY Write Time', fmt='-o', capsize=5)
plt.xlabel('Chunk Size')
plt.ylabel('Time (s)')
plt.title('Write Times with Error Bars')
plt.legend()
plt.xscale('log')  # chunk sizes span several orders of magnitude
plt.show()
```
Executing this code and tracking the resulting computation times gives the following results:
chunk | times_read_sdata_mean (s) | times_read_sdata_std (s) | times_read_h5py_mean (s) | times_read_h5py_std (s) |
---|---|---|---|---|
50 | 6.031215 | 0.072047 | 1.042576 | 0.068853 |
100 | 6.751713 | 0.176368 | 0.605539 | 0.042001 |
1000 | 6.562206 | 0.083729 | 0.526818 | 0.017236 |
5000 | 6.800794 | 0.292802 | 0.533734 | 0.030269 |
10000 | 6.686381 | 0.112755 | 0.540749 | 0.034289 |
chunk | times_write_sdata_mean (s) | times_write_sdata_std (s) | times_write_h5py_mean (s) | times_write_h5py_std (s) |
---|---|---|---|---|
50 | 66.940983 | 5.739447 | 1.243855 | 0.116332 |
100 | 50.373895 | 1.717572 | 0.620364 | 0.021012 |
1000 | 45.023347 | 0.529781 | 0.381684 | 0.010140 |
5000 | 44.070088 | 0.114096 | 0.394537 | 0.030758 |
10000 | 45.046422 | 0.561808 | 0.380628 | 0.014421 |
Potential solution idea for the slow read times from sdata: map the relevant sdata elements to memory-mapped temp arrays and perform the extraction from those.
Initial implementation in commit: bb16cdd
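Roughly, the idea looks like this (a minimal sketch, not the code from bb16cdd; the temp-file names are placeholders, and `coords`/`width` are as in the benchmark script above):

```python
import dask.array as da
import numpy as np

# dump the dask-backed sdata elements into memory-mapped temp arrays once ...
image_da = sdata.images["input_image"].data   # dask array, (c, y, x)
mask_da = sdata.labels["nuc_seg"].data        # dask array, (y, x)

image_mmap = np.memmap("temp_image.dat", dtype=image_da.dtype, mode="w+", shape=image_da.shape)
mask_mmap = np.memmap("temp_mask.dat", dtype=mask_da.dtype, mode="w+", shape=mask_da.shape)
da.store(image_da, image_mmap)
da.store(mask_da, mask_mmap)

# ... and perform the per-cell window reads against the memmaps instead of the sdata object
for x, y in coords:
    window = (slice(x - width, x + width), slice(y - width, y + width))
    cell_mask = mask_mmap[window]
    cell_image = image_mmap[(slice(None),) + window]
```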
Updated benchmarking with the new method:
Method | Threads = 1 | Threads = 2 | Threads = 3 |
---|---|---|---|
sparcsspatial | 29.4 s ± 2.48 s per loop | 19.8 s ± 1.89 s per loop | 15.3 s ± 1.63 s per loop |
sparcsspatial (mmap) | 12 s ± 1.78 s per loop | 12.4 s ± 1.86 s per loop | 10.2 s ± 307 ms per loop |
sparcspy | 3.31 s ± 2.02 s per loop | 1.63 s ± 636 ms per loop | 2.28 s ± 2.03 s per loop |
The use of memory-mapped temp arrays seems to speed up the actual extraction process but generates some overhead for mapping the images/masks to disk beforehand. To better understand this tradeoff, a proper benchmark needs to be performed that quantifies the following:
- extraction time per cell (compared across the three methods)
The significant difference between sparcspy and the sparcsspatial implementation resulted from no longer keeping the memory-mapped temp arrays for intermediate results as global variables, but instead reconnecting to these arrays in each save operation, which generates a lot of time overhead. By reimplementing the workflow to connect to the memory-mapped temp arrays only once per process (for single-threaded extraction once at the start, for multi-threaded extraction once per thread call), the extraction times improved significantly. This fix was then also transferred to the SPARCSspatial implementation, and the two were benchmarked against each other (a sketch of the difference is shown below).
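A minimal sketch of the difference (function and argument names are illustrative, not the actual SPARCSpy/SPARCSspatial code):

```python
import numpy as np

# previous pattern: reconnect to the memory-mapped temp array for every single cell save
def save_cell_reconnect(path, shape, dtype, index, cell_image):
    out = np.memmap(path, dtype=dtype, mode="r+", shape=shape)  # reopened on every call
    out[index] = cell_image

# fixed pattern: connect once per process/thread and reuse the handle for all cells
def extraction_worker(path, shape, dtype, cell_batch):
    out = np.memmap(path, dtype=dtype, mode="r+", shape=shape)  # opened once per worker
    for index, cell_image in cell_batch:
        out[index] = cell_image
```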
Software | Version |
---|---|
SPARCSpy | b289ec7ed16862357277c9a57e7ee4d575ce56f3 |
SPARCSpy_new_extraction_workflow | cd70c8f9b8aae7b40b85bfc3f202b95b0a3e714a |
SPARCSpy_new_extraction_workflow_improved | b37f2fde6cac755ead3f466a44d27961cb55068e |
SPARCSspatial | a59c1a497aa37ab44300006e5d8096ff4ea714e2 |
Replicates of 6 independent runs on the same input dataset with identical segmentation masks.
While the SPARCSspatial implementation is still somewhat less efficient than the base SPARCSpy implementation, this is an acceptable result with which we can continue working.
This analysis should be replicated with larger datasets to better estimate the effect of multithreading on processing time.
classification workflow implemented as of commit 52e3ccff396970fca1a1fee84956ab8a4c7f0f45
This also entailed some major classification workflow remodelling addressing issue #22.
The classification results are written to tables in the sdata object as well as to csv files. The latter functionality can be deprecated once we have worked with the sdata tables more (a sketch of the table writing is shown below).
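A hedged sketch of what writing such a table could look like with the spatialdata TableModel (the column names, the `cell_ids`/`class_scores` arrays, and the element names are illustrative; the actual implementation is in the commit above):

```python
import anndata as ad
import pandas as pd
from spatialdata.models import TableModel

# illustrative per-cell classification output
scores = pd.DataFrame({"cell_id": cell_ids, "class_score": class_scores})

adata = ad.AnnData(obs=scores)
adata.obs["region"] = pd.Categorical(["nuc_seg"] * len(adata))

# annotate the nucleus segmentation with the classification results;
# a second table could be parsed the same way for "cyto_seg"
table = TableModel.parse(adata, region="nuc_seg", region_key="region", instance_key="cell_id")
sdata.tables["classification_results"] = table
sdata.write_element("classification_results")

# keep the csv export in parallel for now
scores.to_csv("classification_results.csv", index=False)
```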
The aim is to use SpatialData as the data storage backend underlying the SPARCS workflow. This allows us to interface with many existing software solutions and build on existing software frameworks.
Needs
- Data Saving: we need an sdata object for each SPARCSpy run which contains all the data associated with that project. This data includes:
- Data visualisation:
To Dos
- Classic Project structure
- Batched Project structure
Issues that need to be addressed
Questions that remain to be addressed
- how can we save images so that they are displayed in napari as individual channels?
- how can we link one adata table to several annotations (cytosol and nuclei)? → each segmentation layer should receive its own table. Technically the classification results are not necessarily the same for both layers.