spatialdata writers for SPARCStools implemented in https://github.com/MannLabs/SPARCStools/commit/13301fa85d60ad787121ee70070b18cfdd90900e
segmentation workflow is functional for sharded and regular segmentations as of this commit: https://github.com/MannLabs/SPARCSspatial/commit/523f142389c7ff1e60aae3e8004b3cc6864bc950
Small benchmarking comparison on the same dataset (relatively small 2000x2000 px)
Method | Sharded | Regular |
---|---|---|
sparcsspatial | 38.6 s ± 76.6 ms | 30.4 s ± 313 ms |
SPARCSpy | 27.3 s ± 1.29 s | 29.7 s ± 570 ms |
Sharding generates some overhead in the new sparcsspatial implementation compared to non-sharded processing but is required for larger-than-memory computations.
This overhead seems to be larger than in the original SPARCSpy implementation. This is probably a result of a suboptimal implementation of the sharding resolution currently found in the working version.
We first generate a memory-mapped temp array into which all shard results are aggregated. Then we transfer this array to the sdata object. This workaround was implemented because I did not find a way to update a mask in the sdata object in iterative steps while keeping it backed on disk. Maybe a solution exists that I have not found yet that would allow us to circumvent this additional step and make the process more efficient.
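As a rough illustration of this workaround (a minimal sketch only, not the actual SPARCSspatial code; the temp-file path, mosaic shape, and the `shard_results` iterable are placeholders):

```python
import numpy as np
import spatialdata as spdata
from spatialdata.models import Labels2DModel

# memory-mapped temp array covering the full mosaic (shape is a placeholder)
full_mask = np.memmap("temp_mask.dat", dtype=np.uint32, mode="w+", shape=(10000, 10000))

# aggregate the per-shard segmentation results into the temp array
for (y_slice, x_slice), shard_mask in shard_results:  # placeholder iterable of shard outputs
    full_mask[y_slice, x_slice] = shard_mask
full_mask.flush()

# only once the full mask is assembled, transfer it into the sdata object
sdata = spdata.SpatialData()
sdata.write("project.sdata", overwrite=True)
sdata.labels["seg_mask"] = Labels2DModel.parse(np.asarray(full_mask), dims=["y", "x"])
sdata.write_element("seg_mask")
```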
extraction workflow is functional for single-threaded and multi-threaded processes as of this commit: dd3ee0f7b5b37fdb8a219f011cf445c9ab36446b
Small Benchmarking comparison on the same dataset (relatively small 2000x2000 px, total of 683 unique cells to extract)
Method | Threads = 1 | Threads = 2 | Threads = 3 |
---|---|---|---|
sparcsspatial | 29.4 s ± 2.48 s per loop | 19.8 s ± 1.89 s per loop | 15.3 s ± 1.63 s per loop |
sparcspy | 3.31 s ± 2.02 s per loop | 1.63 s ± 636 ms per loop | 2.28 s ± 2.03 s per loop |
Performance with spatialdata seems to be much lower than with sparcspy. This is most likely a result of lower read speeds from the sdata object compared to HDF5.
The chosen example is also not ideal for benchmarking multi-threaded performance, as too few cells are extracted to offset the overhead of instantiating multiple workers.
By changing the chunking of the segmentation masks to reflect the default chunking used in SPARCSpy, the extraction time could be reduced from 29.4 s ± 2.48 s per loop to 18.5 s ± 811 ms per loop. While this is still slower than in SPARCSpy, it gives us a starting point for addressing this issue (see the sketch below).
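For reference, a minimal sketch of that rechunking step, assuming an sdata object laid out as in the script below; the 1000 x 1000 target is an illustrative value, not necessarily SPARCSpy's actual default:

```python
# illustrative target chunking; SPARCSpy's actual default may differ
target_chunks = (1000, 1000)

# rechunk the dask-backed segmentation masks so that each 128 x 128 extraction
# window touches only one or a few chunks instead of many tiny ones
sdata.labels["nuc_seg"].data = sdata.labels["nuc_seg"].data.rechunk(target_chunks)
sdata.labels["cyto_seg"].data = sdata.labels["cyto_seg"].data.rechunk(target_chunks)
```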
I developed a script that mimics the extraction workflow setup but is initialised with random data and executes identical code paths, to better compare the performance of h5py and sdata as backends for saving prior results.
Code used:
```python
import h5py
import spatialdata as spdata
import numpy as np
import time
from tqdm.auto import tqdm

chunk_sizes = [50, 100, 1000, 5000, 10000]

# initialize data structures for saving timing results
times_reading_hdf5 = {}
times_reading_sdata = {}
times_write_hdf5 = {}
times_write_sdata = {}

for chunk_size in chunk_sizes:
    times_reading_hdf5[chunk_size] = []
    times_reading_sdata[chunk_size] = []
    times_write_hdf5[chunk_size] = []
    times_write_sdata[chunk_size] = []

for chunk_size in tqdm(chunk_sizes):
    for iteration in tqdm(range(0, 3)):
        # generate random input data: a 3-channel image and two segmentation masks
        image = np.random.randint(0, np.iinfo(np.uint16).max, (3, 10000, 10000), dtype=np.uint16)
        mask = np.random.randint(0, np.iinfo(np.uint32).max, (2, 10000, 10000), dtype=np.uint32)

        # HDF5 write
        start = time.time()
        with h5py.File("test.h5", "w") as hf:
            hf.create_dataset(
                "input_image",
                data=image,
                chunks=(1, chunk_size, chunk_size),
            )
            hf.create_dataset(
                "seg_mask",
                data=mask,
                chunks=(1, chunk_size, chunk_size),
            )
        time_write_h5py = time.time() - start
        times_write_hdf5[chunk_size].append(time_write_h5py)
        # SDATA write
        start = time.time()
        sdata = spdata.SpatialData()
        sdata.write("test.sdata", overwrite=True)
        input_image = spdata.models.Image2DModel.parse(image, dims=["c", "y", "x"], transformations={"global": spdata.transformations.transformations.Identity()}, chunks=(1, 50, 50))
        input_image.data = input_image.data.rechunk((1, 50, 50))
        sdata.images["input_image"] = input_image
        sdata.write_element("input_image")
        # store the two segmentation masks as label layers
        mask_0 = spdata.models.Labels2DModel.parse(mask[0], dims=["y", "x"], transformations={"global": spdata.transformations.transformations.Identity()}, chunks=(50, 50))
        mask_1 = spdata.models.Labels2DModel.parse(mask[1], dims=["y", "x"], transformations={"global": spdata.transformations.transformations.Identity()}, chunks=(50, 50))
        mask_0.data = mask_0.data.rechunk((chunk_size, chunk_size))
        mask_1.data = mask_1.data.rechunk((chunk_size, chunk_size))
        sdata.labels["nuc_seg"] = mask_0
        sdata.labels["cyto_seg"] = mask_1
        sdata.write_element("nuc_seg")
        sdata.write_element("cyto_seg")
        time_write_sdata = time.time() - start
        times_write_sdata[chunk_size].append(time_write_sdata)
        # randomly generate 5000 xy coordinates that are within the image size
        coords = np.random.randint(0, 10000, (5000, 2))
        image_size = 128  # size of the extraction window in px
        width = image_size // 2

        # simulate extraction with h5py
        start = time.time()
        with h5py.File(
            "test.h5",
            "r",
            rdcc_nbytes=5242880000,
            rdcc_w0=1,
            rdcc_nslots=50000,
        ) as hf:
            masks = hf["seg_mask"]
            images = hf["input_image"]
            image_width = images.shape[1]
            height = images.shape[2]
            for i in range(0, len(coords)):
                x, y = coords[i]
                window_y = slice(x - width, x + width)
                window_x = slice(y - width, y + width)
                # only extract windows that lie fully inside the image
                condition = [
                    width < x,
                    x < image_width - width,
                    width < y,
                    y < height - width,
                ]
                if np.all(condition):
                    mask = masks[:2, window_x, window_y]
                    image = images[:, window_x, window_y]
                    # do something with the mask and image
                    mask = mask == 10
                    image = image * mask[0]
        time_read_h5py = time.time() - start
        times_reading_hdf5[chunk_size].append(time_read_h5py)

        # simulate extraction with sdata (lazy slicing of the dask-backed elements)
        start = time.time()
        for i in range(0, len(coords)):
            x, y = coords[i]
            window_x = slice(x - width, x + width)
            window_y = slice(y - width, y + width)
            mask1 = sdata.labels["nuc_seg"].data[window_x, window_y]
            mask2 = sdata.labels["cyto_seg"].data[window_x, window_y]
            image = sdata.images["input_image"].data[:, window_x, window_y]
            mask1 = mask1 == 10
            mask2 = mask2 == 10
            image = image * mask1
        time_read_sdata = time.time() - start
        times_reading_sdata[chunk_size].append(time_read_sdata)
# code for plotting
import pandas as pd

data = {
    'chunk': [],
    'times_read_h5py': [],
    'times_read_sdata': [],
    'times_write_h5py': [],
    'times_write_sdata': [],
}
for chunk in times_reading_hdf5.keys():
    for i in range(len(times_reading_hdf5[chunk])):
        data['chunk'].append(chunk)
        data['times_read_h5py'].append(times_reading_hdf5[chunk][i])
        data['times_read_sdata'].append(times_reading_sdata[chunk][i])
        data['times_write_h5py'].append(times_write_hdf5[chunk][i])
        data['times_write_sdata'].append(times_write_sdata[chunk][i])

# create the DataFrame and aggregate mean/std per chunk size
full_results = pd.DataFrame(data)
results_reading = full_results.get(["chunk", "times_read_sdata", "times_read_h5py"]).groupby(["chunk"]).agg(["mean", "std"])
results_writing = full_results.get(["chunk", "times_write_sdata", "times_write_h5py"]).groupby(["chunk"]).agg(["mean", "std"])

# plot the read times
import matplotlib.pyplot as plt

df = results_reading
plt.figure(figsize=(10, 6))
# SDATA read times
plt.errorbar(df.index, df[('times_read_sdata', 'mean')], yerr=df[('times_read_sdata', 'std')],
             label='SDATA Read Time', fmt='-o', capsize=5)
# H5PY read times
plt.errorbar(df.index, df[('times_read_h5py', 'mean')], yerr=df[('times_read_h5py', 'std')],
             label='H5PY Read Time', fmt='-o', capsize=5)
plt.xlabel('Chunk Size')
plt.ylabel('Time (s)')
plt.title('Read Times with Error Bars')
plt.legend()
plt.xscale('log')  # chunk sizes span several orders of magnitude
plt.show()

# plot the write times
df = results_writing
plt.figure(figsize=(10, 6))
# SDATA write times
plt.errorbar(df.index, df[('times_write_sdata', 'mean')], yerr=df[('times_write_sdata', 'std')],
             label='SDATA Write Time', fmt='-o', capsize=5)
# H5PY write times
plt.errorbar(df.index, df[('times_write_h5py', 'mean')], yerr=df[('times_write_h5py', 'std')],
             label='H5PY Write Time', fmt='-o', capsize=5)
plt.xlabel('Chunk Size')
plt.ylabel('Time (s)')
plt.title('Write Times with Error Bars')
plt.legend()
plt.xscale('log')  # chunk sizes span several orders of magnitude
plt.show()
```
Executing this code and tracking the resulting computation times gives the following results:
chunk | times_read_sdata_mean (s) | times_read_sdata_std (s) | times_read_h5py_mean (s) | times_read_h5py_std (s) |
---|---|---|---|---|
50 | 6.031215 | 0.072047 | 1.042576 | 0.068853 |
100 | 6.751713 | 0.176368 | 0.605539 | 0.042001 |
1000 | 6.562206 | 0.083729 | 0.526818 | 0.017236 |
5000 | 6.800794 | 0.292802 | 0.533734 | 0.030269 |
10000 | 6.686381 | 0.112755 | 0.540749 | 0.034289 |
chunk | times_write_sdata_mean (s) | times_write_sdata_std (s) | times_write_h5py_mean (s) | times_write_h5py_std (s) |
---|---|---|---|---|
50 | 66.940983 | 5.739447 | 1.243855 | 0.116332 |
100 | 50.373895 | 1.717572 | 0.620364 | 0.021012 |
1000 | 45.023347 | 0.529781 | 0.381684 | 0.010140 |
5000 | 44.070088 | 0.114096 | 0.394537 | 0.030758 |
10000 | 45.046422 | 0.561808 | 0.380628 | 0.014421 |
Potential solution idea for the slow read times from sdata: map the relevant sdata elements to memory-mapped temp arrays and perform the extraction from those.
Initial implementation in commit: bb16cdd
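Roughly, the idea looks like this (a minimal sketch, not the code from bb16cdd; the temp-file names are placeholders, and `coords`/`width` are as in the benchmark script above):

```python
import dask.array as da
import numpy as np

# dump the dask-backed sdata elements into memory-mapped temp arrays once ...
image_da = sdata.images["input_image"].data   # dask array, (c, y, x)
mask_da = sdata.labels["nuc_seg"].data        # dask array, (y, x)

image_mmap = np.memmap("temp_image.dat", dtype=image_da.dtype, mode="w+", shape=image_da.shape)
mask_mmap = np.memmap("temp_mask.dat", dtype=mask_da.dtype, mode="w+", shape=mask_da.shape)
da.store(image_da, image_mmap)
da.store(mask_da, mask_mmap)

# ... and perform the per-cell window reads against the memmaps instead of the sdata object
for x, y in coords:
    window = (slice(x - width, x + width), slice(y - width, y + width))
    cell_mask = mask_mmap[window]
    cell_image = image_mmap[(slice(None),) + window]
```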
Updated benchmarking with the new method:
Method | Threads = 1 | Threads = 2 | Threads = 3 |
---|---|---|---|
sparcsspatial | 29.4 s ± 2.48 s per loop | 19.8 s ± 1.89 s per loop | 15.3 s ± 1.63 s per loop |
sparcsspatial (mmap) | 12 s ± 1.78 s per loop | 12.4 s ± 1.86 s per loop | 10.2 s ± 307 ms per loop |
sparcspy | 3.31 s ± 2.02 s per loop | 1.63 s ± 636 ms per loop | 2.28 s ± 2.03 s per loop |
The use of memory-mapped temp arrays seems to speed up the actual extraction process but generates some overhead for mapping the images/masks to disk beforehand. To better understand this tradeoff, a proper benchmark needs to be performed that quantifies the following:
- extraction time per cell (compared across the three methods)
The significant difference between sparcspy and the sparcsspatial implementation resulted from no longer keeping the memory-mapped temp arrays for intermediate results as global variables, but instead reconnecting to these arrays in each save operation, which generates a lot of time overhead. By reimplementing the workflow to connect to the memory-mapped temp arrays only once per process (for single-threaded extraction once at the start, for multi-threaded extraction once per thread call), the extraction times improved significantly. This fix was then also transferred to the SPARCSspatial implementation, and the two were benchmarked against each other (a sketch of the difference is shown below).
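A minimal sketch of the difference (function and argument names are illustrative, not the actual SPARCSpy/SPARCSspatial code):

```python
import numpy as np

# previous pattern: reconnect to the memory-mapped temp array for every single cell save
def save_cell_reconnect(path, shape, dtype, index, cell_image):
    out = np.memmap(path, dtype=dtype, mode="r+", shape=shape)  # reopened on every call
    out[index] = cell_image

# fixed pattern: connect once per process/thread and reuse the handle for all cells
def extraction_worker(path, shape, dtype, cell_batch):
    out = np.memmap(path, dtype=dtype, mode="r+", shape=shape)  # opened once per worker
    for index, cell_image in cell_batch:
        out[index] = cell_image
```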
Software | Version |
---|---|
SPARCSpy | b289ec7ed16862357277c9a57e7ee4d575ce56f3 |
SPARCSpy_new_extraction_workflow | cd70c8f9b8aae7b40b85bfc3f202b95b0a3e714a |
SPARCSpy_new_extraction_workflow_improved | b37f2fde6cac755ead3f466a44d27961cb55068e |
SPARCSspatial | a59c1a497aa37ab44300006e5d8096ff4ea714e2 |
Replicates of 6 independent runs on the same input dataset with identical segmentation masks.
While the SPARCSspatial implementation is still somewhat less efficient than the base SPARCSpy implementation, this is an acceptable result with which we can continue working.
This analysis should be replicated with larger datasets to better estimate the effect of multithreading on processing time.
classification workflow implemented as of commit 52e3ccff396970fca1a1fee84956ab8a4c7f0f45
This also entailed some major classification workflow remodelling addressing issue #22.
The classification results are written to tables in the sdata object as well as to csv files. The latter functionality can be deprecated once we have worked with the sdata tables more (a sketch of the table writing is shown below).
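A hedged sketch of what writing such a table could look like with the spatialdata TableModel (the column names, the `cell_ids`/`class_scores` arrays, and the element names are illustrative; the actual implementation is in the commit above):

```python
import anndata as ad
import pandas as pd
from spatialdata.models import TableModel

# illustrative per-cell classification output
scores = pd.DataFrame({"cell_id": cell_ids, "class_score": class_scores})

adata = ad.AnnData(obs=scores)
adata.obs["region"] = pd.Categorical(["nuc_seg"] * len(adata))

# annotate the nucleus segmentation with the classification results;
# a second table could be parsed the same way for "cyto_seg"
table = TableModel.parse(adata, region="nuc_seg", region_key="region", instance_key="cell_id")
sdata.tables["classification_results"] = table
sdata.write_element("classification_results")

# keep the csv export in parallel for now
scores.to_csv("classification_results.csv", index=False)
```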
The aim is to use SpatialData as the data storage backend underlying the SPARCS workflow. This allows us to interface with many existing software solutions and build on existing software frameworks.
Needs
- Data Saving: we need an sdata object for each SPARCSpy run which contains all the data associated with that project. This data includes:
- Data visualisation:
To Dos
- Classic Project structure
- Batched Project structure
Issues that need to be addressed
Questions that remain to be addressed
- how can we save images so that they are displayed in napari as individual channels?
- how can we link one adata table to several annotations (cytosol and nuclei)? → each segmentation layer should receive its own table. Technically the classification results are not necessarily the same for both layers.