BodenmillerGroup / steinbock

A toolkit for processing multiplexed tissue images
https://bodenmillergroup.github.io/steinbock
MIT License
49 stars 14 forks source link

ArrayMemoryError - 'measure_centroid_distance_neighbors' #222

Closed paulrbuckley-kcl closed 10 months ago

paulrbuckley-kcl commented 1 year ago

Hi all,

For a handful of large CODEX WSI (but no larger than others that run fine), I receive the following error when running neighbor measurements:

2023-09-05 21:06:22,273 ERROR steinbock.measurement.neighbors - Error measuring neighbors in /data/masks/KCL567-440-A18_LN_post_top_noCD6.tiff
Traceback (most recent call last):
  File "/app/steinbock/steinbock/measurement/neighbors.py", line 195, in try_measure_neighbors_from_disk
    neighbors = measure_neighbors(
  File "/app/steinbock/steinbock/measurement/neighbors.py", line 178, in measure_neighbors
    return neighborhood_type.value(mask, metric=metric, dmax=dmax, kmax=kmax)
  File "/app/steinbock/steinbock/measurement/neighbors.py", line 48, in _measure_centroid_distance_neighbors
    condensed_dists = pdist(centroids, metric=metric)
  File "/opt/venv/lib/python3.8/site-packages/scipy/spatial/distance.py", line 2233, in pdist
    return pdist_fn(X, out=out, **kwargs)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 1.65 TiB for an array with shape (226214549506,) and data type float64

It's asking for 1.65Tib of memory which is obviously enormous.. I'm wondering if you had any suggestions for how to work around this? The command is below - was thinking to change dmax? Thanks for your help!

$steinbock measure neighbors --type centroids --dmax 15 --masks /data/masks/ -o /data/neighbors_thresholddist/

jwindhager commented 1 year ago

For a given cell count n, the function needs to compute n^2 pairwise distances subject to thresholding, each of which is stored as float64 (i.e., 8 bytes). For around n = 500 000 cells, which seems to be the case here, this would indeed result in a memory footprint of 1.8TB. Unfortunately, there is no easy way around this with the current implementation. You could of course operate on tiles (steinbock utils mosaics), but then you would obviously miss neighbors across tile borders. The best solution, however, would be to not store distances that exceed dmax in memory, which would require a rewrite of this functionality. Maybe the current maintainer of steinbock, @Milad4849, can comment on whether this is something that is on the roadmap.

jwindhager commented 1 year ago

Oh, almost forgot: you might want to try --type borders or --type expansion, together with --mmap enabled, as these functions do not compute all pairwise distances. The command may take a long time, though, so no guarantee that this will work within reasonable time.

paulrbuckley-kcl commented 1 year ago

thanks, v helpful. What I find odd is that for 450k cells there's no issue with 60GB of memory. Not quite got my head around that. I'm assuming them reducing dmax = e.g., 10 won't help? I will absoluytely try --type borders with mmap. Thanks