azizilab / starfysh

Spatial Transcriptomic Analysis using Reference-Free auxiliarY deep generative modeling and Shared Histology
BSD 3-Clause "New" or "Revised" License

Compatibility with Visium HD data? #46

Open · Rafael-Silva-Oliveira opened this issue 3 weeks ago

Rafael-Silva-Oliveira commented 3 weeks ago

Hey, I was wondering if Starfysh works with Visium HD data (over 200-300k spots/bins)? I'm getting memory errors on my local machine and also on a server with a large amount of memory, so I was wondering whether it's optimized to work with Visium HD data.

Thank you
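
For scale, a rough back-of-envelope (assuming dense float64 storage; the real matrices are sparse, so actual usage is lower, but pairwise steps grow from there):

# Memory for a dense 300k-bin x 20k-gene float64 expression matrix
n_bins, n_genes, bytes_per_value = 300_000, 20_000, 8
print(n_bins * n_genes * bytes_per_value / 1e9, "GB")  # ~48 GB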

YinuoJin commented 3 weeks ago

Hi @Rafael-Silva-Oliveira

Previous users have tried Starfysh on Slide-seq datasets for deconvolution, but we haven't tried it on Visium HD (I'm assuming it's the latest 10X Visium with 2 µm resolution?). I'm wondering at which stage the memory error occurs (e.g. data loading, preprocessing, model training, etc.). Thanks in advance.

Rafael-Silva-Oliveira commented 3 weeks ago

Hey, yes, 2 µm resolution, but using either the 8 µm (>580k bins/spots) or the 16 µm (>150k bins/spots) binning still gives the same error for the sample I'm testing with (the Visium HD lung dataset from the 10X website). The error happens when defining the VisiumArguments:

from starfysh import utils  # Starfysh utilities module

visium_args = utils.VisiumArguments(
    adata,          # raw count AnnData
    adata_normed,   # normalized AnnData
    gene_sig,       # cell-type signature gene sets
    img_metadata,   # histology image & metadata
    n_anchors=60,
    window_size=3,
    sample_id=sample_id,
)
YinuoJin commented 3 weeks ago

Thanks for the update! I'll test the Visium HD dataset from 10X to diagnose at which stage Starfysh becomes RAM intensive. In the meantime, does regular scanpy preprocessing (e.g. library-size normalization, UMAP visualization) run properly on your machine?
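
A minimal version of that baseline check might look like this (standard scanpy calls, assuming adata is the loaded Visium HD AnnData):

import scanpy as sc

# Standard scanpy preprocessing as a memory sanity check
sc.pp.normalize_total(adata, target_sum=1e4)          # library-size normalization
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=40)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=40)
sc.tl.umap(adata)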

Rafael-Silva-Oliveira commented 3 weeks ago

Hey, yes, regular scanpy works fine (both on my machine and on the server). But the terminal is simply killed a few minutes after visium_args starts running (on both). There's no informative error message; the terminal just says "Killed" and any debugging session or running script stops.
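
A bare "Killed" usually means the OS OOM killer ended the process. One way to see how close it gets is to log resident memory right before the call (a sketch assuming the psutil package is installed):

import os
import psutil

# Report this process's resident memory; the last printed value before
# the kill shows roughly where the allocation blew up
rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
print(f"RSS before VisiumArguments: {rss_gb:.1f} GB")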

Rafael-Silva-Oliveira commented 2 weeks ago

I created a subsample with 15k bins and it still crashes, so it would need to be optimized a fair bit to handle the 8 µm resolution of Visium HD :)


[2024-06-17 14:29:49] Subsetting highly variable & signature genes ...
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "<path_to_project>\lib\site-packages\joblib\_utils.py", line 72, in __call__    return self.func(**kwargs)
  File "<path_to_project>\lib\site-packages\joblib\parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
  File "<path_to_project>\lib\site-packages\joblib\parallel.py", line 598, in <listcomp>
    return [func(*args, **kwargs)
MemoryError: Allocation failed (probably too large).
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<path_to_project>\lib\site-packages\<project_name>\utils.py", line 96, in __init__
    sc.pp.neighbors(self.adata_norm, n_neighbors=15, n_pcs=40, use_rep='X')
  File "<path_to_project>\lib\site-packages\scanpy\neighbors\__init__.py", line 176, in neighbors
    neighbors.compute_neighbors(
  File "<path_to_project>\lib\site-packages\scanpy\neighbors\__init__.py", line 561, in compute_neighbors
    self._distances = transformer.fit_transform(X)
  File "<path_to_project>\lib\site-packages\sklearn\utils\_set_output.py", line 313, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "<path_to_project>\lib\site-packages\pynndescent\pynndescent_.py", line 2252, in fit_transform
    self.fit(X, compress_index=False)
  File "<path_to_project>\lib\site-packages\pynndescent\pynndescent_.py", line 2170, in fit
    self.index_ = NNDescent(
  File "<path_to_project>\lib\site-packages\pynndescent\pynndescent_.py", line 829, in __init__
    leaf_array = rptree_leaf_array(self._rp_forest)
  File "<path_to_project>\lib\site-packages\pynndescent\rp_trees.py", line 1436, in rptree_leaf_array
    return np.vstack(rptree_leaf_array_parallel(rp_forest))
  File "<path_to_project>\lib\site-packages\pynndescent\rp_trees.py", line 1428, in rptree_leaf_array_parallel
    result = joblib.Parallel(n_jobs=-1, require="sharedmem")(
  File "<path_to_project>\lib\site-packages\joblib\parallel.py", line 2007, in __call__
    return output if self.return_generator else list(output)
  File "<path_to_project>\lib\site-packages\joblib\parallel.py", line 1650, in _get_outputs
    yield from self._retrieve()
  File "<path_to_project>\lib\site-packages\joblib\parallel.py", line 1754, in _retrieve
    self._raise_error_fast()
  File "<path_to_project>\lib\site-packages\joblib\parallel.py", line 1789, in _raise_error_fast
    error_job.get_result(self.timeout)
  File "<path_to_project>\lib\site-packages\joblib\parallel.py", line 745, in get_result
    return self._return_or_raise()
  File "<path_to_project>\lib\site-packages\joblib\parallel.py", line 763, in _return_or_raise
    raise self._result
MemoryError: Allocation failed (probably too large).
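
For anyone reproducing the subsample test, a minimal sketch using scanpy's built-in subsampler (assuming adata holds the full set of HD bins):

import scanpy as sc

# Randomly keep 15k bins to shrink the memory footprint before re-running
sc.pp.subsample(adata, n_obs=15_000, random_state=0)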
YinuoJin commented 1 week ago

Hi @Rafael-Silva-Oliveira,

Sorry for the late reply! I reproduced the 16-µm-binned Visium HD processing with the Starfysh function you attached (utils.VisiumArguments). It runs without memory issues on a machine with 64 GB RAM. The MemoryError was likely raised during computation of the neighborhood graph & UMAP (sc.pp.neighbors; sc.tl.umap). I'll update the processing function this week to (1) remove unnecessary computations and (2) expose the processing hyperparameters to users.
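
As a possible stopgap (not the official fix): the traceback shows the kNN graph being built with use_rep='X', i.e. in full gene space; computing it on a PCA embedding instead is far lighter. A sketch, assuming adata_normed is the normalized AnnData:

import scanpy as sc

# Compute a 40-dim PCA embedding first, then build the neighbor graph on it
# rather than on the full expression matrix (use_rep='X' in the traceback)
sc.pp.pca(adata_normed, n_comps=40)
sc.pp.neighbors(adata_normed, n_neighbors=15, use_rep="X_pca")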

In addition, the way spatial information is stored in Visium HD differs from regular Visium, so we'll try to address that as well. My quickest suggestion would be to run Starfysh (or any other processing/analysis package) on a server with more memory.
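
For context on the storage difference: Space Ranger writes Visium HD as one folder per bin size, with bin positions in a parquet file instead of the CSV used by regular Visium. A loading sketch (paths and column names are assumptions based on the standard Space Ranger HD layout):

import pandas as pd
import scanpy as sc

bin_dir = "binned_outputs/square_016um"  # one folder per bin size (002/008/016 µm)
adata = sc.read_10x_h5(f"{bin_dir}/filtered_feature_bc_matrix.h5")

# Bin coordinates live in a parquet table keyed by barcode
pos = pd.read_parquet(f"{bin_dir}/spatial/tissue_positions.parquet")
pos = pos.set_index("barcode").loc[adata.obs_names]
adata.obsm["spatial"] = pos[["pxl_col_in_fullres", "pxl_row_in_fullres"]].to_numpy()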

Here are the reproduction logs, which show that the "standard" preprocessing indeed takes a long time:

[2024-06-25 17:51:23] Subsetting highly variable & signature genes ...
[2024-06-25 17:58:19] Smoothing library size by taking averaging with neighbor spots...
[2024-06-25 17:59:45] Retrieving & normalizing signature gene expressions...
[2024-06-25 17:59:57] Identifying anchor spots (highly expression of specific cell-type signatures)...