STOmics / Stereopy

A toolkit for spatial transcriptomics analysis.
MIT License

Saving StereoExpData as new GEF exceeds memory requirements #156

Closed bmill3r closed 11 months ago

bmill3r commented 1 year ago

Hello,

I am wondering if there is a more memory-efficient way to save a StereoExpData as a new GEF file after filtering. For example, after filtering I have the StereoExpData below, but trying to save it to a GEF requires 758 GiB of memory, which exceeds my available memory by several fold.

out_tissue_bin1_e4

StereoExpData object with n_cells X n_genes = 4582375 X 44382
bin_type: bins
bin_size: 1
offset_x = 1
offset_y = 2
cells: ['cell_name', 'total_counts', 'n_genes_by_counts', 'pct_counts_mt']
genes: ['gene_name', 'n_cells', 'n_counts', 'mean_umi']

When trying to save this as a new GEF:

st.io.write_mid_gef(
        data=out_tissue_bin1_e4,
        output="outsideTissue.gef"
        )

[2023-08-18 20:43:56][Stereo][7922][MainThread][140220713068352][writer][267][INFO]: The output standard gef file only contains one expression matrix with mid count.Please make sure the expression matrix of StereoExpData object is mid count without normaliztion.
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[91], line 1
----> 1 st.io.write_mid_gef(
      2         data=out_tissue_bin1_e4,
      3         output="outsideTissue.gef"
      4         )

File ~/mambaforge/envs/py3.8/lib/python3.8/site-packages/stereo/io/writer.py:272, in write_mid_gef(data, output)
    270 final_exp = []  # [(x_1,y_1,umi_1),(x_2,y_2,umi_2)]
    271 final_gene = []  # [(A,offset,count)]
--> 272 exp_np = data.exp_matrix.toarray()
    274 for i in range(exp_np.shape[1]):
    275     gene_exp = exp_np[:, i]

File ~/mambaforge/envs/py3.8/lib/python3.8/site-packages/scipy/sparse/compressed.py:1039, in _cs_matrix.toarray(self, order, out)
   1037 if out is None and order is None:
   1038     order = self._swap('cf')[0]
-> 1039 out = self._process_toarray_args(order, out)
   1040 if not (out.flags.c_contiguous or out.flags.f_contiguous):
   1041     raise ValueError('Output array must be C or F contiguous')

File ~/mambaforge/envs/py3.8/lib/python3.8/site-packages/scipy/sparse/base.py:1202, in spmatrix._process_toarray_args(self, order, out)
   1200     return out
   1201 else:
-> 1202     return np.zeros(self.shape, dtype=self.dtype, order=order)

MemoryError: Unable to allocate 758. GiB for an array with shape (4582375, 44382) and data type uint32
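For context (assuming nothing beyond the shape and dtype reported in the traceback), the requested allocation checks out: a dense uint32 array of that shape really is about 758 GiB:

```python
# Dense uint32 array of the reported shape: n_cells * n_genes * 4 bytes.
n_cells, n_genes = 4582375, 44382
gib = n_cells * n_genes * 4 / 2**30
print(round(gib, 1))  # -> 757.6
```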

Thanks, Brendan

tanliwei-coder commented 1 year ago

I see that your data is very large. Because write_mid_gef converts the expression matrix from a sparse matrix to a numpy.ndarray, it is difficult to reduce the memory consumption. We will try to optimize this in a future version.
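Until the writer is optimized upstream, one memory-friendlier pattern (a sketch only, not part of Stereopy's API) is to keep the matrix sparse and walk it column by column in CSC form, so only one gene's nonzero entries are materialized at a time instead of the full dense matrix:

```python
import numpy as np
from scipy import sparse

# Toy stand-in for data.exp_matrix (cells x genes); real data would stay sparse.
dense = np.array([[0, 2, 0],
                  [1, 0, 3],
                  [0, 0, 4]], dtype=np.uint32)
exp_matrix = sparse.csr_matrix(dense)

# Convert once to CSC so single-gene columns are cheap to slice.
exp_csc = exp_matrix.tocsc()

final_exp = []   # per-record (cell_index, umi), mirroring the writer's loop
final_gene = []  # (gene_index, offset, count), mirroring the writer's loop
offset = 0
for gi in range(exp_csc.shape[1]):
    col = exp_csc.getcol(gi)  # still sparse: only one gene in memory
    final_exp.extend(zip(col.indices.tolist(), col.data.tolist()))
    final_gene.append((gi, offset, col.nnz))
    offset += col.nnz
```

Peak memory here is one column's nonzeros rather than an n_cells x n_genes dense array.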

bmill3r commented 1 year ago

Hi @tanliwei-coder

Thanks for getting back to me. Is there a way to convert a bin1 GEF StereoExpObject to a larger size bin, like bin10, on the fly? Or does the data need to be read back in via st.io.read_gef_info() to make this conversion?

Thanks again, Brendan

tanliwei-coder commented 1 year ago

> Hi @tanliwei-coder
>
> Thanks for getting back to me. Is there a way to convert a bin1 GEF StereoExpObject to a larger size bin, like bin10, on the fly? Or does the data need to be read back in via st.io.read_gef_info() to make this conversion?
>
> Thanks again, Brendan

It cannot be converted on the fly; the data needs to be read again by running st.io.read_gef() with the parameter bin_size set to 10.

read_gef_info() is only used for reading the metadata of a GEF file.
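For reference, the re-read suggested above would look like `data_bin10 = st.io.read_gef(file_path='outsideTissue.gef', bin_size=10)`. Conceptually, binning amounts to integer-dividing the bin1 coordinates and summing the UMIs that land in the same bin, which can be sketched with plain numpy (toy coordinates for illustration, not Stereopy internals):

```python
import numpy as np

# Toy bin1 records: (x, y, umi). Binning to bin10 floors coordinates
# to their bin index and sums UMIs that fall into the same bin.
x = np.array([0, 3, 9, 12, 15], dtype=np.int64)
y = np.array([0, 4, 9, 1, 2], dtype=np.int64)
umi = np.array([1, 2, 3, 4, 5], dtype=np.uint32)

bin_size = 10
bx, by = x // bin_size, y // bin_size       # bin indices per record
keys = bx * (by.max() + 1) + by             # flatten (bx, by) to 1-D keys
uniq, inv = np.unique(keys, return_inverse=True)
binned_umi = np.bincount(inv, weights=umi)  # UMI sum per occupied bin
```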