aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.

MemoryError when running impute_accessibility #290

Closed liouhy closed 4 months ago

liouhy commented 5 months ago

Dear all,

Thanks for the great work!

I am trying to run SCENIC+ on another public dataset, but when I ran impute_accessibility:

imputed_acc_obj = impute_accessibility(cistopic_obj, selected_cells=None, selected_regions=None, scale_factor=10**6, chunk_size=1000)

I got this error:

Traceback (most recent call last):
  File "/path/to/myproject/candidate_enhancers.py", line 15, in <module>
    imputed_acc_obj = impute_accessibility(cistopic_obj, selected_cells=None, selected_regions=None, scale_factor=10**6, chunk_size=1000)
  File "/path/to/myproject/mamba_env/lib/python3.8/site-packages/pycisTopic/diff_features.py", line 478, in impute_accessibility
    imputed_acc, region_names_to_keep = calculate_imputed_accessibility(
  File "/path/to/myproject/mamba_env/lib/python3.8/site-packages/pycisTopic/diff_features.py", line 417, in calculate_imputed_accessibility
    imputed_acc = np.empty(
numpy.core._exceptions.MemoryError: Unable to allocate 472. GiB for an array with shape (880058, 143907) and data type int32

I have read this issue https://github.com/aertslab/scenicplus/issues/241, but unfortunately reserving more memory is not possible at my site.
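For reference, the 472 GiB figure in the traceback matches the size of the dense matrix numpy tries to allocate: regions x cells x 4 bytes (int32). A quick sanity check with the numbers from the error message:

```python
# Estimate the dense imputed-accessibility matrix size before running
# impute_accessibility. Shape and dtype taken from the traceback above.
n_regions = 880_058
n_cells = 143_907
bytes_per_value = 4  # int32

total_gib = n_regions * n_cells * bytes_per_value / 2**30
print(f"{total_gib:.0f} GiB")  # matches the ~472 GiB in the error
```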

My question is: is it feasible to split the cistopic object into smaller objects with fewer cells and combine them after running impute_accessibility()? I have already set chunk_size to 1000 on the features; I'm just wondering whether it is also possible to split the cells.

my versions: scenicplus=1.0.1.dev4+ge4bdd9f numpy=1.22.3 python=3.8.16

Best, liouhy

ghuls commented 5 months ago

You could run it with a subset of cell barcodes, e.g. run it 4 times, each time on a different subset. Once you have clusters for each of the 4 runs, subsample the biggest clusters in each, keep all cell barcodes from the small clusters, and run the impute_accessibility step again with that combined set of cell barcodes. That way the matrix fits in memory without losing any resolution.
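Splitting the barcodes into 4 non-overlapping subsets for the initial runs could look like this (the barcode list here is synthetic for illustration; in practice you would take it from cistopic_obj.cell_data.index):

```python
import numpy as np

# Hypothetical barcode list; in practice: cistopic_obj.cell_data.index.
barcodes = np.array([f"cell_{i}" for i in range(143_907)])

# Shuffle so each subset is a random sample of the whole dataset,
# then split into 4 roughly equal, non-overlapping groups.
rng = np.random.default_rng(42)
subsets = np.array_split(rng.permutation(barcodes), 4)

for i, subset in enumerate(subsets):
    print(f"run {i}: {len(subset)} cells")
```

Each subset would then be passed as selected_cells for one of the 4 runs.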

liouhy commented 5 months ago

Did you mean that I could subset the cistopic object based on cell barcodes and run topic modeling separately? If I separate them, at which step should I merge the sub-datasets?

I did try to run impute_accessibility on each cell type and merge the resulting imputed_acc_obj objects as follows, but it still gave me a MemoryError when merging them. Is this what you meant?

imputed_acc_obj = None
for sample in cistopic_obj.cell_data['cell_type'].unique():

    subset_cell = (cistopic_obj.cell_data['cell_type'] == sample)
    cell_list = list(cistopic_obj.cell_data.index[subset_cell])

    temp = impute_accessibility(cistopic_obj, selected_cells=cell_list, selected_regions=None, scale_factor=10**6)
    if imputed_acc_obj is None:
        imputed_acc_obj = temp
    else:
        imputed_acc_obj = imputed_acc_obj.merge([temp], copy=True)

ghuls commented 4 months ago

Run the SCENIC+ workflow until you have cell types / clusters for each of your (e.g. 4) runs. Then, for your biggest clusters, sample only a subset of the cell barcodes (but keep all cell barcodes for the small clusters). Then start again from the beginning with all cell barcodes from the small clusters plus the subsets from the big clusters, so you don't run out of memory.
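The per-cluster subsampling described above could be sketched like this (the cell_type column matches the user's code above; max_cells_per_cluster is a made-up cap you would tune to your memory budget):

```python
import pandas as pd

def subsample_barcodes(cell_data: pd.DataFrame,
                       cluster_col: str = "cell_type",
                       max_cells_per_cluster: int = 20_000,
                       seed: int = 42) -> list:
    """Keep all barcodes from small clusters and a random subset
    from clusters larger than max_cells_per_cluster."""
    keep = []
    for _, group in cell_data.groupby(cluster_col):
        if len(group) <= max_cells_per_cluster:
            keep.extend(group.index)
        else:
            keep.extend(group.sample(n=max_cells_per_cluster,
                                     random_state=seed).index)
    return keep

# Then run imputation once, on the reduced barcode set:
# selected = subsample_barcodes(cistopic_obj.cell_data)
# imputed_acc_obj = impute_accessibility(
#     cistopic_obj, selected_cells=selected,
#     selected_regions=None, scale_factor=10**6)
```

This keeps full resolution for rare cell types while shrinking the matrix where most of the cells (and memory) are.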

liouhy commented 4 months ago

I see. Thanks for your advice!