aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.

create_cistopic_object_from_fragments: numpy.core._exceptions._ArrayMemoryError [PERFORMANCE] #130

Closed DmitriiSeverinov closed 7 months ago

DmitriiSeverinov commented 7 months ago

Hi all,

I want to create a cisTopic object, and for this I'm running:

cistopic_obj_list = []
for sample_id in fragments_dict:
    # sample_metrics = bc_passing_filter[sample_id]
    cistopic_obj = create_cistopic_object_from_fragments(
        path_to_fragments = fragments_dict[sample_id],
        path_to_regions = path_to_regions[sample_id],
        # metrics = sample_metrics,
        # valid_bc = bc_passing_filter[sample_id],
        n_cpu = 8,
        project = sample_id,
        split_pattern = '-'
    )
    cistopic_obj_list.append(cistopic_obj)

And I'm getting the following error:

2024-04-11 11:28:44,392 cisTopic     INFO     Reading data for Lesion
2024-04-11 11:29:50,055 cisTopic     INFO     Counting number of unique fragments (Unique_nr_frag)
2024-04-11 11:30:16,410 cisTopic     INFO     Counting fragments in regions
2024-04-11 11:30:58,315 cisTopic     INFO     Creating fragment matrix
/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pycisTopic/cistopic_class.py:883: PerformanceWarning: The following operation may generate 268705384570 cells in the resulting pandas object.
  .unstack(level="Name", fill_value=0)
2024-04-11 11:33:35,368 cisTopic     INFO     Data is too big, making partitions. This is a reported error in Pandas versions > 0.21 (https://github.com/pandas-dev/pandas/issues/26314)
/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pycisTopic/cistopic_class.py:951: PerformanceWarning: The following operation may generate 52883251047 cells in the resulting pandas object.
  .unstack(level="Name", fill_value=0)
Traceback (most recent call last):
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pycisTopic/cistopic_class.py", line 883, in create_cistopic_object_from_fragments
    .unstack(level="Name", fill_value=0)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pandas/core/series.py", line 4458, in unstack
    return unstack(self, level, fill_value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pandas/core/reshape/reshape.py", line 493, in unstack
    return unstacker.get_result(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pandas/core/reshape/reshape.py", line 216, in get_result
    values, _ = self.get_new_values(values, fill_value)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pandas/core/reshape/reshape.py", line 268, in get_new_values
    new_values = np.empty(result_shape, dtype=dtype)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 1.96 TiB for an array with shape (561485, 478562) and data type int64

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pycisTopic/cistopic_class.py", line 908, in create_cistopic_object_from_fragments
    cistopic_obj_list = [
                        ^
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pycisTopic/cistopic_class.py", line 909, in <listcomp>
    create_cistopic_object_chunk(
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pycisTopic/cistopic_class.py", line 951, in create_cistopic_object_chunk
    .unstack(level="Name", fill_value=0)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pandas/core/series.py", line 4458, in unstack
    return unstack(self, level, fill_value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pandas/core/reshape/reshape.py", line 493, in unstack
    return unstacker.get_result(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pandas/core/reshape/reshape.py", line 216, in get_result
    values, _ = self.get_new_values(values, fill_value)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/biotec_poetsch/dmse952c/miniconda3/envs/new_scenicplus/lib/python3.11/site-packages/pandas/core/reshape/reshape.py", line 268, in get_new_values
    new_values = np.empty(result_shape, dtype=dtype)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 394. GiB for an array with shape (552519, 95713) and data type int64

Obviously, there is a memory issue here. My dataset has only 14k cells in total. They are distributed across 2 conditions: condition 1 ~6.5k, condition 2 ~7.5k

It was working fine on this dataset before. The only thing I've changed is the cell type annotation. Before, it was a crude one with only 14 cell types. Now I want to go into a bit more detail, so I made it more fine-grained, resulting in 28 cell types.

I'm currently working on a cluster with 500 GB of RAM. Apparently, this is not enough. I have the option of switching to a server with more RAM, but my concern is that I will soon get more data, and I can't increase RAM indefinitely.

Version information

pycisTopic: '2.0a0'

Best, Dmitrii

ghuls commented 7 months ago

There is some work in progress that would replace the current code with a smarter version that will use quite a bit less RAM.

ghuls commented 7 months ago

(561485, 478562) and (552519, 95713) still look way too big if you only have 14k cells.
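
The shapes in the traceback can be checked against the reported allocation sizes with a little arithmetic. The snippet below is only a sketch: the helper name `dense_int64_bytes` is made up for illustration, and treating the first axis as barcodes is an assumption, since the log does not label the axes.

```python
# Size of a dense int64 array: rows * cols * 8 bytes.
INT64_BYTES = 8

def dense_int64_bytes(rows: int, cols: int) -> int:
    """Bytes needed for a dense (rows, cols) int64 array (hypothetical helper)."""
    return rows * cols * INT64_BYTES

# Shapes copied from the two _ArrayMemoryError messages.
full_cells = 561485 * 478562            # matches the first PerformanceWarning
chunk_cells = 552519 * 95713            # matches the second PerformanceWarning

full_tib = dense_int64_bytes(561485, 478562) / 2**40
chunk_gib = dense_int64_bytes(552519, 95713) / 2**30

print(full_cells, f"{full_tib:.2f} TiB")    # 268705384570 1.96 TiB
print(chunk_cells, f"{chunk_gib:.0f} GiB")  # 52883251047 394 GiB

# Assumption: if the first axis were limited to ~14k cell barcodes,
# the same region count would need far less memory.
expected_gib = dense_int64_bytes(14_000, 478562) / 2**30
print(f"{expected_gib:.1f} GiB")
```

Under that assumption, a first axis of more than 550k entries is roughly 40 times larger than the 14k cells expected, which is consistent with the remark that the shapes look too big.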

DmitriiSeverinov commented 7 months ago

Hi @ghuls ,

I am sorry, I think I found the issue in my case: I had the valid_bc parameter commented out...

Sorry for bothering.

Best, Dmitrii
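
As a rough illustration of why a missing barcode whitelist can inflate the matrix, the toy example below counts how many distinct barcodes survive with and without filtering. This is not pycisTopic's implementation; the records, the `valid_barcodes` set, and the helper `dense_cells` are all hypothetical and only stand in for what a `valid_bc`-style filter would do.

```python
# Toy fragment records as (region_name, barcode) pairs. Hypothetical data.
fragments = [
    ("chr1:100-600", "AAAC-1"),
    ("chr1:100-600", "AAAC-1"),
    ("chr1:900-1400", "AAAG-1"),
    ("chr2:50-550", "TTTG-1"),   # background barcode, not a called cell
    ("chr2:50-550", "GGGA-1"),   # background barcode, not a called cell
]

# Barcodes that passed quality filtering (stand-in for a whitelist).
valid_barcodes = {"AAAC-1", "AAAG-1"}

# The region set is fixed by the regions file, independent of the barcodes.
all_regions = {region for region, _ in fragments}

def dense_cells(records, n_regions):
    """Cells in a dense barcode-by-region matrix built from the records."""
    n_barcodes = len({barcode for _, barcode in records})
    return n_barcodes * n_regions

unfiltered = dense_cells(fragments, len(all_regions))
filtered = dense_cells(
    [rec for rec in fragments if rec[1] in valid_barcodes],
    len(all_regions),
)

print(unfiltered, filtered)  # 12 6
```

Without a whitelist, every barcode seen in the fragments file gets its own row, including background barcodes, so the dense matrix grows with the total number of observed barcodes rather than the number of cells.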