Open · davidhbrann opened 5 months ago
> Do you think it would make sense to add such code to `cistopic_class`? Or are there already plans to rewrite these functions to reduce their memory usage? Changing some of the pandas code to Polars, as has already been done, might increase the speed (and lazy versions combined with `filter` and `select` could reduce memory usage), but removing the code that makes these giant arrays would really help cut down on memory usage. I haven't looked that carefully at the QC code, but my impression is that it also uses a ton of memory for things that could be done sequentially or without having as many large objects in memory at once.
Yes, it makes a lot of sense. I already wrote an implementation in Polars two months ago that uses a very similar approach, but I haven't had time to integrate it into pycisTopic yet.
```python
# Excerpt (function body) from the Polars-based implementation; assumes
# `pl` (polars), `np` (numpy) and `sp` (scipy) are imported, and that
# `gr_intersection`, `regions_df_pl`, `fragments_cb_filtered_df_pl` and
# `cbs` are defined elsewhere.
region_cb_df_pl = (
    gr_intersection(
        regions1_df_pl=regions_df_pl,
        regions2_df_pl=fragments_cb_filtered_df_pl,
        # how: Literal["all", "containment", "first", "last"] | str | None = None
        how="all",
        regions1_info=True,
        regions2_info=True,
        regions1_coord=False,
        regions2_coord=False,
        regions1_suffix="@1",
        regions2_suffix="@2",
    )
    .rename({"CB@2": "CB"})
    .lazy()
    .group_by(["RegionID", "CB"])
    .agg(
        # Get accessibility in binary form.
        pl.lit(1).alias("accessible_binary"),
        # Get accessibility in count form.
        pl.len().alias("accessible_count"),
    )
    .join(
        regions_df_pl.lazy()
        .select(pl.col("RegionID"))
        .with_row_index("region_idx"),
        on="RegionID",
        how="left",
    )
    .join(
        cbs.to_frame().lazy().with_row_index("CB_idx"),
        on="CB",
        how="left",
    )
    .collect()
)

# Construct binary accessibility matrix as a sparse matrix:
# regions as rows, cells as columns.
binary_matrix = sp.sparse.csr_matrix(
    (
        # All data points are 1:
        # - same as: region_cb_df_pl.get_column("accessible_binary").to_numpy()
        # - for a count matrix, use:
        #   region_cb_df_pl.get_column("accessible_count").to_numpy()
        np.ones(region_cb_df_pl.shape[0], dtype=np.int8),
        (
            region_cb_df_pl.get_column("region_idx").to_numpy(),
            region_cb_df_pl.get_column("CB_idx").to_numpy(),
        ),
    )
)
return binary_matrix
```
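As a self-contained illustration of the final step above (the toy indices and shape are made up for this sketch; real data would come from the joined Polars frame), this is how the binary CSR matrix is built from paired row/column index arrays:

```python
import numpy as np
import scipy.sparse as sp

# Toy (region_idx, CB_idx) pairs for 3 regions x 2 cells.
region_idx = np.array([0, 0, 1, 2])
cb_idx = np.array([0, 1, 1, 0])

# One `1` per (region, cell) pair, mirroring the np.ones(...) call above.
binary_matrix = sp.csr_matrix(
    (np.ones(region_idx.shape[0], dtype=np.int8), (region_idx, cb_idx)),
    shape=(3, 2),
)
print(binary_matrix.toarray())
```

Only the observed (region, cell) pairs are stored, so memory scales with the number of nonzero entries rather than with regions × cells.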
Hi!
Is there an estimate of when this implementation will be available? I'm having the same issue: with just 6 samples, the amount of memory required by `create_cistopic_object_from_fragments` is too large.
Kind regards
@ktroule Hard to predict exactly; it depends on whether I have time for it soon. Some more urgent non-coding related stuff needs my attention first. Likely some other improvements, like much lower memory usage for imputed accessibility, will land at the same time, as we need this internally soon for some datasets.
@davidhbrann @ktroule The polars_1xx branch: https://github.com/aertslab/pycisTopic/compare/main...polars_1xx now contains an implementation: https://github.com/aertslab/pycisTopic/commit/a40e47c2a90e792e8643992da36e900ed1c7708b
It is likely to change in the future.
Some progress has been made on significantly reducing memory usage while calculating imputed accessibility: https://github.com/aertslab/pycisTopic/issues/179#issuecomment-2460210793
I'm running pycisTopic, but `create_cistopic_object_from_fragments` has significant memory usage. As has already been discussed (e.g. see https://github.com/aertslab/pycisTopic/issues/14), the main bottleneck comes from the following lines:
With 100k cells and 500k regions this makes a 200 GB matrix (of `int32`, even though it's soon converted to a `bool`). When this errors (caught with `except (ValueError, MemoryError)`), the current solution is to divide the data into 5 partitions, sequentially run the above `groupby`, and merge the resulting `cistopic_obj_list`. However, I think it would make a lot more sense to avoid creating this wide, dense matrix in the first place.
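A quick sanity check of the 200 GB figure:

```python
# 100k cells x 500k regions of int32 (4 bytes per element).
n_cells, n_regions, itemsize = 100_000, 500_000, 4
total_gb = n_cells * n_regions * itemsize / 1e9
print(total_gb)  # 200.0
```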
`counts_df` is already effectively a tidy dataframe, so it can easily be converted to a `coo_matrix` and then a `csr_matrix` sparse matrix directly, without making the dense `pd.DataFrame`. As an example, see below:
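The example code from the original post is not included in this excerpt; a minimal sketch of the idea (`counts_df` is from the post; the toy values, column names, and the rest are assumptions):

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Tidy (long) counts table with categorical region/cell columns.
counts_df = pd.DataFrame(
    {
        "regionID": pd.Categorical(["r1", "r1", "r2"]),
        "barcode": pd.Categorical(["c1", "c2", "c2"]),
        "count": np.array([1, 1, 2], dtype=np.int32),
    }
)

# Category codes give row/column indices directly, so no dense
# pivot is ever materialized.
mat = sp.coo_matrix(
    (
        counts_df["count"].to_numpy(),
        (
            counts_df["regionID"].cat.codes.to_numpy(),
            counts_df["barcode"].cat.codes.to_numpy(),
        ),
    ),
    shape=(
        len(counts_df["regionID"].cat.categories),
        len(counts_df["barcode"].cat.categories),
    ),
).tocsr()
print(mat.toarray())
```

The row and column order of the sparse matrix then follows the category order of `counts_df`, which is what the post notes about `cell_names`/`region_names` ordering below.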
Making the `csr_matrix` sparse matrix avoids making the dense matrix, and for my data it uses 50 times less memory (e.g. 4 GB vs. 200 GB). The resulting sparse matrix can then easily be passed to `create_cistopic_object`. I haven't tested it that extensively, but I believe the above code would likely generate the same `cistopic_obj` without making any wide, dense arrays. The order of the `cell_names` and `region_names` might be slightly different, but in the new sparse version they should be in the same order as the categories in `counts_df`, which probably makes more sense (and it's easy to call `cistopic_obj.subset()` to reorder them).
Do you think it would make sense to add such code to `cistopic_class`? Or are there already plans to rewrite these functions to reduce their memory usage? Changing some of the pandas code to Polars, as has already been done, might increase the speed (and lazy versions combined with `filter` and `select` could reduce memory usage), but removing the code that makes these giant arrays would really help cut down on memory usage. I haven't looked that carefully at the QC code, but my impression is that it also uses a ton of memory for things that could be done sequentially or without having as many large objects in memory at once.
I'm currently using pycisTopic ver `'2.0a0'`.