TileDB-Inc / tiledbsc

Single-cell data structures in TileDB
https://tiledb-inc.github.io/tiledbsc/

Optimize COO conversion #75

Closed aaronwolen closed 2 years ago

aaronwolen commented 2 years ago

Converting sparse/dense matrix objects to COO-formatted data frames is now handled by a new, more efficient internal utility, matrix_to_coo(), which replaces the old utility, dgtmatrix_to_dataframe().

The new utility accepts any matrix-like object coercible to a TsparseMatrix and uses Matrix::mat2triplet() to perform the conversion. Matrix dimension labels are now stored as factors in the COO data frame's index columns to avoid fully materializing the character vectors. Because the index columns already contain integers that map to the original dimension labels, we manually create the factor vectors to avoid overhead imposed by the factor()/as.factor() constructors.
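The approach described above can be sketched roughly as follows. This is an illustrative re-implementation, not the package's internal code: the function name matrix_to_coo_sketch(), the helper make_factor(), and the column names "i", "j", and "value" are assumptions made for the example. Only Matrix::mat2triplet() and the idea of building the factors by hand come from the description.

```r
library(Matrix)

# Hypothetical sketch of the conversion; names and defaults are assumptions.
matrix_to_coo_sketch <- function(x, index_cols = c("i", "j"), value_col = "value") {
  stopifnot(!is.null(rownames(x)), !is.null(colnames(x)))

  # mat2triplet() coerces x to a TsparseMatrix and returns a list with
  # 1-based integer vectors i and j plus the non-zero values x.
  trip <- Matrix::mat2triplet(x)

  # Build a factor directly from integer codes and the label vector,
  # skipping the matching/sorting work done by factor()/as.factor().
  make_factor <- function(codes, labels) {
    structure(as.integer(codes), levels = labels, class = "factor")
  }

  coo <- data.frame(
    make_factor(trip$i, rownames(x)),
    make_factor(trip$j, colnames(x)),
    trip$x
  )
  names(coo) <- c(index_cols, value_col)
  coo
}

# Small usage example with labelled dimensions.
m <- Matrix::sparseMatrix(
  i = c(1, 3, 2), j = c(1, 1, 2), x = c(10, 20, 30),
  dims = c(3, 2),
  dimnames = list(c("g1", "g2", "g3"), c("c1", "c2"))
)
coo <- matrix_to_coo_sketch(m)
```

Because the integer codes returned by mat2triplet() already index into the dimnames, the label vectors are stored once as factor levels rather than being repeated for every non-zero entry.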

Converting a 50k × 1k sparse matrix with string dimension labels and 60% density is about 8× faster with matrix_to_coo() than with dgtmatrix_to_dataframe(), and requires about 1/5 as much memory:

#   expression   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <dbl>  <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl>
# 1 old        1397.  1635.     0.597    1.12GB    0.776    10    13     16761.
# 2 new         176.   232.     4.48   228.88MB    0.896    10     2      2233.

aaronwolen commented 2 years ago

CC @dnadave @kaitlin-procogia @augustine-procogia @dan11mcguire