bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

rowSums is slower for smaller matrix #135

Closed buutrg closed 1 month ago

buutrg commented 1 month ago

Hi authors,

I am trying to run rowSums, but it seems to run slower on a smaller matrix:

Larger matrix:

>         peakmat
        system.time({
          a = rowSums(peakmat)
        })
707318 x 604829 IterableMatrix object with class RenameDims

Row names: chr1-161365-161865, chr1-176154-176654 ... chr22-51223654-51224154
Col names: esophagus_mucosa_SM-AZPYJ_rep1#AAACTACCAGGAAGCCGTTGGT, esophagus_mucosa_SM-AZPYJ_rep1#TTATGGATGCTCCCTATAGCCA ... stomach_SM-JF1O3_rep1#CCTTGACGTCGTCAGTAGTCTG

Data type: uint32_t
Storage order: column major

Queued Operations:
1. Load compressed matrix from directory /n/holyscratch01/price_lab/Lab/btruong/atac/single_cell/adult_atlas/peakMat_atlas_adult_BPCells_autochr_lifted_tfidf
2. Reset dimnames
   user  system elapsed 
  5.271   0.571   5.867 

Smaller matrix:

>         peakmat
        system.time({
          a = rowSums(peakmat)
        })
422496 x 139835 IterableMatrix object with class RenameDims

Row names: chr1-730312-730812, chr1-752484-752984 ... chr22-51221826-51222326
Col names: HCAHeartST10773166_HCAHeartST10781063_GCGGTTGGTCCGCTGT-1, HCAHeartST10773166_HCAHeartST10781063_TGACTTCGTTTGGTTC-1 ... HCAHeartST10773171_HCAHeartST10781448_CCATTATTCACATTGA-1

Data type: uint32_t
Storage order: column major

Queued Operations:
1. Load compressed matrix from directory /n/holyscratch01/price_lab/Lab/btruong/atac/single_cell/heartcellatlas/peakMat_atlas_heart_BPCells_autochr_lifted_tfidf                  
2. Reset dimnames

   user  system elapsed 
120.545  43.102 164.122

I am using the same resources for both runs: 16 cores, 32 GB RAM, and the same processor. Can you suggest what could be going wrong here? Your help is really appreciated!

Best, Buu


Update: I just realized that when I convert the small matrix to dgCMatrix, its 0 entries are still stored and print as 0, while in the large matrix they print as ".".

buutrg commented 1 month ago

I solved it. It seems there was a problem when converting from a different data format to dgCMatrix/IterableMatrix: the conversion preserved 0 entries as stored values.
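
For anyone hitting the same symptom, here is a minimal sketch of what explicitly stored zeros look like in a dgCMatrix and one way to drop them. It assumes the Matrix package; the toy matrix is made up for illustration, and `Matrix::drop0()` is only one possible cleanup, not necessarily the fix used above.

```r
library(Matrix)

# Hypothetical 3 x 3 toy matrix with four stored entries, two of which are 0.
# Built directly from the compressed-column slots so the zeros stay stored.
m <- new(
  "dgCMatrix",
  i = c(0L, 1L, 2L, 0L),   # 0-based row index of each stored entry
  p = c(0L, 2L, 3L, 4L),   # column pointers: entries 1-2, 3, 4
  x = c(5, 0, 0, 2),       # stored values, including explicit zeros
  Dim = c(3L, 3L)
)

# Printing shows stored zeros as 0 and structural zeros as "."
print(m)

# Count stored entries and how many of them are zero
n_stored <- length(m@x)
n_explicit_zero <- sum(m@x == 0)

# drop0() removes the explicitly stored zeros without changing any values
cleaned <- drop0(m)
n_stored_after <- length(cleaned@x)
```

Row sums are the same before and after, but the cleaned matrix stores fewer entries. Fewer stored entries should mean less work when iterating over the matrix, although the thread does not confirm that this alone explains the timing difference.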

bnprks commented 1 month ago

Hi @buutrg, glad you were able to figure out your issue! I'll also mention that if you have 16 cores, you could consider calculating row means rather than row sums with the matrix_stats() function, which has easy support for multi-threading. (There's also the BPCells::parallel_split() function that matrix_stats() uses internally, though that's a bit more error-prone to use.)
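
A minimal sketch of the matrix_stats() suggestion, assuming BPCells and Matrix are installed. The small random matrix and the thread count of 4 are placeholders, not values from this thread. Since matrix_stats() reports means, row sums are recovered here by multiplying by the number of columns.

```r
library(BPCells)
library(Matrix)

# Hypothetical stand-in for the peak matrix: a small random sparse matrix
set.seed(1)
m <- rsparsematrix(200, 50, density = 0.1,
                   rand.x = function(n) rpois(n, 2) + 1)
mat <- as(m, "IterableMatrix")

# Compute row means with several threads (4 is an arbitrary choice)
stats <- matrix_stats(mat, row_stats = "mean", threads = 4)
row_means <- stats$row_stats["mean", ]

# Convert means back to sums: sum = mean * number of columns
row_sums <- row_means * ncol(mat)
```

On a matrix this small the threading will not make a visible difference; the point is only to show the shape of the call and how to get from the returned means back to sums.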