bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

Need some help to extract data from a shared object. #46

Closed wgmao closed 1 year ago

wgmao commented 1 year ago

I am new to BPCells. Thank you very much for developing and maintaining this awesome package! My collaborator shared a large collection of multiome data as a Seurat object (Assay5), and I would like to extract the RNA count matrix from it. The count matrix `obj@assays$RNA@layers$counts` prints the following description:

```
36588 x 10 IterableMatrix object with class RenameDims

Row names: A1BG, A1BG-AS1 ... ZZEF1
Col names: A_AAACAGCCAAACCTAT-1, A_AAACGTACACGAACAG-1 ... A_AACTCACAGGATTGAG-1

Data type: uint32_t
Storage order: column major

Queued Operations:
1. Load compressed matrix from directory "/the path that hosts the data"
2. Select rows: 1, 2 ... 36588 and cols: 826155, 826180 ... 826291
3. Reset dimnames
```

I tried two commands, `temp %>% write_matrix_memory(compress = F)` and `as.matrix(obj@assays$RNA@layers$counts)`, which both lead to the same error: `Missing directory: /the path that hosts the data`.

bnprks commented 1 year ago

One clarification question -- I see the first line under "Queued Operations" says it is loading a compressed matrix from the directory "/the path that hosts the data". This should be an actual path to files on the filesystem, e.g. "/home/wgmao/datafiles/matrix_folder". Have you edited out the actual path for privacy, or is that the object you were given?

The command `as.matrix()` normally works, but if the underlying data folder doesn't exist, it will error as you've seen.

The way BPCells objects work in R is that the R object stores any queued operations (e.g. subsetting or normalization) in successive layers, and rather than storing the actual matrix data in R, it stores only the path of the files on disk that hold the data.
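This path-based design can be seen in a minimal sketch (the toy matrix and temp directory are illustrative; `write_matrix_dir()`, `open_matrix_dir()`, and the "Queued Operations" printout are the same BPCells features discussed in this thread):

```r
library(BPCells)

# Write a tiny matrix to a directory; the returned object only
# references that directory rather than holding the data in R.
dir <- tempfile("bpcells_demo")
mat <- write_matrix_dir(as(matrix(1:6, nrow = 2), "dgCMatrix"), dir)

# Subsetting is queued as a layer, not executed immediately:
sub <- mat[, 1:2]
sub  # printing shows "Queued Operations", starting with the load-from-directory step

# Data is only read from disk when materialized:
as.matrix(sub)

# If `dir` is later deleted or moved, as.matrix(sub) fails with
# "Missing directory: ..." -- the error seen above.
```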

In terms of sharing, there are two easy options, and a third more complicated but flexible option:

  1. Transfer the BPCells matrix directory, re-open it via `open_matrix_dir()`, and re-run any processing from the raw data. This is the easiest and most reliable option.
  2. If you are on a shared filesystem, it is possible to save a normalized BPCells matrix via `saveRDS()`; another user can then open the object and access the data with `readRDS()`, provided that the underlying matrix directory has not moved and is accessible to both users.
  3. It is possible to share a normalized BPCells matrix, migrate the underlying data, and edit the R object to point to the data's new location on disk. This uses the `all_matrix_inputs()` function in BPCells, but it is a bit tricky. I believe Seurat is aiming to do this automatically when running `saveRDS()` on a Seurat object containing BPCells matrices.

This third option will also help if you want to discard embedded operations and access the original raw data.
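The first option above can be sketched as follows (the directory paths are hypothetical placeholders):

```r
library(BPCells)

# Collaborator's side: they wrote the counts with write_matrix_dir(),
# producing a directory of compressed matrix files, e.g.
# /home/collab/rna_counts -- copy that whole directory to your machine.

# Your side: re-open the transferred directory and redo any processing
mat <- open_matrix_dir("/home/wgmao/rna_counts")  # hypothetical local copy

# e.g. re-apply the cell selection that was queued in the shared object:
# counts <- mat[, my_cell_barcodes]
```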

There are of course some exceptions to the rule that the underlying data must be shared in addition to the R object -- for example, BPCells also allows saving matrices fully in memory via `write_matrix_memory()`. But in your case, I think the data you need lives in a directory that was not shared with you.
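As a sketch of that in-memory exception (the source path and output filename are hypothetical; `write_matrix_memory()` is the BPCells function mentioned above):

```r
library(BPCells)

# An in-memory BPCells matrix carries its data inside the R object,
# so it survives saveRDS()/readRDS() without the original directory:
mem_mat <- write_matrix_memory(
  open_matrix_dir("/home/wgmao/rna_counts"),  # hypothetical path
  compress = FALSE
)
saveRDS(mem_mat, "counts_in_memory.rds")

# A collaborator can readRDS("counts_in_memory.rds") with no access
# to the original matrix directory.
```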

wgmao commented 1 year ago

Thank you for the prompt response! Yes, the actual path includes personal information. I replaced it with this pseudo path. Your suggestions make a lot of sense to me. I have two follow-up questions:

bnprks commented 1 year ago
  1. 200 MB is not unusual for an unfiltered 10x run. E.g., if I load this public 10x dataset I also get 200 MB, and closer inspection reveals that 99% of that space is used storing the cell barcodes in memory. I might stop storing cell barcodes in memory in the future if I can be sure it won't negatively affect speed. If you write a filtered version of the dataset out to disk and load that, the memory usage will be dramatically lower (I get 3.7 MB on this dataset).
  2. I've included a code example below showing `all_matrix_inputs()` -- notice how it allows assignment via `<-` as well as querying (though my example doesn't show querying). For clarity, the example uses underlying data that is not identical; in your case you would probably want `dir2` to be a copy of `dir1`.
Code example with `all_matrix_inputs`:

``` r
library(BPCells)
library(magrittr)

dir1 <- tempfile("tmp_test1")
dir2 <- tempfile("tmp_test2")

mat1 <- matrix(1:12, nrow=3) %>% as("dgCMatrix") %>% write_matrix_dir(dir1)
as.matrix(mat1)
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    4    7   10
#> [2,]    2    5    8   11
#> [3,]    3    6    9   12

mat2 <- matrix(-(1:12), nrow=3) %>% as("dgCMatrix") %>% write_matrix_dir(dir2)
as.matrix(mat2)
#>      [,1] [,2] [,3] [,4]
#> [1,]   -1   -4   -7  -10
#> [2,]   -2   -5   -8  -11
#> [3,]   -3   -6   -9  -12

mat <- open_matrix_dir(dir1) + 5
mat
#> 3 x 4 IterableMatrix object with class TransformScaleShift
#>
#> Row names: unknown names
#> Col names: unknown names
#>
#> Data type: double
#> Storage order: column major
#>
#> Queued Operations:
#> 1. Load compressed matrix from directory /private/var/folders/98/wxt58r8s11vg7qzb4s26l8h40000gn/T/Rtmpp2RV5j/tmp_test1431a22c5fc55
#> 2. Shift by 5

as.matrix(mat)
#>      [,1] [,2] [,3] [,4]
#> [1,]    6    9   12   15
#> [2,]    7   10   13   16
#> [3,]    8   11   14   17

all_matrix_inputs(mat) <- list(open_matrix_dir(dir2))
mat
#> 3 x 4 IterableMatrix object with class TransformScaleShift
#>
#> Row names: unknown names
#> Col names: unknown names
#>
#> Data type: double
#> Storage order: column major
#>
#> Queued Operations:
#> 1. Load compressed matrix from directory /private/var/folders/98/wxt58r8s11vg7qzb4s26l8h40000gn/T/Rtmpp2RV5j/tmp_test2431a74394248
#> 2. Shift by 5

as.matrix(mat)
#>      [,1] [,2] [,3] [,4]
#> [1,]    4    1   -2   -5
#> [2,]    3    0   -3   -6
#> [3,]    2   -1   -4   -7
```

Created on 2023-09-14 with [reprex v2.0.2](https://reprex.tidyverse.org)
wgmao commented 1 year ago

Thank you so much for your detailed response! I have to say you are one of the most responsive developers I have met so far! I really appreciate it.