bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

saving a merged BPCells object #90

Open Flu09 opened 1 month ago

Flu09 commented 1 month ago

I had multiple Seurat objects; I wrote each to disk and then merged them. How do I save the merged object? Do I just call saveRDS(), or do I write it into a matrix first?

write_matrix_dir(mat = object1[["RNA"]]$counts, dir = '/tmp/object1')
counts.mat.object1 <- open_matrix_dir(dir = "/tmp/object1")
counts.mat.object1
object1[["RNA"]]$counts <- counts.mat.object1

write_matrix_dir(mat = object2[["RNA"]]$counts, dir = '/tmp/object2')
counts.mat.object2 <- open_matrix_dir(dir = "/tmp/object2")
counts.mat.object2
object2[["RNA"]]$counts <- counts.mat.object2

merged <- merge(object1, c(object2, object3, object4))

bnprks commented 1 month ago

Hi @Flu09, either approach should be fine and all downstream operations should work regardless of the choice you make.

saveRDS() is probably the easiest approach, just remember that the BPCells object will store the absolute path for its input files, so if those files are moved/deleted the RDS object won't be able to find the data.

Alternatively, you can merge multiple matrices into a single file by calling write_matrix_dir(), which will write a new file to disk. This can be preferable for performance if you have a large number of samples, or if you want to be able to copy the matrix files to a new computer more easily.
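
As a rough illustration of this second approach, the sketch below writes two small stand-in matrices to separate directories, concatenates them lazily with cbind(), and then writes the result to a single new directory. The toy matrices, the temporary directory names, and the variable names are assumptions made for the example; they do not come from the thread.

```r
library(BPCells)
library(Matrix)

# Stand-in count matrices (genes x cells); real data would come from
# object1[["RNA"]]$counts, object2[["RNA"]]$counts, and so on.
set.seed(1)
genes <- paste0("gene", 1:5)
m1 <- Matrix(matrix(as.numeric(rpois(15, 2)), nrow = 5,
                    dimnames = list(genes, paste0("a_cell", 1:3))),
             sparse = TRUE)
m2 <- Matrix(matrix(as.numeric(rpois(15, 2)), nrow = 5,
                    dimnames = list(genes, paste0("b_cell", 1:3))),
             sparse = TRUE)

# Hypothetical output locations under the R session's temp directory
dir1 <- tempfile("object1_")
dir2 <- tempfile("object2_")
merged_dir <- tempfile("merged_")

# Write each matrix to its own directory; write_matrix_dir() returns
# a BPCells matrix backed by the files it just wrote
bp1 <- write_matrix_dir(mat = as(m1, "IterableMatrix"), dir = dir1)
bp2 <- write_matrix_dir(mat = as(m2, "IterableMatrix"), dir = dir2)

# cbind() on BPCells matrices is lazy: the result still reads from dir1 and dir2
combined <- cbind(bp1, bp2)

# Writing the combined matrix produces one directory holding all the data
merged_bp <- write_matrix_dir(mat = combined, dir = merged_dir)
merged_bp
```

After this step, merged_bp depends only on merged_dir, so that one directory is the only thing that needs to be copied along with any RDS file that refers to it.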

Hope that helps -Ben

Flu09 commented 1 month ago

Thanks. May I ask about another issue? I feel it might be more related to Seurat, though.

I faced this error when integrating. Is something wrong with the structure of the object?

merged <- NormalizeData(merged)
merged[["RNA"]] <- split(merged[["RNA"]], f = merged$dataset)
merged <- JoinLayers(merged)
merged <- FindVariableFeatures(merged)
merged <- SketchData(object = merged, ncells = 10000, method = "LeverageScore", sketched.assay = "sketch")
DefaultAssay(merged) <- "sketch"
merged <- FindVariableFeatures(merged, verbose = F)
merged <- ScaleData(merged, verbose = F)
merged <- RunPCA(merged, verbose = F)

merged <- IntegrateLayers(
  object = merged, method = FastMNNIntegration,
  new.reduction = "integrated.mnn",
  verbose = TRUE
)
Converting layers to SingleCellExperiment
Running fastMNN
Error in validObject(.Object) : invalid class "ScaledMatrix" object: the supplied seed must support extract_array()

bnprks commented 1 month ago

This error looks like it's more on the Seurat side. From the error message, I think that Seurat must be calling a function that expects to receive a DelayedArray object but it is instead getting a BPCells object passed to it.

I think the underlying Bioconductor package batchelor, with its function fastMNN, is probably the limiting factor here. The algorithm would probably work if passed PCA dimensions directly, but it doesn't seem to support that kind of input, which is why it gets passed a BPCells matrix in the first place (which it then doesn't know how to deal with). It's possible the Seurat folks could come up with a workaround, but I'm not aware of a good way to fix this from the BPCells side.

Flu09 commented 4 weeks ago

Thank you. I want to ask again about saving the RDS object. What I understood is that we could move the BPCells folder to a new location. I moved the contents from /tmp/obj2 to /tmp/tmp/obj2.

counts.mat.obj2 <- open_matrix_dir(dir = "/tmp/tmp/obj2")
counts.mat.obj2
obj2[["RNA"]]$counts <- counts.mat.obj2
markers <- FindMarkers(obj2, ident.1 = 23, ident.2 = 3)
Error: Missing directory: /tmp/obj2

Another question I have is how to save the metadata of a BPCells object and then read it back if needed.

bnprks commented 3 weeks ago

From your example, I think a problem you have is that you have already normalized your Seurat object prior to moving the BPCells folder. Therefore, the data layer (obj2[["RNA"]]$data) will still be a BPCells object that points to the old directory. It is possible to manually adjust that object as well, using all_matrix_inputs(). e.g.:

all_matrix_inputs(obj2[["RNA"]]$data) <- list(open_matrix_dir(dir="/tmp/tmp/obj2"))

This can be somewhat error-prone if you are merging together several different data sources. In general I would recommend not moving the data for a BPCells-based project more than you absolutely have to. (Note that BPCells doesn't modify files on disk unless you explicitly call write_matrix_dir() with overwrite=TRUE, so multiple objects can read from the same data source without interfering with each other.)
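
To make the re-pointing step concrete outside of Seurat, here is a minimal sketch using a plain BPCells matrix. The toy matrix, the log1p() transform standing in for a normalized data layer, and the temporary directory names are all assumptions for illustration; only all_matrix_inputs(), open_matrix_dir(), and write_matrix_dir() come from the discussion above.

```r
library(BPCells)
library(Matrix)

# Stand-in count matrix (genes x cells)
set.seed(1)
m <- Matrix(matrix(as.numeric(rpois(20, 2)), nrow = 4,
                   dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5))),
            sparse = TRUE)

# Hypothetical old and new locations under the R session's temp directory
old_dir <- tempfile("obj2_old_")
new_dir <- tempfile("obj2_new_")

counts <- write_matrix_dir(mat = as(m, "IterableMatrix"), dir = old_dir)

# A lazy transform standing in for a normalized "data" layer;
# it keeps a reference to the on-disk matrix in old_dir
norm <- log1p(counts)

# Move the directory; `norm` still records the old path internally
stopifnot(file.rename(old_dir, new_dir))

# Re-point the single on-disk input of the transform at the new location
all_matrix_inputs(norm) <- list(open_matrix_dir(dir = new_dir))

# Reading now succeeds because the data is found in new_dir
restored <- as(norm, "dgCMatrix")
restored
```

With several merged inputs, the list passed to all_matrix_inputs() needs one entry per input in the matching order, which is where the error-proneness mentioned above comes from.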

As for saving metadata, BPCells itself doesn't handle very much metadata, just row names and column names for matrices. Most of Seurat's metadata is handled by Seurat itself and doesn't get put into BPCells. So in that case, a normal saveRDS should suffice to store/load your Seurat object.
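
If only the metadata table is needed on its own, one option is to save it separately with base R. The sketch below uses a small stand-in data frame in place of a real Seurat metadata table (in a live session that would typically be obj2@meta.data); the column names, cell names, and file paths are made up for the example.

```r
# Stand-in for a Seurat metadata table; in a real session this would be
# obj2@meta.data, which is an ordinary data.frame with one row per cell.
meta <- data.frame(
  dataset = c("A", "A", "B"),
  nCount_RNA = c(1200, 950, 1800),
  row.names = c("cell1", "cell2", "cell3")
)

# Hypothetical output paths under the R session's temp directory
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

# RDS keeps column types and row names exactly as they are
saveRDS(meta, rds_path)
meta_rds <- readRDS(rds_path)

# CSV is readable outside R; row names are written as the first column
write.csv(meta, csv_path)
meta_csv <- read.csv(csv_path, row.names = 1)
```

The RDS route is the closer match to saving the whole Seurat object, while the CSV copy is mainly useful for sharing the table with tools outside R.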