bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

BPCells matrix to sparse matrix #40

Open Dario-Rocha opened 1 year ago

Dario-Rocha commented 1 year ago

I've been using this package for some weeks now, and it's amazing how it allows processing big datasets in R. After doing some work, I need to export a trimmed version of the dataset in the usual sparse matrix format for compatibility reasons, but I am failing to find a way to do so. Would you be so kind as to point me in the right direction?

bnprks commented 1 year ago

Hi Dario, glad it's been working well for you! You can just run as(bpcells_mat, "dgCMatrix") to convert to an R sparse matrix.

I'll add something to the docs so this is a bit more obvious
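
For reference, a minimal sketch of the round trip, assuming the matrix was previously written with write_matrix_dir() (the directory path here is a placeholder):

library(BPCells)
library(Matrix)

# Open the on-disk BPCells matrix; this is lazy and loads no data yet
bpcells_mat <- open_matrix_dir("path/to/matrix_dir")

# Materialize it as a standard in-memory dgCMatrix sparse matrix
sparse_mat <- as(bpcells_mat, "dgCMatrix")
class(sparse_mat)  # "dgCMatrix"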

Dario-Rocha commented 1 year ago

Thank you for your reply; however, I can't manage to do this because of the following error:

Error: Error opening file: matrix_bp_soup/index_starts

Maybe this is a consequence of the expression matrix in question being stored in a v5 Seurat object that has been processed, saved as RDS, and loaded multiple times. Do you think this is a problem caused by Seurat processing?

bnprks commented 1 year ago

To start with troubleshooting, some basic things to check:

  1. In the working directory of your R project, is there a file with the path matrix_bp_soup/index_starts? Do you have read permissions to it? (See the sketch after this list for a quick way to check.)
  2. If you haven't updated BPCells in a while, you might try reinstalling, as newer versions give a more informative error message when opening a file fails.
  3. If you just print bpcells_mat, it should list a set of delayed transformations, starting with the directory the files will be loaded from. Is this where you expect it to be?
  4. (unlikely) Do you happen to be using Windows with more than 63 matrices active at once? This can cause a similar error, though that issue is also largely fixed in newer BPCells versions.
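
For item 1, a quick way to check from R (the path below is copied from the error message; adjust it if the matrix directory lives elsewhere):

# Does the file exist relative to the current working directory, and is it readable?
target <- file.path("matrix_bp_soup", "index_starts")
file.exists(target)                  # TRUE if the path resolves
file.access(target, mode = 4) == 0   # TRUE if this R process has read permission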

It is possible this has to do with Seurat processing, as they are overloading the saveRDS function for Seurat projects. My rough understanding is that for BPCells objects this can result in relocating the original data files to a new directory and updating the BPCells R objects to point to the new file paths. If for some reason the R object has gotten out of sync with where the files are stored, this could cause errors. (Though it is possible to manually patch up broken paths with an experimental BPCells function -- let me know if you need more info on this)
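
In case it helps, a rough sketch of that path-repair idea, assuming the experimental helper in your installed BPCells version is all_matrix_inputs() (check ?all_matrix_inputs first; the replacement path below is a placeholder):

# Inspect which on-disk inputs the delayed matrix points at
inputs <- all_matrix_inputs(bpcells_mat)
inputs[[1]]  # printing shows the directory it expects to read from

# Re-point an input at the directory where the files actually live,
# then assign the repaired list of inputs back into the object
inputs[[1]] <- open_matrix_dir("/correct/path/to/matrix_dir")
all_matrix_inputs(bpcells_mat) <- inputs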

Could you let me know what the results are of the troubleshooting steps I listed above?

Dario-Rocha commented 1 year ago

After updating BPCells, the error now indicates that too many matrices are open. Indeed, this Seurat object was created from a list of 92 BPCells matrices. I have read permissions to the /index_starts files of the matrices, even though they are not stored in the working directory of the script. The first line of queued operations isn't really pointing to a location. It may have something to do with the layers of the Seurat object each being a combination of two BPCells matrices; that's why it indicates 46 matrices while in fact there are 92 samples.

Queued Operations:

  1. Concatenate columns of 46 matrix objects with classes: RenameDims, RenameDims ... RenameDims
  2. Select rows: 1, 2 ... 36601 and cols: 1, 2 ... 925242

I am running macOS Ventura 13.3.1.

bnprks commented 1 year ago

Hi Dario, I see your issue in the Seurat repo -- for the BPCells-specific part of this discussion I think we can keep things here.

Thanks for checking up on those details. I was not aware that Macs also sometimes had a max open files issue, but at least this source claims the default limit is 256, which is too low to handle 92 matrices at once with BPCells right now.

There are two directions you could go for solutions:

  1. Merge some of the BPCells matrices using the rbind or cbind functions, then save the result back out to disk. This can combine multiple files into one. E.g. you could combine 92 matrices -> 8 (and optionally further combine down to 1) as a workaround to the open file limits (see the sketch after this list).
  2. You could adjust the maximum open file limit via the macOS command line. I believe running ulimit -n 1024 should work, and you could add that to your .zshrc or .bashrc file so you don't have to type it in every time before running R. This stackexchange answer seems to have some more involved suggestions for a permanent increase in the maximum file limit.
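
A minimal sketch of option 1, assuming the per-sample matrices are stored as on-disk directories and share the same genes/rows (the paths below are placeholders):

library(BPCells)

# Open each per-sample matrix lazily
dirs <- list.dirs("per_sample_matrices", recursive = FALSE)
mats <- lapply(dirs, open_matrix_dir)

# Concatenate columns (cells) across samples, then write a single on-disk matrix
merged <- do.call(cbind, mats)
merged <- write_matrix_dir(merged, dir = "merged_matrix", overwrite = TRUE)
merged  # now backed by a single directory, so far fewer files stay open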

It's a bit tricky for BPCells to decrease the number of open files in these cases, so one of those two workarounds is likely your best option in the near term. Let me know if one of those works for you.

Dario-Rocha commented 8 months ago

Hello there. Although this is not exactly the same issue, I've decided to reply here because it's basically the same thing, just with a different error. I have a Seurat v5 object saved as a .qs file. I was tasked to extract and export the counts matrix, so I tried creating a new BPCells matrix object and converting the BPCells matrix to a dgCMatrix; in both cases the error is the same:

no slot of name "threads" for this object of class "ColBindMatrices"

library(qs)       # for qread()
library(Seurat)   # for JoinLayers()
library(BPCells)  # for write_matrix_dir()

temp_lo <- qread('complete_v01_1_rpcacd25lo_seuratv5_anon.qs')

temp_lo <- temp_lo[["RNA"]]
temp_lo <- JoinLayers(temp_lo)
dim(temp_lo)
temp_lo
temp_lo <- temp_lo$counts

# Either creating a new on-disk matrix...
write_matrix_dir(mat = temp_lo, dir = 'file_path',
                 overwrite = TRUE)

# ...or converting to dgCMatrix fails with the same error
temp_lo <- as(temp_lo, "dgCMatrix")

Error in iter_function(iterators, x@threads) :
  no slot of name "threads" for this object of class "ColBindMatrices"

bnprks commented 8 months ago

The cause here appears to be that the file you're loading was created on an earlier version of BPCells, before the threads slot was added to the ColBindMatrices class. Therefore, once you load the object from disk it looks like it is missing the slot.

In this case, assuming that class(temp_lo$counts) is ColBindMatrices, I think it will suffice to run temp_lo$counts@threads <- 1L. I believe this will print out a warning message, but after that things should work okay. If the top-level layer of temp_lo$counts is not a ColBindMatrices object, you may need to dig through a couple of layers of @matrix slots (e.g. temp_lo$counts@matrix@matrix@threads <- 1L).
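
A minimal sketch of that patch, assuming temp_lo is the assay object from your steps above (i.e. after temp_lo <- temp_lo[["RNA"]] and JoinLayers()), so temp_lo$counts is the BPCells matrix:

# If the top-level counts object is the one missing the slot, set it directly
if (is(temp_lo$counts, "ColBindMatrices")) {
  temp_lo$counts@threads <- 1L
}

# If the ColBindMatrices is wrapped inside other transforms, walk down the
# @matrix slots instead, e.g.:
# temp_lo$counts@matrix@matrix@threads <- 1L

# The conversion should then succeed (possibly after a warning)
mat <- as(temp_lo$counts, "dgCMatrix")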

Given that there have been similar issues from a different update in #79, it seems I should look into making a helper function to update BPCells objects from old versions to the latest version.