bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
134 stars 11 forks source link

BPCells to h5ad #93

Open Sirin24 opened 3 weeks ago

Sirin24 commented 3 weeks ago

Thank you for this great tool. I have finished integrating and annotating my data which could not have been done without this tool. Now I need to convert the object to h5ad to do some downstream analysis.

write_matrix_hdf5(obj[["RNA"]]$counts, "counts.h5")

I am wondering how to save the data and scaled values as well, and the metadata, information about clusters and umap to have a complete object in h5ad.

bnprks commented 3 weeks ago

Hi @Sirin24, glad that BPCells has been useful for you! If you want to write to h5ad, first thing you'll want is to use the function write_matrix_anndata_hdf5(). This will write to the main X matrix by default, but you can pass group="layers/normalized" for example to write to a specific layer name of your choosing. You would call it once for each matrix you want to write to the h5ad file. (The write_matrix_hdf5 function you mentioned writes a BPCells-specific format, and is not what you want for this use-case).

For saving metadata, cluster, and umap information this is not something that BPCells handles directly (we focus on matrices and scATAC fragment data). My recommendation would be that you write this data to a file in csv/tsv format, then load it in python and add it to your h5ad file using the python anndata library.

If you are feeling confident you can modify the h5ad file directly from R using the hdf5r library, but it's more error-prone and you'd need to also carefully read the anndata specification for handling things like cluster names or other categorical metadata

-Ben

Flu09 commented 1 week ago

Would you kindly tell what I am doing wrong? I am experimenting on a small object but I have a big object which I will need to convert into h5ad. First I wrote only the counts layer to disk, then normalized, clustered and annotated my object using seurat, then i want to convert it into h5ad.

#Write the joined counts layer to a directory
#write_matrix_dir(mat = mergedsamples[["RNA"]]$counts, dir = '/tmp/object_mergedsamples')
counts.mat.object <- open_matrix_dir(dir = "/tmp/object_mergedsamples")
counts.mat.object
mergedsamples[["RNA"]]$counts <- counts.mat.object

mergedsamples
An object of class Seurat 
40090 features across 10799 samples within 1 assay 
Active assay: RNA (40090 features, 2000 variable features)
 3 layers present: data, counts, scale.data
 2 dimensional reductions calculated: pca, umap
> # Write count matrix
> write_matrix_anndata_hdf5(
+   mat = mergedsamples[["RNA"]]$counts,
+   path = "object_mergedsamples.h5ad",
+   group = "X"
+ )
40090 x 10799 IterableMatrix object with class AnnDataMatrixH5

Row names: ENSG00000228037, PRDM16 ... ENSG00000277475
Col names: AAACCCAAGCCGAATG-1_1, AAACCCACAAGCTCTA-1_1 ... TTTGTCATCTTGCATT-1_4

Data type: double
Storage order: column major

Queued Operations:
1. AnnData HDF5 matrix in file /tmp/object_mergedsamples.h5ad, group X

> # Write data matrix
> write_matrix_anndata_hdf5(
+   mat = mergedsamples[["RNA"]]$data,
+   path = "object_mergedsamples.h5ad",
+   group = "layers/data"
+ )
Error: In function write_matrix_anndata_hdf5: "mat" must have class "IterableMatrix"
    ▆
 1. └─BPCells::write_matrix_anndata_hdf5(...)
 2.   └─BPCells:::assert_is(mat, "IterableMatrix")

> # Write scaled data
> write_matrix_anndata_hdf5(
+   mat = mergedsamples[["RNA"]]$scale.data,
+   path = "object_mergedsamples.h5ad",
+   group = "layers/scale.data"
+ )
Error: In function write_matrix_anndata_hdf5: "mat" must have class "IterableMatrix"
    ▆
 1. └─BPCells::write_matrix_anndata_hdf5(...)
 2.   └─BPCells:::assert_is(mat, "IterableMatrix")
> 
bnprks commented 1 week ago

Hi @Flu09, based on the error message it looks like somehow you constructed your Seurat mergedsamples object to have a BPCells-backed matrix for the counts layer, but not for the data and scale.data layers. The key error message text is this: Error: In function write_matrix_anndata_hdf5: "mat" must have class "IterableMatrix"

Have you tried checking the type of your data and scale.data layers with class(mergedsamples[["RNA"]]$data)?

In normal usage of Seurat, if you start with a BPCells matrix at the beginning and then run NormalizeData(), and ScaleData() those layers should end up with BPCells-backed matrices, but if you manually assign the counts layer after running those functions, those layers will still be in-memory matrices (probably dgCMatrix for data and matrix for scale.data).

You can wrap a dgCMatrix as a BPCells IterableMatrix by running as(mergedsamples[["RNA"]]$data, "IterableMatrix"), and converting from a dense R matrix requires a two-step process, converting first to dgCMatrix then IterableMatrix: mergedsamples[["RNA"]]$scale.data |> as("dgCMatrix") |> as("IterableMatrix")

In summary, if I've diagnosed your problem correctly, then your two options are either to run your normalization starting with a BPCells-backed counts matrix or to manually convert in-memory matrices to BPCells matrices as I described above.