bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

BPCells to h5AD #49

Closed jonathan-columbiau closed 8 months ago

jonathan-columbiau commented 9 months ago

Hi,

Thanks for creating this great tool! I'd like to use data I currently have stored in a BPCells matrix with a library only found in Python and that takes in h5ad Anndata files - does BPCells have functionality to write matrices to h5ad?

Thanks!

bnprks commented 9 months ago

Currently no, but I've been waiting for an excuse to add the functionality. I'll try adding it over the next couple days and post back here how it goes

bnprks commented 8 months ago

Sorry for the delay on this, but I've finally pushed an update which adds the function write_matrix_anndata_hdf5(). This should make it possible to write a BPCells matrix either to a standalone h5ad file, or as an extra matrix in an existing h5ad

jonathan-columbiau commented 8 months ago

Thanks, really appreciate it!

Dario-Rocha commented 6 months ago

Hello again, this package is getting better every day! At the moment I have a similar task to the OP's, but R fails to find the function:

Error: 'write_matrix_anndata_hdf5' is not an exported object from 'namespace:BPCells'

This happens even though the function is there in the help, and I've just reinstalled the BPCells package.

Thanks a lot again for your help!

bnprks commented 6 months ago

Oh thanks for the heads up @Dario-Rocha! It looks like I hadn't re-generated the NAMESPACE file so the function wasn't getting exported. Should be fixed now by commit f1b9f6bb. (Also for future temporary workarounds, you can use BPCells::: with three colons to access unexported methods though please still let me know if I've forgotten to export something)

Dario-Rocha commented 6 months ago

Great! That worked. Now I am only wondering whether BPCells can write the expression matrix along with the obs and var data in the h5ad file.

bnprks commented 6 months ago

Right now the intended behavior is as follows (though correct me if you're seeing something else happen):

  1. BPCells can write sparse matrices either to the main matrix X (by default), a layer of the matrix (by setting the group to layers/my_matrix_name), or even under varm or obsm if desired.
  2. For compatibility purposes, BPCells will write a barebones obs or var group if that doesn't already exist in the file, which just contains a 0-based row or column index.
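
As a rough illustration of the layout in point 1, here is a minimal Python sketch that walks an h5ad file with h5py and reports where sparse matrices are stored and which dtype their values have. It assumes the `encoding-type` attribute convention described in the AnnData on-disk format docs. The helper names `find_sparse_matrices` and `write_toy_matrix` are hypothetical, and the toy file is hand-built for the example rather than produced by BPCells.

```python
import os
import tempfile

import h5py
import numpy as np

# Encoding labels used for sparse matrices (assumption: AnnData on-disk format docs).
SPARSE_TYPES = {"csr_matrix", "csc_matrix"}


def find_sparse_matrices(path):
    """Map each sparse-matrix group in an h5ad file to the dtype of its values."""
    found = {}

    def visit(name, obj):
        if not isinstance(obj, h5py.Group):
            return
        enc = obj.attrs.get("encoding-type")
        if isinstance(enc, bytes):
            enc = enc.decode()
        if enc in SPARSE_TYPES and "data" in obj:
            found[name] = str(obj["data"].dtype)

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return found


def write_toy_matrix(f, group, dtype):
    """Write a 2 x 3 CSR matrix in the assumed on-disk layout (toy data only)."""
    g = f.create_group(group)
    g.attrs["encoding-type"] = "csr_matrix"
    g.attrs["encoding-version"] = "0.1.0"
    g.attrs["shape"] = np.array([2, 3], dtype=np.int64)
    g.create_dataset("data", data=np.array([3, 1, 2], dtype=dtype))
    g.create_dataset("indices", data=np.array([1, 0, 2], dtype=np.int32))
    g.create_dataset("indptr", data=np.array([0, 1, 3], dtype=np.int32))


# Build a toy file with a main matrix and one layer, then inspect it.
toy_path = os.path.join(tempfile.mkdtemp(), "toy.h5ad")
with h5py.File(toy_path, "w") as f:
    write_toy_matrix(f, "X", np.uint32)
    write_toy_matrix(f, "layers/my_matrix_name", np.float32)

matrix_dtypes = find_sparse_matrices(toy_path)
```

Running `find_sparse_matrices()` on a file written by BPCells should show, under the same assumptions, which group the matrix landed in and which value dtype was written to disk.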

Assuming you are wanting to write additional data to obs or var, BPCells doesn't have additional support for that right now. I'd be happy to accept a contribution adding this support, but I haven't built it personally because handling factors seems a bit tricky to get right and BPCells doesn't have much other metadata-related functionality right now.

(If you just want a quick way to pass metadata yourself, I'd recommend hdf5r and checking out the Anndata format docs, or even using reticulate to write the metadata directly using the AnnData python package from within R)

Dario-Rocha commented 6 months ago

Thank you! I understand.

ggruenhagen3 commented 4 months ago

I'm posting this here in case it helps others who were in my situation.

I was unable to access the data matrix after converting to an h5ad. For example, trying to access the first 5 genes and cells would give an error that said "ValueError: unsupported data types in input". Eventually, I found a solution on my own. I changed the data type of the matrix to an integer.

```{r}
write_matrix_hdf5(obj[["SCT"]]$counts, "sct_counts.h5")
```

```{python}
import scanpy as sc

adata = sc.read_h5ad("sct_counts.h5")
adata.X[0:5, 0:5]              # -> gives the ValueError
adata.X = adata.X.astype(int)  # solution
adata.X[0:5, 0:5]              # this now works
```

bnprks commented 4 months ago

Thanks for the advice @ggruenhagen3! If I remember correctly the Anndata specification doesn't require any particular data type for matrices stored on disk, but evidently the python package implementation is not so flexible when reading from disk. I suppose your workaround might be the best option for now, or I believe calling mat <- convert_matrix_type(mat, "float") prior to writing in BPCells would be another option to match up with apparent type limitations in scanpy/anndata
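
For reference, the Python-side cast can be sketched on a synthetic matrix as below. This is only an illustration with SciPy, under the assumption that the matrix loaded from the h5ad is a SciPy sparse matrix. It does not reproduce the ValueError from the report, and the helper name `cast_sparse` is hypothetical.

```python
import numpy as np
from scipy import sparse


def cast_sparse(mat, dtype=np.float32):
    """Return a CSR copy of `mat` with its stored values cast to `dtype`."""
    return sparse.csr_matrix(mat).astype(dtype)


# Synthetic stand-in for a counts matrix stored as unsigned 32-bit integers.
counts = sparse.csr_matrix(np.array([[0, 3, 0], [1, 0, 2]], dtype=np.uint32))

fixed = cast_sparse(counts)
block = fixed[0:2, 0:2].toarray()  # slice after casting, as in the workaround
```

With a real AnnData object the equivalent step would be `adata.X = cast_sparse(adata.X)`, which mirrors the `astype` workaround posted above but targets a float type instead of `int`.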

ggruenhagen3 commented 4 months ago

@bnprks I had tried convert_matrix_type with every available option (i.e. "float", "uint32_t", "double"), but all resulted in an error in Python when trying to use the matrix. The problem may lie in Python with anndata?

For the record, I am using the following versions: R 4.3.1, BPCells_0.1.0, Seurat_5.0.0, anndata in R version 0.7.5.6 (not sure that this one is relevant), python 3.9.18, scanpy 1.9.3, and anndata in python version 0.8.0 (I had tried other versions, including 0.10.something).