Closed ycli1995 closed 8 months ago
Thanks for this work, it looks good overall! I have two comments/requests, and then I think things will mostly be ready to merge:
- What is the use case for the `group` argument for the 10x matrix functions? I know that for h5ad there are different matrix layers a user might load, along with related formats like muon that alter the HDF5 group a matrix is loaded from. Are there equivalent cases for the 10x format you're worried about here? If not, I'd prefer to keep the number of arguments as small as possible, as long as that doesn't cut out important functionality.
- Could you add some tests to `test-matrix_io.R` that just read/write 10x matrices of a few different types? Even a tiny 3x5 matrix is sufficient; we just want something that will flag if this gets broken accidentally in a future change.
Hi, @bnprks.
The `group` argument offers a way to read/write multi-genome 10x matrices in the old format (cellranger v2): the user just selects the group to use ("mm10" or "hg19", for example). This behavior is the same as `HDF5Array::TENxMatrix`. We just need to set 'matrix' as the default to fit the v3 format, which most current 10x data uses, and let users change the `group` when they encounter the v2 format.
Hi @ycli1995, thanks for adding those tests. I think I might limit it to just one small matrix size when I do my final edit pass, but otherwise it looks great. (This is because I'm not really worried about bugs that only show up on large/small matrices for this functionality.)
Regarding the `group` option, I think being able to write a cellranger v2 file is not so important, since the standard changed more than 5 years ago. But your point about being able to read multi-genome v2 files is still reasonable. One other option I just thought of would be to handle multiple genomes automatically for the user by calling `ConcatRows` in `open10xFeatureMatrix`? I'm not sure it's better than adding an extra argument, though, since it would be tricky for a user to find which genome went to which rows.
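To make the row-tracking concern concrete, here is a minimal Python sketch of concatenating per-genome matrices by rows while recording which row range came from which genome. It is only an illustration under stated assumptions: `concat_rows_with_ranges` is a hypothetical helper, plain nested lists stand in for matrices, and all genomes are assumed to share the same columns. It is not how `ConcatRows` or `open10xFeatureMatrix` is implemented.

```python
def concat_rows_with_ranges(matrices):
    """Stack per-genome matrices by rows and remember each genome's row range.

    `matrices` maps a genome name to a list of rows (hypothetical stand-in
    for one matrix per HDF5 group). All matrices must share a column count.
    """
    combined = []
    row_ranges = {}
    n_cols = None
    for genome, rows in matrices.items():
        for row in rows:
            if n_cols is None:
                n_cols = len(row)
            elif len(row) != n_cols:
                raise ValueError("all genomes must have the same number of columns")
        start = len(combined)
        combined.extend(rows)
        # Without this bookkeeping a user cannot tell which rows belong to which genome.
        row_ranges[genome] = range(start, len(combined))
    return combined, row_ranges


combined, row_ranges = concat_rows_with_ranges({
    "hg19": [[1, 0, 2], [0, 0, 1]],
    "mm10": [[3, 1, 0]],
})
print(len(combined))             # 3
print(list(row_ranges["mm10"]))  # [2]
```

Returning the ranges alongside the combined matrix is one way to keep the genome-to-row mapping recoverable; without something like it, the merged result hides that information.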
What would you think of the following:
- Rename the `group` argument to `genome` in `open_matrix_10x_hdf5`, matching how `scanpy.read_10x_h5()` works, which is required to read a multi-genome v2 file but otherwise can be `NULL`
- Remove the `group` argument in `write_matrix_10x_hdf5`, since we don't properly support writing the v2 metadata layout anyhow
For manual testing purposes, I dug up this old dataset that is just 100 cells with a v2 multi-genome matrix
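The proposed `genome` semantics (optional for single-group files, required for multi-genome v2 files) could be sketched roughly as follows. This is a hedged Python illustration, not BPCells or scanpy code: `select_genome` is a hypothetical helper, and a plain list of names stands in for the top-level groups of the HDF5 file.

```python
def select_genome(groups, genome=None):
    """Choose which top-level group to read from a 10x-style HDF5 file.

    `groups` is a stand-in for the file's top-level group names
    (e.g. ["matrix"] for v3, ["hg19", "mm10"] for multi-genome v2).
    """
    groups = list(groups)
    if not groups:
        raise ValueError("file contains no matrix groups")
    if genome is None:
        if len(groups) == 1:
            # Single-group file: nothing to choose, so no argument is needed.
            return groups[0]
        raise ValueError(
            f"multiple genomes found ({', '.join(groups)}); pass genome= to choose one"
        )
    if genome not in groups:
        raise ValueError(f"genome {genome!r} not found; available: {', '.join(groups)}")
    return genome


print(select_genome(["matrix"]))                       # matrix
print(select_genome(["hg19", "mm10"], genome="mm10"))  # mm10
```

Under this sketch, a caller reading an ordinary v3 file never has to think about the argument, while a multi-genome v2 file produces an error that lists the available choices.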
Honestly, the `group` argument was originally added just to satisfy `getAnnDataMatrixType(file, group)` for me. :)
However, it did remind me of reading multi-genome v2 matrices, which may be a useful side effect. I think the behavior of `Seurat::Read10X_h5` may be a good example of automatically handling multiple genomes or modalities.
https://github.com/satijalab/seurat/blob/656fc8b562d53e5d0cedda9e09d9dda81e8c00e9/R/preprocessing.R#L1058
When the user encounters multiple genomes or modalities, `Read10X_h5` returns a list of matrices and shows an info message. We can follow this pattern in BPCells. I might find a way to automatically detect the H5 groups before actually opening the matrices, so that we can omit the `group` argument in `open_matrix_10x_hdf5`. That way, `open_matrix_10x_hdf5` just needs to open the matrices group by group and return a list of IterableMatrix objects. In addition, the `group` argument in `write_matrix_10x_hdf5` can be dropped, since it isn't required for the final returned `open_matrix_10x_hdf5(...)`.
I'll do some further exploration first and let you know when I have something clear.
I finally decided to just omit the `group` argument in both `open_matrix_10x_hdf5` and `write_matrix_10x_hdf5` on the R side. This way, the only difference users may notice is that `open_matrix_10x_hdf5` will return a list when they encounter a multi-genome v2 h5 file.
If you want to keep exactly the same behavior as the previous version, just force `all_groups <- all_groups[1]` in `open_matrix_10x_hdf5`, which means it will always read only the first group of the HDF5 file.
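The behavior described here (detect every group, return a single matrix for one group and a collection for several, with an opt-out that keeps only the first group) might look roughly like the Python sketch below. It is an illustration under assumptions, not the actual R wrapper: `open_matrix_10x` and `first_group_only` are hypothetical names, and a dict of nested lists stands in for the HDF5 file and its matrices.

```python
def open_matrix_10x(file_groups, first_group_only=False):
    """Open every matrix group found in a (stand-in) 10x HDF5 file.

    `file_groups` maps group name -> matrix and plays the role of the file.
    Returns one matrix for single-group files, or a dict of matrices keyed
    by group name for multi-genome files.
    """
    all_groups = sorted(file_groups)
    if not all_groups:
        raise ValueError("file contains no matrix groups")
    if first_group_only:
        # Mirrors forcing `all_groups <- all_groups[1]` in the R wrapper.
        all_groups = all_groups[:1]
    matrices = {name: file_groups[name] for name in all_groups}
    if len(matrices) == 1:
        return next(iter(matrices.values()))
    return matrices


v3_file = {"matrix": [[1, 0], [0, 2]]}
v2_file = {"hg19": [[1, 0]], "mm10": [[0, 2]]}

print(open_matrix_10x(v3_file))                         # [[1, 0], [0, 2]]
print(list(open_matrix_10x(v2_file)))                   # ['hg19', 'mm10']
print(open_matrix_10x(v2_file, first_group_only=True))  # [[1, 0]]
```

Because the single-versus-multiple decision lives in this thin wrapper, the function that opens one group stays simple and always returns one matrix.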
Personally, I tend to leave the responsibility for complex, situation-specific conditional logic to the R side, and let each C++ function do only one thing well. In the 10x context, for example:
- `open10xFeatureMatrix` just needs to open a sparse matrix from a specified HDF5 group, because a group is the minimal unit for a complete matrix. It should not care about or access anything outside this group.
- It is up to the R wrapper around `open10xFeatureMatrix` to handle the extension to both single- and multi-group situations. For example, should it return a list of matrices, or a merged matrix? This cannot be handled in C++, since that requires a static class for the returned value. But in the wrapper function, it can easily be switched and extended.
Hi @ycli1995, sorry for the delay, but thanks for making these improvements. Your idea to have `open_matrix_10x_hdf5` return a list for the old-style multi-genome files seems like a very good solution! The fact that it lets us avoid introducing a `group` argument and is automatically useful to the end user is very nice.
I did realize that adding slots to the `10xMatrixH5` object could cause compatibility issues for people who read an old object via `readRDS()`, but I don't think that can be avoided if we want to make this improvement.
Other than that, I've gone through and made a few final edits then merged in the changes. Thanks again!
issue https://github.com/bnprks/BPCells/issues/74