Try adding supports for non-integer data for 10x HDF5

ycli1995 commented 8 months ago

issue https://github.com/bnprks/BPCells/issues/74

bnprks commented 8 months ago

Thanks for this work, it looks overall good! I have two comments/requests then I think things will mostly be ready to merge in:

Do we need to add the group argument for the 10x matrix functions? I know for h5ad there are different matrix layers a user might load, along with related formats like muon which alter the hdf5 group a matrix would be loading from. Are there equivalent cases for the 10x format you're worried about here? If not, I'd prefer to keep the number of possible arguments as small as possible if it won't cut out important functionality.
Could you add some tests in test-matrix_io.R that just read/write 10x matrices of a few different types? Even a tiny 3x5 matrix is sufficient, we just want something that will flag if this gets broken accidentally in a future change.

ycli1995 commented 8 months ago

Hi, @bnprks .

Adding group can offer an option to read/write multiple-genome 10x matrix for the old version (cellranger v2), that is, just select the used group ("mm10" or "hg19" for example). This behavior is the same as HDF5Array::TENxMatrix. We just need to set 'matrix' as default to fit the v3 format which is currently used in most 10x data, and let users to change the group when they meet v2 format.
You're right. I shall soon add some test codes to the main branch. Thanks.

bnprks commented 8 months ago

Hi @ycli1995, thanks for adding those tests. I think I might limit it to just one small matrix size when I do my final edit pass, but otherwise looks great. (This is because I'm not really worried about bugs that only show up on large/small matrices for this functionality)

Regarding the group option, I think being able to write a cellranger v2 file is not so important since the standard has been changed for >5 years now. But your point about being able to read multi-genome v2 files is still reasonable. One other option I just thought of would be to handle multiple genomes automatically for the user by calling ConcatRows in open10xFeatureMatrix? I'm not sure it's better than adding an extra argument, though, since it would be tricky for a user to find which genome went to which rows.

What would you think of the following:

We rename the group argument to genome in open_matrix_10x_hdf5, matching how scanpy.read_10x_h5() works, which is required to read a multi-genome v2 file but otherwise can be NULL
We drop the group argument in write_matrix_10x_hdf5, since we don't properly support writing the v2 metadata layout anyhow

For manual testing purposes, I dug up this old dataset that is just 100 cells with a v2 multi-genome matrix

ycli1995 commented 8 months ago

Honestly, the group argument was added to satisfy getAnnDataMatrixType(file, group) in the first place for me. :)

However, it did remind me of reading multi-genome v2 matrices, which may be kind of a useful side-effect. I think the manners of Seurat::Read10X_h5 may be a good example for automatically handling multiple genomes or modalities. https://github.com/satijalab/seurat/blob/656fc8b562d53e5d0cedda9e09d9dda81e8c00e9/R/preprocessing.R#L1058

When the user meet multiple genomes or modalities, the Read10X_h5 returns a list of matrices and show an info message. We can follow this pattern in BPCells. I might find a way to first automatically detect the H5 groups before actually open the matrices, so that we can omit the group argument in open_matrix_10x_hdf5. In this way, open_matrix_10x_hdf5 just need to open the matrices group by group, and return a list of IterableMatrix. Besides, the group argument in write_matrix_10x_hdf5 can be dropped since it isn't required for the final returned open_matrix_10x_hdf5(...).

I'll do further exploration first and let you know whe I get something clear.

ycli1995 commented 8 months ago

I finally decide to just omit the group argument in both open_matrix_10x_hdf5 and write_matrix_10x_hdf5 on R side. In this way, the only difference that users may feel is that open_matrix_10x_hdf5 will return a list when they meet a multi-genome v2 h5 file.

If you want to keep exactly the same as previous version, just force all_groups <- all_groups[1] in open_matrix_10x_hdf5, which means it will always only read the first group of HDF5 file.

Personally, I tend to leave the responsibilty of complex conditional judgements related to the specific situations to R side, and let each c++ function to do only one thing well. In the 10x context, for example:

open10xFeatureMatrix just need to open a sparse matrix from a specified HDF5 group, because a group is the minimal unit for a complete matrix. And it should not care and access anything outside this group.
Let the wrapper function who calls open10xFeatureMatrix to handle the extension to fit both single- and multi-group situations. For example, shall it return a list of matrices? or return a merged matrix? This cannot be hold by c++, as it requires static class for returned value. But in the wrapper function, it can be easily switched and extended.

bnprks commented 8 months ago

Hi @ycli1995, sorry for the delay but thanks for making these improvements. Your idea to have open_matrix_10x_hdf5 return a list for the old-style multi-genome files seems like a very good solution! The fact that it lets us avoid introducing a group argument and just automatically be useful to the end-user is very nice.

I did realize that adding slots to the 10xMatrixH5 object could cause compatibility issues for people who read an old object via readRDS(), but I don't think that can be avoided if we want to make this improvement.

Other than that, I've gone through and made a few final edits then merged in the changes. Thanks again!

bnprks / BPCells

Try adding supports for non-integer data for 10x HDF5 #75