Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array
11 stars 13 forks source link

H5ADMatrix objects looking up the wrong names #40

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

There are sparse matrices in other groups than in X and layers, and these don't have the same dimensions as the assays, e.g.

library(HDF5Array)
path <- system.file("extdata", "example_anndata.h5ad", package="zellkonverter")
mat <- H5ADMatrix(path, "obsp/connectivities")
dim(mat)
## [1] 200 200
lengths(dimnames(mat)))
## [1] 459 200

This causes obvious problems, e.g.,

as(mat, "sparseMatrix")
## Error in validObject(r) : 
##   invalid class “dgTMatrix” object: length(Dimnames[1]) differs from Dim[1] which is 200

The simplest solution is to not try to extract the dimnames. I'm completely fine with not having dimnames naturally attached to the H5ADMatrix, much like I wouldn't expect to get dimnames out of a HDF5Matrix in most scenarios.

Session information ``` R Under development (unstable) (2021-01-24 r79876) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS Matrix products: default BLAS: /home/luna/Software/R/trunk/lib/libRblas.so LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] HDF5Array_1.19.10 rhdf5_2.35.2 DelayedArray_0.17.10 [4] IRanges_2.25.7 S4Vectors_0.29.15 MatrixGenerics_1.3.1 [7] matrixStats_0.58.0 BiocGenerics_0.37.1 Matrix_1.3-2 loaded via a namespace (and not attached): [1] compiler_4.1.0 tools_4.1.0 rhdf5filters_1.3.4 grid_4.1.0 [5] lattice_0.20-41 Rhdf5lib_1.13.4 ```
hpages commented 3 years ago

Hmm.. the purpose of the H5ADMatrix() constructor was to get the main matrix from the h5ad file, not to get an arbitrary matrix from any possible corner of the file. I'll clarify in the man page. Right now getting the "/obsp/connectivities" matrix can be done with:

> DelayedArray(H5SparseMatrixSeed(path, "/obsp/connectivities"))
<200 x 200> sparse matrix of class DelayedMatrix and type "double":
             [,1]       [,2]       [,3] ...    [,199]    [,200]
  [1,]          0          0          0   .         0         0
  [2,]          0          0          0   .         1         0
  [3,]          0          0          0   .         0         0
  [4,]          0          0          0   .         0         0
  [5,]          0          0          0   .         0         0
   ...          .          .          .   .         .         .
[196,] 0.00000000 0.00000000 0.00000000   . 0.0000000 0.0000000
[197,] 0.00000000 0.08533925 0.92491716   . 0.3622735 0.0000000
[198,] 0.00000000 0.00000000 0.00000000   . 0.0000000 0.0000000
[199,] 0.00000000 1.00000000 0.00000000   . 0.0000000 0.0000000
[200,] 0.00000000 0.00000000 0.00000000   . 0.0000000 0.0000000

Of course we should be able to do H5SparseMatrix(path, "/obsp/connectivities"). The reason I didn't implement the H5SparseMatrix class + constructor yet is because I didn't have a use case... until now. H5SparseMatrix() will always return a naked matrix (no dimnames).

I'll also try to improve error handling to prevent the user from ending up in the situation you reported above.

So we'll end up with 4 constructor functions in the HDF5Matrix package. 2 general purpose:

and 2 specialized:

Maybe a little confusing for the end user. To mitigate this I'll need to improve error handling in each constructor. Ideally they should be able to fail with an error message that redirects the user to the right constructor to use.

LTLA commented 3 years ago

Of course we should be able to do H5SparseMatrix(path, "/obsp/connectivities").

This constructor would need to dispatch to different subclasses depending on whether the sparse matrix was formatted in the 10X style (dimensions in /matrix/shape) or in the H5AD style (dimensions in the attributes). Would the constructor be comfortable automatically making this decision?

If not, it seems like it would be easier for everyone to just have an option like H5ADMatrixSeed(load.dimnames=FALSE). We wouldn't need a H5SparseMatrix subclass that tries to guess the format, while the user wouldn't need to worry about which constructor to use. In fact, you could even catch the difference in the dim and lengths(dimnames) and raise a graceful error/warning about setting load.dimnames=FALSE.

hpages commented 3 years ago

This constructor would need to dispatch to different subclasses depending on whether the sparse matrix was formatted in the 10X style.

The seed constructor already does this kind of dispatch:

> H5SparseMatrixSeed(path, "/obsp/connectivities")
<200 x 200> sparse matrix of class CSC_H5SparseMatrixSeed and type "double":
# dirname: /home/hpages/R/R-4.1.r80083/library/zellkonverter/extdata
# basename: example_anndata.h5ad
# group: /obsp/connectivities

The exact type of the returned seed is either CSC_H5SparseMatrixSeed or CSR_H5SparseMatrixSeed, depending on the direction of the compression. Both classes are concrete subclasses of virtual seed class H5SparseMatrixSeed.

H5SparseMatrix() will do more or less what H5ADMatrix() does except that it will return an H5SparseMatrix instance instead of an H5ADMatrix instance and it will leave the dimnames alone.

LTLA commented 3 years ago

My main concern is that, if you provide a H5SparseMatrix constructor, the name implies that it'll generally work on any HDF5-backed sparse format. This seems like a big promise to make, given that we don't have a HDF5 sparse standard and people are liable to come up with their own ways of packing sparse matrices in there. (By comparison, HDF5Array() works fine because it corresponds exactly to the standardized concept of a HDF5 dataset.) I guess you could document H5SparseMatrix() as only being sensible for H5AD and 10X matrices, but then a user might as well just call those two constructors directly.

I say this because I want to write my own HDF5-backed sparse matrix subclasses - indeed, possibly inheriting from some virtual H5SparseMatrix class - in packages separate from HDF5Array. If users are conditioned to use H5SparseMatrix() to construct their objects, they may well expect that it would "just work" with all sparse formats, which it won't for these custom sparse formats because there's no way for the constructor to discover the relevant third-party package. Using format-specific constructors would be clearer and more intuitive for the user, and eliminate the need for more dispatch logic.

But anyway, getting back to the actual problem: the H5ADMatrix is already being used to pull things from corners other than the main matrix X, given that people will expect it to work on the various assays in the layers. This leads to the natural expectation that if it works for X, and it works for layers... why doesn't it just work for obsp? I suspect that users will try and use it anyway - I mean, I did - and then get surprised when the dimnames don't match up, leading to this issue being raised again and again.

To me, the solution is as simple as modifying H5ADMatrixSeed like so:

H5ADMatrixSeed <- function(filepath, name="X", load.dimnames=TRUE)
{
    if (!isSingleString(filepath))
        stop(wmsg("'filepath' must be a single string specifying the ",
                  "path to the h5ad file where the matrix is located"))
    filepath <- file_path_as_absolute(filepath)
    if (load.dimnames) {
        dimnames <- .load_h5ad_dimnames(filepath)
    } else {
        dimnames <- list(NULL, NULL)
    }

    # the rest of the stuff

    if (load.dimnames && !identical(lengths(dimnames), dim(ans))) {
        warning("dimnames do not match the dimensions of the array, set 'load.dimnames=FALSE' to ignore all dimnames")
        dimnames <- list(NULL, NULL)
    }

    ans
}

I don't think this breaks any aspect of the H5ADMatrix contract; the only relevant piece in the documentation is that a H5ADMatrix is a matrix stored in the H5AD file, it doesn't say anything about the rows being genes and the columns being cells.

hpages commented 3 years ago

Check HDF5Array 1.19.13 (commit d7fd6dcf035b195d009db49727276a49d251ff93).

it doesn't say anything about the rows being genes and the columns being cells.

Now it does (kind of).

LTLA commented 3 years ago

Seems unnecessarily restrictive, but I suppose it doesn't matter to the average user, given that zellkonverter::readH5AD will be wrapping it into an SCE anyway. It's likely that I'll just use the H5SparseMatrix constructor inside readH5AD() to avoid redundant reads of the dimnames, and also to avoid having to special-case /X.

On that note, can we have a format= option in the H5SparseMatrixSeed constructor that instructs .read_h5sparse_dim() to not try and guess the format? If I know it's a sparse matrix in a H5AD file, I don't want the function getting misled by a stray /shape in the same group. I have a few applications where we're adding more pieces of information for each assay in its group, and I don't want to have to remember that shape is a protected field just because it's used by 10X.

hpages commented 3 years ago

Come on! shape belongs to the group that contains the CSR or CSC components of a sparse matrix. Why would you add anything there? And if you do, I hope you're not going to add a shape dataset that describes something that is not the shape of the sparse matrix.

Anyways, this is kind of a separate issue. If you want to pursue that request, please do so in a new issue.