Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array

Coerce non-integer shapes into integers in the `H5SparseMatrix` constructor #48

Closed LTLA closed 2 years ago

LTLA commented 2 years ago

Using this file as an example:

library(HDF5Array)
H5SparseMatrix("pbmc4k-tenx.h5", "matrix")
## <33694 x 4340> sparse matrix of class H5SparseMatrix and type "integer":
## etc. etc. looks fine.

However, there seem to be many files where the shape is saved as a uint64. This causes problems in some of the H5SparseMatrixSeed constructors, because the HDF5Array C code reads such values as doubles. To reproduce, we can replace the shape dataset with its uint64 counterpart (this requires h5py, as I can't figure out how to do that with rhdf5):

import shutil
src = "pbmc4k-tenx.h5"
dest = "promoted.h5"
shutil.copyfile(src, dest)

import h5py
import numpy
with h5py.File(dest, "a") as handle:
    mhandle = handle["matrix"]
    dims = mhandle["shape"][:]
    del mhandle["shape"]
    promoted = dims.astype(numpy.uint64)
    mhandle.create_dataset("shape", data = promoted)

And then:

H5SparseMatrix("promoted.h5", "matrix")
## Error in validObject(.Object) :
##  invalid class “CSC_H5SparseMatrixSeed” object: invalid object for slot "dim" in class "CSC_H5SparseMatrixSeed": got class "numeric", should be or extend class "integer"

Some testing suggests that just setting as.integer=TRUE in the read_h5sparse_component call in .read_h5sparse_dim would be sufficient to get the example above working.
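
For anyone wanting to check whether a file is affected before loading it in R, here is a minimal sketch that reports the stored type of the shape dataset. It assumes the same layout as the example above (a "matrix" group containing "shape"); shape_dtype is a made-up helper name, not part of h5py or HDF5Array.

```python
import h5py

def shape_dtype(path, group="matrix"):
    """Return the dtype of the 'shape' dataset stored under `group`.

    Hypothetical helper; assumes the 10x-style layout used in this thread.
    """
    with h5py.File(path, "r") as handle:
        return handle[group]["shape"].dtype

if __name__ == "__main__":
    # File names taken from the example above.
    for name in ("pbmc4k-tenx.h5", "promoted.h5"):
        print(name, shape_dtype(name))
```

A file whose shape reports uint64 here is one that I would expect to trigger the error above.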

Session information

```
R Under development (unstable) (2022-02-11 r81718)
Platform: x86_64-apple-darwin19.6.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Users/luna/Software/R/trunk/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] HDF5Array_1.25.0     rhdf5_2.39.6         DelayedArray_0.21.2
[4] IRanges_2.29.1       S4Vectors_0.33.15    MatrixGenerics_1.7.0
[7] matrixStats_0.61.0   BiocGenerics_0.41.2  Matrix_1.4-1

loaded via a namespace (and not attached):
[1] compiler_4.2.0     tools_4.2.0        rhdf5filters_1.7.0 grid_4.2.0
[5] lattice_0.20-45    Rhdf5lib_1.17.3
```

wmacnair commented 2 years ago

I had the same issue (when trying to import h5 files saved by CellBender).

In case it's useful to anyone else, my workaround was to apply this function (the same as @LTLA's but reversed) to all the CellBender output h5 files. The fixed versions could then be loaded by DropletUtils::read10xCounts.

import os
import shutil

import h5py
import numpy

def fix_cellbender_h5(s, bender_dir):
  # copy the file so the original CellBender output is left untouched
  src     = os.path.join(bender_dir, f"cellbender_{s}_filtered.h5")
  dest    = os.path.join(bender_dir, f"cellbender_{s}_filtered_fixed.h5")
  shutil.copyfile(src, dest)

  # replace the shape dataset with a C int (typically 32-bit) copy
  with h5py.File(dest, "a") as handle:
    mat_handle  = handle["matrix"]
    dims        = mat_handle["shape"][:]
    del mat_handle["shape"]
    dims_fixed  = dims.astype(numpy.intc)
    mat_handle.create_dataset("shape", data = dims_fixed)
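
One caveat with the cast above: as far as I know, astype does not range-check integer-to-integer conversions, so a dimension above 2^31 - 1 would wrap around silently rather than raise. Below is a hedged sketch of a checked variant. The name to_int32_checked is hypothetical, and it uses numpy.int32 instead of numpy.intc on the assumption that a 32-bit shape is what is wanted.

```python
import numpy

def to_int32_checked(dims):
    """Cast dims to int32, raising instead of wrapping on overflow.

    Hypothetical helper, not part of numpy or h5py.
    """
    dims = numpy.asarray(dims)
    limit = numpy.iinfo(numpy.int32).max
    # tolist() yields plain Python numbers, so the comparison itself cannot overflow.
    for value in dims.ravel().tolist():
        if value < 0 or value > limit:
            raise ValueError(f"dimension {value} does not fit in a 32-bit integer")
    return dims.astype(numpy.int32)
```

In fix_cellbender_h5, this would stand in for the dims.astype(numpy.intc) line.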

hpages commented 2 years ago

Thanks for the report. Should be fixed in HDF5Array 1.24.1 (release) and 1.25.1 (devel).

H.