Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array
9 stars 13 forks source link

Cannot read 10X h5 #47

Closed ekernf01 closed 2 years ago

ekernf01 commented 2 years ago

I want to load a file as the backend for a DelayedArray:

test.h5.gz

Trying this command:

X = HDF5Array::TENxMatrixSeed(filepath = "test.h5",  group  = "HDF5ArrayAUTO00012")

... this error shows. Can you help me figure out where the issue is? Some more info lies below. Thanks!

Error in HDF5Array::TENxMatrixSeed(filepath = "test.h5",  : 
  length(indptr) == dim[[2L]] + 1L is not TRUE

It reads successfully using rhdf5.

> rhdf5::h5ls(                          "test.h5")
                group               name       otype  dclass  dim
0                   / HDF5ArrayAUTO00012   H5I_GROUP             
1 /HDF5ArrayAUTO00012               data H5I_DATASET   FLOAT 3638
2 /HDF5ArrayAUTO00012            indices H5I_DATASET INTEGER 3638
3 /HDF5ArrayAUTO00012             indptr H5I_DATASET INTEGER   10
4 /HDF5ArrayAUTO00012              shape H5I_DATASET INTEGER    2

I created it mostly using HDF5Array::write_block. Here is the function I used.

csvToHdf5 = function(
  block_size,
  input_file,
  output_file,
  array_dimension, 
  force = F,
  ...
){
  input_connection = gzfile(input_file, "r")
  position = 1
  if(!force & file.exists(output_file)){
    stop("File exists. Pass force=T to overwrite.\n")
  }
  file.create(output_file)
  output_sink = HDF5Array::TENxRealizationSink(
    rev(as.integer(array_dimension)),
    filepath = output_file
  )
  while(T){
    is_first_block = (position==1)
    X = NULL
    tryCatch(
      expr = {X = read.csv(input_connection, nrows=block_size, header = is_first_block, ...)},
      error = function(e){
        if( "no lines available in input" == e$message ){
          warning(
            paste0("Reached end of input file earlier than expected: position = ", position, "\n")
          )
        } else {
          stop(e)
        }
      }
    )
    if(is.null(X)){
      break
    }
    if(ncol(X)<=1){
      stop("No delimiters found. Wrong delimiter?\n")
    }
    HDF5Array::write_block(
      output_sink,
      viewport = DelayedArray::ArrayViewport(
        refdim = dim(output_sink), 
        ranges = IRanges::IRanges(
          start = c(1      , position              ), 
          end   = c(ncol(X), position + nrow(X) - 1)
        )
      ), 
      block = t(as.matrix(X))
    )
    position = position + block_size
    if(position>array_dimension[[1]]){
      break
    }
  }
  close(input_connection)
  return(output_sink)
}

Here's my session info.


R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: elementary OS 5.1.7 Hera

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] HDF5Array_1.16.1    compiler_4.0.0      Matrix_1.3-4        IRanges_2.22.2      parallel_4.0.0     
 [6] DelayedArray_0.14.1 tools_4.0.0         rhdf5_2.32.4        S4Vectors_0.26.1    grid_4.0.0         
[11] BiocGenerics_0.34.0 matrixStats_0.61.0  stats4_4.0.0        lattice_0.20-45     Rhdf5lib_1.10.1
hpages commented 2 years ago

One thing to understand is that the data must be written column by column to the 10x Genomics dataset, or by blocks of columns. These blocks of columns will correspond to blocks of rows in the input file (CSV). So the final TENxMatrix object will be presented as a transposed dataset with respect to the CSV file.

As a side note you'll avoid a lot of headaches if you use TENxRealizationSink() and write_block() in a more idiomatic way. In particular it's much preferable to use a grid for the main loop. There are many examples in ?write_block for how to do this. In your particular case, it's going to look something like this:

library(HDF5Array)

csvToTENxMatrix <- function(input_file, input_dim, output_file, group, input_block_nrow,
                            force=FALSE, verbose=FALSE, ...)
{
    if (file.exists(output_file)) {
        if (!force)
            stop("File exists. Pass force=TRUE to overwrite.")
        if (unlink(output_file) != 0L)
            stop("failed to delete file \"", output_file, "\"")
    }

    input_connection <- gzfile(input_file, "r")
    output_dim <- rev(as.integer(input_dim))
    sink <- HDF5Array::TENxRealizationSink(output_dim, filepath=output_file, group=group)
    sink_grid <- colAutoGrid(sink, ncol=input_block_nrow)

    ## Walk on the grid, and, for each viewport, read a block from the CSV file and
    ## write it to the h5 file.
    for (bid in seq_along(sink_grid)) {
        viewport <- sink_grid[[bid]]
        if (verbose)
            cat("Reading row(s) ", as.character(ranges(viewport)[2]), " from ", input_file, " ... ", sep="")
        is_first_block <- bid == 1L
        input_block <- read.csv(input_connection, nrows=width(viewport)[2], header=is_first_block, ...)
        if (ncol(input_block) <= 1L)
            stop("No delimiters found. Wrong delimiter?")
        if (verbose)
            cat("OK\n")
        output_block <- t(as.matrix(input_block))
        if (verbose)
            cat("Writing column(s) ", as.character(ranges(viewport)[2]), " to ", output_file, " ... ", sep="")
        sink <- write_block(sink, viewport, output_block)
        if (verbose)
            cat("OK\n")
    }
    close(sink)
    close(input_connection)
    as(sink, "DelayedArray")
}

Then:

m0 <- Matrix::rsparsematrix(100, 50, density=0.1)
write.csv(as.matrix(m0), "test.csv", row.names=FALSE)

tenx <- csvToTENxMatrix("test.csv", input_dim=dim(m0),
                        output_file="test.h5", group="m0",
                        input_block_nrow=30, verbose=TRUE)
# Reading row(s) 1-30 from test.csv ... OK
# Writing column(s) 1-30 to test.h5 ... OK
# Reading row(s) 31-60 from test.csv ... OK
# Writing column(s) 31-60 to test.h5 ... OK
# Reading row(s) 61-90 from test.csv ... OK
# Writing column(s) 61-90 to test.h5 ... OK
# Reading row(s) 91-100 from test.csv ... OK
# Writing column(s) 91-100 to test.h5 ... OK

tenx
# <50 x 100> sparse matrix of class TENxMatrix and type "double":
#         [,1]   [,2]   [,3] ...  [,99] [,100]
#  [1,]   0.00   0.00   0.00   .      0      0
#  [2,]   0.00   1.60   0.00   .      0      0
#  [3,]  -0.38   0.00   0.00   .      0      0
#  [4,]   0.00   0.00   0.00   .      0      0
#  [5,]   0.00   0.00   0.00   .      0      0
#   ...      .      .      .   .      .      .
# [46,]      0      0      0   .    0.0    0.0
# [47,]      0      0      0   .    0.0    0.0
# [48,]      0      0      0   .    0.0    0.0
# [49,]      0      0      0   .   -1.1    0.0
# [50,]      0      0      0   .    0.0    0.0

## Sanity check:
identical(as(tenx, "dgCMatrix"), t(m0))

h5ls(path(tenx))
#   group    name       otype  dclass dim
# 0     /      m0   H5I_GROUP            
# 1   /m0    data H5I_DATASET   FLOAT 500
# 2   /m0 indices H5I_DATASET INTEGER 500
# 3   /m0  indptr H5I_DATASET INTEGER 101
# 4   /m0   shape H5I_DATASET INTEGER   2

Hope this helps, H.

hpages commented 2 years ago

BTW it's generally better to ask for this kind of help on the Bioconductor support site: https://support.bioconductor.org/

Thanks, H.

ekernf01 commented 2 years ago

Thank you, this is super helpful and I really appreciate it. I will close for now until/unless I still have trouble after rewriting more idiomatically.