ekernf01 closed 2 years ago
One thing to understand is that the data must be written column by column to the 10x Genomics dataset, or by blocks of columns. These blocks of columns will correspond to blocks of rows in the input file (CSV). So the final TENxMatrix object will be presented as a transposed dataset with respect to the CSV file.
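The transposition can be seen with plain base R matrices. This is only a minimal sketch that involves no HDF5 file at all, and the toy matrix `csv_like` is made up for illustration:

```r
## Toy stand-in for the CSV contents: 5 rows x 3 columns (hypothetical data).
csv_like <- matrix(1:15, nrow=5, ncol=3)

## Take a block of rows, as one would read it from the CSV file ...
row_block <- csv_like[2:4, , drop=FALSE]

## ... and transpose it so that it can be written as a block of columns.
col_block <- t(row_block)

## The block of columns is exactly columns 2-4 of the transposed dataset.
identical(col_block, t(csv_like)[, 2:4, drop=FALSE])  # TRUE

## The transposed dataset has the dimensions of the input, reversed.
dim(t(csv_like))  # 3 5
```

So rows 2-4 of the input end up as columns 2-4 of the output, which is why the function below reverses `input_dim` to get the dimensions of the sink.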
As a side note, you'll avoid a lot of headaches if you use TENxRealizationSink() and write_block() in a more idiomatic way. In particular, it's much preferable to use a grid for the main loop. There are many examples in ?write_block for how to do this. In your particular case, it's going to look something like this:
library(HDF5Array)

csvToTENxMatrix <- function(input_file, input_dim, output_file, group, input_block_nrow,
                            force=FALSE, verbose=FALSE, ...)
{
    if (file.exists(output_file)) {
        if (!force)
            stop("File exists. Pass force=TRUE to overwrite.")
        if (unlink(output_file) != 0L)
            stop("failed to delete file \"", output_file, "\"")
    }
    input_connection <- gzfile(input_file, "r")
    output_dim <- rev(as.integer(input_dim))
    sink <- HDF5Array::TENxRealizationSink(output_dim, filepath=output_file, group=group)
    sink_grid <- colAutoGrid(sink, ncol=input_block_nrow)

    ## Walk on the grid, and, for each viewport, read a block from the CSV file and
    ## write it to the h5 file.
    for (bid in seq_along(sink_grid)) {
        viewport <- sink_grid[[bid]]
        if (verbose)
            cat("Reading row(s) ", as.character(ranges(viewport)[2]), " from ", input_file, " ... ", sep="")
        is_first_block <- bid == 1L
        input_block <- read.csv(input_connection, nrows=width(viewport)[2], header=is_first_block, ...)
        if (ncol(input_block) <= 1L)
            stop("No delimiters found. Wrong delimiter?")
        if (verbose)
            cat("OK\n")
        output_block <- t(as.matrix(input_block))
        if (verbose)
            cat("Writing column(s) ", as.character(ranges(viewport)[2]), " to ", output_file, " ... ", sep="")
        sink <- write_block(sink, viewport, output_block)
        if (verbose)
            cat("OK\n")
    }
    close(sink)
    close(input_connection)
    as(sink, "DelayedArray")
}
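To make the grid idea more concrete, here is a small sketch of what the viewports look like for a 50 x 100 output written 30 columns at a time. It builds the grid with RegularArrayGrid() purely for illustration (the function above gets its grid from colAutoGrid() instead), and the dimensions and block size are just the ones used in the example that follows:

```r
library(DelayedArray)

## Dimensions of the transposed output: 50 rows x 100 columns.
output_dim <- c(50L, 100L)

## A grid whose viewports span all 50 rows and at most 30 columns each.
## (Illustration only: the function above calls colAutoGrid() on the sink.)
grid <- RegularArrayGrid(output_dim, spacings=c(50L, 30L))

## Number of viewports, i.e. number of iterations of the main loop.
length(grid)  # 4

## Number of columns covered by each viewport. This is also the number of
## CSV rows that would be read at each iteration.
block_ncols <- vapply(seq_along(grid),
                      function(bid) width(grid[[bid]])[2],
                      integer(1))
block_ncols  # 30 30 30 10
```

The last viewport is narrower than the others because 100 is not a multiple of 30. Taking the number of rows to read from `width(viewport)[2]` handles that case without any special bookkeeping.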
Then:
m0 <- Matrix::rsparsematrix(100, 50, density=0.1)
write.csv(as.matrix(m0), "test.csv", row.names=FALSE)
tenx <- csvToTENxMatrix("test.csv", input_dim=dim(m0),
                        output_file="test.h5", group="m0",
                        input_block_nrow=30, verbose=TRUE)
# Reading row(s) 1-30 from test.csv ... OK
# Writing column(s) 1-30 to test.h5 ... OK
# Reading row(s) 31-60 from test.csv ... OK
# Writing column(s) 31-60 to test.h5 ... OK
# Reading row(s) 61-90 from test.csv ... OK
# Writing column(s) 61-90 to test.h5 ... OK
# Reading row(s) 91-100 from test.csv ... OK
# Writing column(s) 91-100 to test.h5 ... OK
tenx
# <50 x 100> sparse matrix of class TENxMatrix and type "double":
#         [,1]   [,2]   [,3] ...  [,99] [,100]
#  [1,]   0.00   0.00   0.00   .      0      0
#  [2,]   0.00   1.60   0.00   .      0      0
#  [3,]  -0.38   0.00   0.00   .      0      0
#  [4,]   0.00   0.00   0.00   .      0      0
#  [5,]   0.00   0.00   0.00   .      0      0
#   ...      .      .      .   .      .      .
# [46,]      0      0      0   .    0.0    0.0
# [47,]      0      0      0   .    0.0    0.0
# [48,]      0      0      0   .    0.0    0.0
# [49,]      0      0      0   .   -1.1    0.0
# [50,]      0      0      0   .    0.0    0.0
## Sanity check:
identical(as(tenx, "dgCMatrix"), t(m0))
h5ls(path(tenx))
#   group    name       otype  dclass dim
# 0     /      m0   H5I_GROUP
# 1   /m0    data H5I_DATASET   FLOAT 500
# 2   /m0 indices H5I_DATASET INTEGER 500
# 3   /m0  indptr H5I_DATASET INTEGER 101
# 4   /m0   shape H5I_DATASET INTEGER   2
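Once a file has this layout, it can be opened again later as a DelayedArray backend with the TENxMatrix() constructor. The sketch below is self-contained rather than reusing test.h5: it writes a small made-up matrix with writeTENxMatrix() to a temporary file, so the matrix `m`, the temporary path, and the group name are assumptions chosen only for illustration:

```r
library(HDF5Array)

## Small made-up sparse matrix: 20 rows x 10 columns.
m <- Matrix::rsparsematrix(20, 10, density=0.2)

## Write it in the 10x Genomics sparse layout to a temporary file.
h5_path <- tempfile(fileext=".h5")
writeTENxMatrix(m, filepath=h5_path, group="m0")

## Re-open the file as a TENxMatrix object. The data stays on disk.
tenx2 <- TENxMatrix(h5_path, group="m0")
dim(tenx2)
```

The `group` argument must name the HDF5 group that holds the data, indices, indptr, and shape datasets, as listed by h5ls() above.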
Hope this helps, H.
BTW it's generally better to ask for this kind of help on the Bioconductor support site: https://support.bioconductor.org/
Thanks, H.
Thank you, this is super helpful and I really appreciate it. I will close for now until/unless I still have trouble after rewriting more idiomatically.
I want to load a file as the backend for a DelayedArray:
test.h5.gz
Trying this command:
... this error shows. Can you help me figure out where the issue is? Some more info lies below. Thanks!
It reads successfully using rhdf5.
I created it mostly using HDF5Array::write_block. Here is the function I used. Here's my session info.