LTLA closed this 1 year ago
Sure. It makes sense to support the possibility to share the shape, indices, and indptr across multiple sparse matrices. IIUC you or your users are going to create these files so you'll be able to control where to put the data. How about sticking to the data group for that like in your example above? This is a clean layout and it avoids clashes with reserved names like indices or indptr.
The interface of H5SparseMatrixSeed()/H5SparseMatrix() could be something like:
H5SparseMatrixSeed(filepath, group, subdata=NULL)
where subdata is the name of a dataset in the data group (e.g. "ref").
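To make the proposal concrete, here is a minimal base R sketch of the idea: one shared shape/indices/indptr triplet and several value vectors under a data group, with a name selecting which values to expand. It uses an in-memory list as a stand-in for the file and assumes a column-compressed layout with 0-based indices; the expand_subdata() helper and the "ref"/"alt" values are hypothetical and are not part of HDF5Array.

```r
# Hypothetical in-memory stand-in for the proposed file layout: one shared
# shape/indices/indptr triplet plus several value vectors under "data".
# All names and values below are illustrative only.
layout <- list(
    shape   = c(4L, 3L),               # 4 SNPs (rows) x 3 patients (columns)
    indptr  = c(0L, 2L, 3L, 5L),       # 0-based column pointers (assumed CSC)
    indices = c(0L, 2L, 1L, 0L, 3L),   # 0-based row index of each stored entry
    data    = list(
        ref = c("A", "C", "G", "T", "A"),
        alt = c("G", "T", "A", "C", "C")
    )
)

# Expand one value vector into a dense matrix using the shared structure.
# This mimics what a 'subdata' argument might select; it is not HDF5Array code.
expand_subdata <- function(layout, subdata) {
    values <- layout$data[[subdata]]
    nr <- layout$shape[1]
    nc <- layout$shape[2]
    # Start from a matrix of NAs of the same type as 'values'.
    out <- matrix(values[NA_integer_], nrow = nr, ncol = nc)
    # Column of each stored entry, recovered from the per-column counts.
    col <- rep(seq_len(nc), times = diff(layout$indptr))
    out[cbind(layout$indices + 1L, col)] <- values
    out
}

ref_mat <- expand_subdata(layout, "ref")
alt_mat <- expand_subdata(layout, "alt")
```

Both matrices reuse the same indices and indptr; only the value vector differs, which is where the storage saving comes from.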
Thoughts?
I've committed this: commit 21b81ecc00042632817e7784c106b146ecd37f69
Still experimental and not documented yet. If that works for you, and once I can get my hands on such h5 files, I'll finish by adding the subdata argument to the H5SparseMatrix() constructor, along with documentation and unit tests.
Accommodation of genetic variants is interesting. Are the variants produced by the single-cell sequencing technology? Should we consider extending VariantExperiment for this use case?
Oops, getting back to this.
I ended up being overruled on this particular use case. Despite the storage savings and (relatively) efficient access offered by storing things in a sparse HDF5 array, it seems that people just can't give up their VCF files.
A shame, but oh well. Maybe someone else will find it useful.
I plan to store a dataframe of variant calls (as an example) in an HDF5 file. This is achieved by sticking the columns into a group and then creating 1D arrays with the various fields, e.g., the contents of the HDF5 file would look like:
and so on. However, one realizes that it is also possible to represent this data as a series of sparse matrices in an SE where the rows are SNPs and the columns are patients. This is achieved by sorting the DF by the sample of origin and then SNP, and then creating an /indices and /indptr corresponding to the available sample/SNP combinations.
This layout is appealing as it gives us three interchangeable methods of representing the data:
h5read(file, "variants/data") will return a named list that is trivially converted to a DF, though we are also exploring the use of HDF5-backed columns that rely on the above format.
[...] /indices matrix to expedite construction (this is achieved by directly passing a sparse matrix of consecutive integers to proxy=).
[...] indices and indptr if they were treated as separate assays inside the file.
The last point seems like it is almost achievable with H5SparseMatrixSeed, but AFAICT the current implementation expects data to be a dataset. Hence the request: can we generalize the current class so that it can be told to look for the data values in other locations with arbitrary names, e.g., /data/ref?
This is based on the request at the end of #40, though modified after some reflection on the arbitrariness of the input names, given that the incoming DF may well have column names that clobber any of the protected fields. I know it's not an H5AD/10X file anymore, but the H5SparseMatrix name sounds general enough that it still seems appropriate for this application.
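For readers unfamiliar with the compressed sparse layout, the sorting step described above (by sample of origin, then by SNP) can be sketched in base R as follows. The data frame, its column names, and its values are made up for illustration, and the sketch assumes 0-based indices with one column per sample; it does not write anything to an HDF5 file.

```r
# Hypothetical variant calls; all column names and values are made up.
calls <- data.frame(
    sample = c("P2", "P1", "P3", "P1", "P3"),
    snp    = c("rs2", "rs3", "rs1", "rs1", "rs4"),
    ref    = c("G", "C", "T", "A", "A"),
    alt    = c("A", "T", "C", "G", "C"),
    stringsAsFactors = FALSE
)

samples <- sort(unique(calls$sample))   # columns of the sparse matrices
snps    <- sort(unique(calls$snp))      # rows of the sparse matrices

col <- match(calls$sample, samples)     # 1-based column of each call
row <- match(calls$snp, snps)           # 1-based row of each call

# Sort by sample of origin, then by SNP within each sample.
o <- order(col, row)
calls <- calls[o, ]

# Shared sparse structure: 0-based row indices and column pointers.
indices <- row[o] - 1L
indptr  <- c(0L, cumsum(tabulate(col, nbins = length(samples))))
shape   <- c(length(snps), length(samples))
```

After sorting, every remaining column of the data frame (here ref and alt) is already in the order expected by indices and indptr, so each one can serve as the value vector of its own sparse matrix.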