Generalizing the H5SparseMatrixSeed class to support other data locations

LTLA commented 3 years ago

I plan to store a dataframe of variant calls (as an example) in an HDF5 file. This is achieved by sticking the columns into a group and then creating 1D arrays for the various fields, e.g., the contents of the HDF5 file would look like:

/variants
  /data
    /ref
    /alt
    /cov

and so on. However, it is also possible to represent this data as a series of sparse matrices in an SE where the rows are SNPs and the columns are patients. This is achieved by sorting the DF by the sample of origin and then by SNP, and then creating /indices and /indptr datasets corresponding to the available sample/SNP combinations.

/variants
  /indices
  /indptr
  /data
    /ref
    /alt
    /cov
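
As a rough illustration of the sorting step described above, here is a minimal sketch in plain Python (no HDF5 involved). The records, the 0-based indexing, and the names `records`, `n_samples`, `indices`, `indptr` and `data` are all assumptions made for the example; it only shows how a long-format table maps onto one shared compressed-column index plus one 1D array per field.

```python
# Hypothetical long-format variant calls: (sample, snp, ref, alt, cov).
# Samples are columns and SNPs are rows; both are 0-based here by assumption.
records = [
    (1, 2, "A", "G", 30),
    (0, 3, "C", "T", 12),
    (0, 1, "G", "A", 25),
    (2, 0, "T", "C", 8),
]
n_samples = 3

# Sort by sample of origin first, then by SNP within each sample.
records.sort(key=lambda r: (r[0], r[1]))

# Row (SNP) index of every stored entry, in sorted order.
indices = [snp for _, snp, _, _, _ in records]

# indptr[j]:indptr[j + 1] delimits the entries that belong to sample j.
indptr = [0] * (n_samples + 1)
for sample, *_ in records:
    indptr[sample + 1] += 1          # count entries per sample
for j in range(n_samples):
    indptr[j + 1] += indptr[j]       # cumulative sum turns counts into offsets

# Each field becomes one 1D array aligned with `indices`.
data = {
    "ref": [r[2] for r in records],
    "alt": [r[3] for r in records],
    "cov": [r[4] for r in records],
}
```

Every array under `data` has the same length and ordering as `indices`, which is what allows a single index to be reused for all fields.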

This layout is appealing as it gives us three interchangeable methods of representing the data:

As the usual DF. I believe h5read(file, "variants/data") will return a named list that is trivially converted to a DF, though we are also exploring the use of HDF5-backed columns that rely on the above format.
As a BumpyMatrix, based on the previous DF but using the indexing information in /indices and /indptr to expedite construction (this is achieved by directly passing a sparse matrix of consecutive integers to proxy=).
As a series of (possibly file-backed) sparse matrices. This has the side-benefit of avoiding the need to repeatedly store multiple copies of indices and indptr if they were treated as separate assays inside the file.
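
To make the sharing concrete, here is a small plain-Python sketch. The toy values and the helper `column()` are hypothetical and are not part of BumpyMatrix or HDF5Array. It reads one column out of two different fields using the same `indices`/`indptr`, and builds a "proxy" whose stored values are consecutive integers pointing back at DF rows (0-based here, whereas R would use 1-based positions).

```python
# Toy shared index for a 4-SNP x 3-sample matrix (assumed values).
indices = [1, 3, 2, 0]
indptr = [0, 2, 3, 4]

# Several value arrays reuse the same index; nothing is duplicated per field.
data = {
    "ref": ["G", "C", "A", "T"],
    "cov": [25, 12, 30, 8],
}

def column(values, indices, indptr, j):
    """Return {row: value} for the stored entries of column j."""
    start, end = indptr[j], indptr[j + 1]
    return dict(zip(indices[start:end], values[start:end]))

# Same index, different fields.
cov_col0 = column(data["cov"], indices, indptr, 0)
ref_col0 = column(data["ref"], indices, indptr, 0)

# Proxy of consecutive integers: each stored entry is the position of
# the corresponding row in the sorted DF.
proxy = list(range(len(indices)))
proxy_col0 = column(proxy, indices, indptr, 0)
```

Looking up `proxy_col0[row]` gives the DF row holding every field for that SNP/sample pair, which is the idea behind passing such a matrix as the proxy.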

The last point seems like it is almost achievable with H5SparseMatrixSeed, but AFAICT the current implementation expects data to be a dataset. Hence the request: can we generalize the current class so that it can be told to look for the data values in other locations with arbitrary names, e.g., /data/ref?

This is based on the request at the end of #40, though modified after some reflection on the arbitrariness of the input names, given that the incoming DF may well have column names that clobber any of the protected fields. I know it's not an H5AD/10X file anymore, but the H5SparseMatrix name sounds general enough that it still seems appropriate for this application.

hpages commented 3 years ago

Sure. It makes sense to support sharing the shape, indices, and indptr across multiple sparse matrices. IIUC you or your users are going to create these files, so you'll be able to control where to put the data. How about sticking to the data group for that, like in your example above? This is a clean layout and it avoids clashes with reserved names like indices or indptr.

The interface of H5SparseMatrixSeed()/H5SparseMatrix() could be something like:

H5SparseMatrixSeed(filepath, group, subdata=NULL)

where subdata is the name of a dataset in the data group (e.g. "ref").
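
The lookup rule implied by this interface could be sketched as follows. This is plain Python for illustration only: `resolve_data_path()` is a hypothetical helper and not HDF5Array's implementation. The assumption is that the values live at `<group>/data` when `subdata` is not given, and at `<group>/data/<subdata>` otherwise.

```python
def resolve_data_path(group, subdata=None):
    """Return the assumed HDF5 path of the nonzero values for a sparse matrix.

    group   -- path of the group holding indices, indptr and data
    subdata -- optional name of a dataset inside the data group
    """
    base = group.rstrip("/") + "/data"
    if subdata is None:
        return base                  # classic layout: data is a dataset
    return base + "/" + subdata      # proposed layout: data is a group


default_path = resolve_data_path("/variants")
ref_path = resolve_data_path("/variants", subdata="ref")
```

Because user-supplied column names only ever appear below `data`, they cannot collide with `indices` or `indptr` at the group level.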

Thoughts?

hpages commented 3 years ago

I've committed this: commit 21b81ecc00042632817e7784c106b146ecd37f69

Still experimental and not documented yet. If that works for you, and once I can get my hands on such h5 files, I'll finish by adding the subdata argument to the H5SparseMatrix() constructor, along with documentation + unit tests.

vjcitn commented 3 years ago

Accommodation of genetic variants is interesting. Are the variants produced by the single-cell sequencing technology? Should we consider extension of VariantExperiment for this use case?

LTLA commented 1 year ago

Oops, getting back to this.

I ended up being overruled on this particular use case. Despite the storage savings and (relatively) efficient access of storing things in a sparse HDF5 array, it seems that people just can't give up their VCF files.

A shame, but oh well. Maybe someone else will find it useful.

Bioconductor / HDF5Array, issue #42