Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array

Open up API for loadHDF5SummarizedExperiment #11

Closed LiNk-NY closed 5 years ago

LiNk-NY commented 5 years ago

Hi Hervé @hpages

Could you please open up the API for loadHDF5SummarizedExperiment?

Currently, only the dir argument is available, but it would be good to have an option to provide the component datasets, such as assays.h5 and se.rds, under different file names.

I'm working with a number of these datasets and I have to rename them so that they are unique. It would be easier to just feed these names into the function.

I could work on a PR for this. Thanks!

Regards, Marcel

hpages commented 5 years ago

Hi Marcel,

Just to clarify: saveHDF5SummarizedExperiment() stores an HDF5-based SummarizedExperiment as a bundle of stuff that goes into its own folder. The resulting folder is standalone and can be turned into a tarball that can be conveniently passed along between collaborators. This is a little bit like what happens when you save a web page on your local disk. The details of what stuff exactly goes in the bundle should be considered internal business and could change in the future. If you saved several HDF5-based SummarizedExperiment objects with saveHDF5SummarizedExperiment(), you should end up with one folder per object. You shouldn't need to rename anything inside those folders.
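The "1 object = 1 folder" workflow described above can be sketched as follows (a minimal example with a toy object; the directory names are arbitrary):

```r
library(SummarizedExperiment)
library(HDF5Array)

# Toy SummarizedExperiment with a single assay
se <- SummarizedExperiment(assays = list(counts = matrix(1:12, nrow = 4)))

# Each object gets its own standalone folder; the folder's internal
# layout (assays.h5, se.rds, ...) is considered an implementation detail
saveHDF5SummarizedExperiment(se, dir = "se1", replace = TRUE)

# Load it back by pointing at the folder, not at individual files
se1 <- loadHDF5SummarizedExperiment("se1")
```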

Isn't the "1 object = 1 folder" scheme working for you?

H.

LiNk-NY commented 5 years ago

The folder / tarball concept isn't very compatible with ExperimentHub S3 AFAICT. In order to keep files from different objects from mixing (all files go into one folder on S3), I had to rename them. I can write a wrapper to rename my files back to the standard "assays.h5" and "se.rds" names, but it would be good to have more loading options.

Thanks!

mtmorgan commented 5 years ago

Or store the tarball on ExperimentHub, which I think is how Hervé envisions these being portable... So hub[["EH123"]] would download and cache the tarball (first time), then untar it (to tempdir) and import it on subsequent access.

Or, as TENxBrainData does, store the HDF5 file and the row / colData as separate ExperimentHub resources, then recreate the object with SummarizedExperiment(), supplying the assay data as an HDF5Array() backed by the file.
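That reconstruction pattern might look like this (a sketch only; the file path, dataset name, and colData resource are hypothetical placeholders for whatever the hub serves):

```r
library(SummarizedExperiment)
library(HDF5Array)

# Hypothetical paths: in practice these would come from ExperimentHub resources
counts <- HDF5Array("assays.h5", name = "counts")  # on-disk, delayed assay
cd <- readRDS("coldata.rds")                       # separately stored colData

# Rebuild the SummarizedExperiment in the user session; no S4 serialization
# of the full object is needed
se <- SummarizedExperiment(assays = list(counts = counts), colData = cd)
```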

hpages commented 5 years ago

Putting only the assays in an HDF5 file on ExperimentHub and reconstructing the SummarizedExperiment object in the user session is definitely the preferred way to go when there is no metadata to "remember". That way you don't even need to serialize S4 things. But I guess you're in a situation where you have some important row / colData that you also need to save somewhere.

Orthogonal to Martin's suggestions, I could add a prefix argument to saveHDF5SummarizedExperiment()/loadHDF5SummarizedExperiment(). That way you could save several bundles in the same folder while still remaining agnostic about the exact composition of the bundles. Would that help?

hpages commented 5 years ago

The prefix argument was added in HDF5Array 1.11.7: https://github.com/Bioconductor/HDF5Array/commit/5f34372a415a9190a6379d7619f34d53ea71769f
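With that change, several bundles can share one folder while the bundle composition stays internal. A minimal sketch (se_a and se_b stand for any two SummarizedExperiment objects; the directory and prefixes are examples):

```r
library(HDF5Array)

# Save two objects into the same directory, disambiguated by prefix
saveHDF5SummarizedExperiment(se_a, dir = "hub_dir", prefix = "a_", replace = TRUE)
saveHDF5SummarizedExperiment(se_b, dir = "hub_dir", prefix = "b_", replace = TRUE)

# Load each one back by the same prefix
se_a2 <- loadHDF5SummarizedExperiment("hub_dir", prefix = "a_")
se_b2 <- loadHDF5SummarizedExperiment("hub_dir", prefix = "b_")
```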

LiNk-NY commented 5 years ago

Thanks Hervé