Ideas for improvement to saveHDF5SummarizedExperiment()

PeteHaitch commented 6 years ago

Following up on our chat at BioC2018 (ping @kasperdanielhansen)

It'd be great if saveHDF5SummarizedExperiment() didn't re-write the HDF5 file when it's not necessary.

My use case is when writing a function, call it makeHDF5SE(), that creates an HDF5-backed SummarizedExperiment. If makeHDF5SE() satisfies some minimal contract (e.g., all assay data are written to a file called assays.h5) then calling saveHDF5SummarizedExperiment() as the last step of makeHDF5SE() should just create the se.rds file in the same directory. Then, the saved se.rds and assays.h5 should then be compatible with loadHDF5SummarizedExperiment() when run from a new session.

A concrete example where I'd like to use this is to replace https://github.com/hansenlab/bsseq/blob/65e163c3dacce307a88ba21401a91b7ed1aba876/R/read.bismark.R#L543-L549 by saveHDF5SummarizedExperiment(bsseq, dir = dir, replace = replace) (note how I've mimicked the other arguments of saveHDF5SummarizedExperiment() further up in the function https://github.com/hansenlab/bsseq/blob/65e163c3dacce307a88ba21401a91b7ed1aba876/R/read.bismark.R#L362-L365 and that these are used elsewhere inside read.bismark()).

PeteHaitch commented 6 years ago

Perhaps it can be smarter still. Say, you've loaded an object with loadHDF5SummarizedEzperiment(), modified the coldData and now want to save the updated object. It seems desirable to only 'update' se.rds and not have to re-save assays.h5.

hpages commented 5 years ago

@PeteHaitch

Hi Pete,

I finally tackled this. My approach was to introduce a new function for this. See ?quickResaveHDF5SummarizedExperiment in HDF5Array 1.11.8. Here is what it says:

quickResaveHDF5SummarizedExperiment preserves the HDF5 file and datasets that the assays in x are already pointing to (and which were created by an earlier call to saveHDF5SummarizedExperiment). All it does is re-serialize x on top of the .rds file that is associated with this HDF5 file (and which was created by an earlier call to saveHDF5SummarizedExperiment or quickResaveHDF5SummarizedExperiment). Because the delayed operations possibly carried by the assays in x are not realized, this is very fast.

It covers many use cases e.g. it can be used after modifying the rowData, colData, or dimnames, or after adding a rowRanges component, or performing delayed operations on the assays. See the last example in the man page. Hopefully this covers the use cases you had in mind.

H.

hpages commented 5 years ago

@PeteHaitch I'm closing this. Feel free to reopen if you feel the use cases you had in mind are not covered by quickResaveHDF5SummarizedExperiment().

kasperdanielhansen commented 5 years ago

Question now that I look at it again. Why is this a new function and not just something that works with saveXX. Perhaps with an argument forceSave to force it to overwrite?

On Wed, Jan 30, 2019 at 7:22 PM hpages notifications@github.com wrote:

@PeteHaitch https://github.com/PeteHaitch I'm closing this. Feel free to reopen if you feel the use cases you had in mind are not covered by quickResaveHDF5SummarizedExperiment().

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Bioconductor/HDF5Array/issues/10#issuecomment-459165087, or mute the thread https://github.com/notifications/unsubscribe-auth/AEuhnxn7cmSrGhjCaF85UMihqt2YFyP3ks5vIjdFgaJpZM4V_F6e .

hpages commented 5 years ago

I initially considered that but after realizing that the 2 functions do a quite different job, share very little code, and none of all the extra arguments supported by saveHDF5SummarizedExperiment() was making sense for quickResaveHDF5SummarizedExperiment(), I found it cleaner to separate the 2 things.

Bioconductor / HDF5Array

Ideas for improvement to saveHDF5SummarizedExperiment() #10