Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array

reset_h5_location operation? #31

Closed vjcitn closed 3 years ago

vjcitn commented 3 years ago

We have run into certain situations where a serialized HDF5Array instance has a file path that is inconsistent with the physical location of the .h5 file. Do you have a user-level operation that can reset the path in an HDF5Array instance?

hpages commented 3 years ago

Creating a new HDF5Array instance by calling HDF5Array() on the correct path will have the same effect as trying to fix the broken HDF5Array object.
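A minimal sketch of that suggestion, assuming a 2-D dataset stored under the hypothetical name "counts" and using temporary files as stand-ins for the real locations. It simulates a moved .h5 file and builds a fresh object from the new path instead of repairing the old one:

```r
library(HDF5Array)

# Write a small dataset to a temporary .h5 file.
# The paths and the dataset name "counts" are made up for illustration.
old_path <- tempfile(fileext = ".h5")
a <- writeHDF5Array(matrix(1:6, nrow = 2), filepath = old_path, name = "counts")

# Simulate the file being moved after the object was created or serialized.
new_path <- tempfile(fileext = ".h5")
file.rename(old_path, new_path)

# Build a fresh object from the file's actual location.
b <- HDF5Array(new_path, name = "counts")
path(b)  # reports the path the new object points to
```

The old object `a` still carries the stale path; `b` is a new, independent object that reads from the moved file.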

Serializing an HDF5Array object has almost zero value so makes little sense in general. I guess it's in the context of serializing a bigger object that contains an HDF5Array? I wish there was a way to prevent people from doing this. If my "pre serialization & post unserialization hooks" proposal ever makes it to base R, we'll finally have a way to prevent this.

vjcitn commented 3 years ago

Good to know, thanks!

LTLA commented 3 years ago

I serialize HDF5Arrays quite frequently as part of the caching system for building the book. It works quite well as the ExperimentHub cache location is preserved across R sessions. We also serialize HDF5Arrays on our cluster where there is a common storage location for the relevant HDF5 files, which allows different users to pass around light serialized objects that can always re-establish the correct connection to the corresponding HDF5 backend.

hpages commented 3 years ago

Sure, you can always do it and it works if the h5 file is still accessible at unserialization time and the class internals have not changed. But you're basically serializing a useless and somewhat complex S4 shell around a filepath so you're just introducing possibilities for things to go wrong. So it's still a bad idea in general. Better to recreate the HDF5Array object from the filepath each time if you can choose. The cost of creating an HDF5Array object from an existing h5 file is nothing.
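One way to read "recreate the HDF5Array object from the filepath each time" is to persist only the file path and dataset name, then rebuild the object on load. The following is a sketch under that assumption; the `ref` list and the dataset name "counts" are hypothetical and not part of the package:

```r
library(HDF5Array)

# Hypothetical on-disk dataset used for illustration.
h5_path <- tempfile(fileext = ".h5")
a <- writeHDF5Array(matrix(1:6, nrow = 2), filepath = h5_path, name = "counts")

# Persist only what is needed to rebuild the object, not the S4 object itself.
ref <- list(filepath = path(a), name = "counts")
rds <- tempfile(fileext = ".rds")
saveRDS(ref, rds)

# Later, possibly in another session: rebuild from the stored reference.
ref2 <- readRDS(rds)
b <- HDF5Array(ref2$filepath, name = ref2$name)
```

This only covers a pure HDF5Array; any delayed operations on the object would not be captured by such a reference.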

LTLA commented 3 years ago

I also have delayed operations on those objects, so I'd like to hold on to those.

hpages commented 3 years ago

Then you don't have an HDF5Array instance so we're talking about completely different things.

LTLA commented 3 years ago

Perhaps. In practice, HDF5Arrays rarely survive real-world analyses in their pure form and are usually converted to DelayedArrays on first contact. For example, just pulling things out of an SE will slap on dimnames.
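A small sketch of that behavior, using a throwaway dataset and made-up dimnames. Setting dimnames adds a delayed operation, so the object is no longer an HDF5Array in the strict sense, although it still points to the same file:

```r
library(HDF5Array)

# Throwaway on-disk matrix; writeHDF5Array() picks a temporary file here.
a <- writeHDF5Array(matrix(1:6, nrow = 2))
was_hdf5 <- is(a, "HDF5Array")      # pure HDF5-backed object at this point

# Adding dimnames is recorded as a delayed operation on top of the seed.
dimnames(a) <- list(c("g1", "g2"), c("s1", "s2", "s3"))
still_hdf5 <- is(a, "HDF5Array")    # no longer an HDF5Array
is_delayed <- is(a, "DelayedArray") # but still a DelayedArray

path(a)  # the underlying .h5 file path is still reachable
```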

My point was that serializing an object with an embedded file path is often a reasonable thing to do. So even if base R comes through with some serialization hooks, it would be a major constraint on the utility of HDF5Arrays if we were explicitly blocked from serializing them.

hpages commented 3 years ago

I stick to my point that serializing an HDF5Array object, a.k.a. a DelayedArray object pointing to an h5 file and without delayed operations on it (otherwise it's not an HDF5Array), doesn't make much sense in general. That's all that was being discussed here.

Serializing DelayedArray objects that carry delayed operations is a different story and I never said it was a bad idea.

The idea of using serialization hooks here would be to warn, not to block.

hpages commented 3 years ago

An analog of serializing an HDF5Array object is serializing a TxDb object. AFAIK nobody does that, for a reason.

LTLA commented 3 years ago

I daresay that no one is going out of their way to serialize HDF5Array objects, but it is happening nonetheless. I am going to guess that at least one of the situations alluded to by @vjcitn involved our attempt to build the OSCA book, which took some effort to correctly set up the ExperimentHub cache for the HDF5 files. In that case, the HDF5Array (though by that point, it was almost certainly a DelayedArray) was being serialized as part of knitr's chunk-wise caching scheme; this was a deliberate choice, as it allows different chapters of the book to use objects from another chapter's workspace without repeating all the calculations. However, incorrect ExperimentHub configurations meant that the location of the EHub cache (and thus the HDF5 file path) changed across chapters, leading to various errors.

hpages commented 3 years ago

Yes, as I said earlier, there may be some specific situations where HDF5Array objects get serialized behind the scenes. But again: better to recreate the HDF5Array object from the filepath each time if you can choose. That's all folks!

How do we close an issue that's already closed?

hpages commented 3 years ago

Wait! I know.