Closed vjcitn closed 3 years ago
Creating a new HDF5Array instance by calling HDF5Array() on the correct path will have the same effect as trying to fix the broken HDF5Array object.
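For illustration, here is a minimal sketch of that workflow. The temporary file paths, the dataset name "counts", and the toy matrix are made up for the example; only writeHDF5Array(), HDF5Array(), and path() come from the HDF5Array/DelayedArray packages.

```r
library(HDF5Array)

# Write a small matrix to an HDF5 file (path and dataset name are illustrative).
old_path <- tempfile(fileext = ".h5")
m <- matrix(1:12, nrow = 3)
A <- writeHDF5Array(m, filepath = old_path, name = "counts")

# Serialize the object, then move the file so the stored path goes stale.
rds <- tempfile(fileext = ".rds")
saveRDS(A, rds)
new_path <- tempfile(fileext = ".h5")
moved <- file.rename(old_path, new_path)

# The unserialized object still carries the old location.
stale <- readRDS(rds)
file.exists(path(stale))

# Rather than repairing 'stale', recreate the object from the correct path.
fresh <- HDF5Array(new_path, name = "counts")
all(as.matrix(fresh) == m)
```

Recreating the object only requires knowing the file path and the dataset name, both of which are plain strings that are easy to store alongside (or instead of) the serialized object.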
Serializing an HDF5Array object has almost zero value so makes little sense in general. I guess it's in the context of serializing a bigger object that contains an HDF5Array? I wish there was a way to prevent people from doing this. If my "pre serialization & post unserialization hooks" proposal ever makes it to base R, we'll finally have a way to prevent this.
Good to know, thanks!
I serialize HDF5Arrays quite frequently as part of the caching system for building the book. It works quite well as the ExperimentHub cache location is preserved across R sessions. We also serialize HDF5Arrays on our cluster where there is a common storage location for the relevant HDF5 files, which allows different users to pass around light serialized objects that can always re-establish the correct connection to the corresponding HDF5 backend.
Sure, you can always do it and it works if the h5 file is still accessible at unserialization time and the class internals have not changed. But you're basically serializing a useless and somewhat complex S4 shell around a filepath so you're just introducing possibilities for things to go wrong. So it's still a bad idea in general. Better to recreate the HDF5Array object from the filepath each time if you can choose. The cost of creating an HDF5Array object from an existing h5 file is nothing.
I also have delayed operations on those objects, so I'd like to hold on to those.
Then you don't have an HDF5Array instance so we're talking about completely different things.
Perhaps. In practice, HDF5Arrays rarely survive real-world analyses in their pure form and are usually converted to DelayedArrays on first contact. For example, just pulling things out of an SE will slap on dimnames.
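A small sketch of that downgrade, assuming a toy matrix and a temporary file (both hypothetical): setting dimnames is itself a delayed operation, so the result is expected to be a plain DelayedMatrix that is still backed by the same h5 file.

```r
library(HDF5Array)

# Toy on-disk matrix (file path and dataset name are illustrative).
h5 <- tempfile(fileext = ".h5")
A <- writeHDF5Array(matrix(runif(20), nrow = 4), filepath = h5, name = "x")
is(A, "HDF5Array")      # pristine object, no delayed operations

# Adding dimnames is recorded as a delayed operation on top of the seed.
B <- A
dimnames(B) <- list(paste0("gene", 1:4), NULL)
is(B, "HDF5Array")      # no longer an HDF5Array
is(B, "DelayedArray")   # but still a DelayedArray
path(B)                 # and still pointing at the same h5 file
```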
My point was the idea of serializing an object with an embedded file path is often a reasonable thing to do; so even if base R comes through with some serialization hooks, it would be a major constraint on the utility of HDF5Arrays if we were explicitly blocked from serializing them.
I stick to my point that serializing an HDF5Array object, a.k.a. a DelayedArray object pointing to an h5 file with no delayed operations on it (otherwise it's not an HDF5Array), doesn't make much sense in general. That's all that was being discussed here.
Serializing DelayedArray objects that carry delayed operations is a different story and I never said it was a bad idea.
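For the delayed-operations case, a hedged sketch of re-pointing an unserialized object at a relocated file while keeping its delayed operations. This assumes the path() setter is available for DelayedArray objects that carry a single seed; the file paths, dataset name, and the multiplication are made up for the example.

```r
library(HDF5Array)

# Toy on-disk matrix with one delayed operation on top of it.
old_path <- tempfile(fileext = ".h5")
m <- matrix(1:12, nrow = 3)
A <- writeHDF5Array(m, filepath = old_path, name = "counts")
D <- A * 2L                       # delayed; nothing is computed yet

# Serialize, then move the h5 file so the embedded path goes stale.
rds <- tempfile(fileext = ".rds")
saveRDS(D, rds)
new_path <- tempfile(fileext = ".h5")
moved <- file.rename(old_path, new_path)

# Unserialize and reset the path (assumption: the setter works on
# single-seed DelayedArray objects). The delayed operation is preserved.
D2 <- readRDS(rds)
path(D2) <- new_path
all(as.matrix(D2) == m * 2L)
```

If the object has several seeds (e.g. the result of combining two on-disk arrays), a single path replacement is ambiguous, so recomputing from the original files is probably the safer route there.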
The idea of using serialization hooks here would be to warn, not to block.
An analog of serializing an HDF5Array object is serializing a TxDb object. AFAIK nobody does that, for a reason.
I daresay that no one is going out of their way to serialize HDF5Array objects, but it is happening nonetheless. I am going to guess that at least one of the situations alluded to by @vjcitn involved our attempt to build the OSCA book, which took some effort to correctly set up the ExperimentHub cache for the HDF5 files. In that case, the HDF5Array (though by that point, it was almost certainly a DelayedArray) was being serialized as part of knitr's chunk-wise caching scheme; this was a deliberate choice as it allows different chapters of the book to use objects from another chapter's workspace without repeating all the calculations. However, incorrect ExperimentHub configurations meant that the location of the EHub cache (and thus the HDF5 file path) changed across chapters, leading to various errors.
Yes, as I said earlier, there are maybe some specific situations where HDF5Array objects get serialized behind the scenes. But again: better to recreate the HDF5Array object from the filepath each time if you can choose. That's all folks!
How do we close an issue that's already closed?
Wait! I know.
We have run into certain situations where a serialized HDF5Array instance has a file path that is inconsistent with the physical location of the .h5 file. Do you have a user-level operation that can reset the path in an HDF5Array instance?